241 changes: 241 additions & 0 deletions docs/transcript.md
@@ -0,0 +1,241 @@
# Transcript Formats

The Reflector API provides multiple output formats for transcript data through the `transcript_format` query parameter on the GET `/v1/transcripts/{id}` endpoint.

## Overview

When retrieving a transcript, you can specify the desired format using the `transcript_format` query parameter. The API supports four formats optimized for different use cases:

- **text** - Plain text with speaker names (default)
- **text-timestamped** - Timestamped text with speaker names
- **webvtt-named** - WebVTT subtitle format with participant names
- **json** - Structured JSON segments with full metadata

All formats include participant information when available, resolving speaker IDs to actual names.

## Query Parameter Usage

```
GET /v1/transcripts/{id}?transcript_format={format}
```

### Parameters

- `transcript_format` (optional): The desired output format
  - Type: `"text" | "text-timestamped" | "webvtt-named" | "json"`
  - Default: `"text"`
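
As an illustration of the parameter above, here is a minimal client-side sketch that builds the request URL and rejects unknown formats before any call is made. The helper name `transcript_url` and the host `https://reflector.example` are assumptions made for this example; only the path and the four format values come from this document.

```python
from urllib.parse import quote, urlencode

# The four values documented for `transcript_format`.
VALID_FORMATS = ("text", "text-timestamped", "webvtt-named", "json")


def transcript_url(base_url: str, transcript_id: str, fmt: str = "text") -> str:
    """Build the GET URL for a transcript in the requested format (hypothetical helper)."""
    if fmt not in VALID_FORMATS:
        raise ValueError(f"unsupported transcript_format: {fmt!r}")
    query = urlencode({"transcript_format": fmt})
    path = f"/v1/transcripts/{quote(transcript_id, safe='')}"
    return f"{base_url.rstrip('/')}{path}?{query}"


# The host below is a placeholder, not a real deployment.
print(transcript_url("https://reflector.example", "transcript_123", "webvtt-named"))
```

Validating on the client is optional; it only avoids a round trip for a value the server would not accept.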

## Format Descriptions

### Text Format (`text`)

**Use case:** Simple, human-readable transcript for display or export.

**Format:** Speaker names followed by their dialogue, one line per segment.

**Example:**
```
John Smith: Hello everyone
Jane Doe: Hi there
John Smith: How are you today?
```

**Request:**
```bash
GET /v1/transcripts/{id}?transcript_format=text
```

**Response:**
```json
{
  "id": "transcript_123",
  "name": "Meeting Recording",
  "transcript_format": "text",
  "transcript": "John Smith: Hello everyone\nJane Doe: Hi there\nJohn Smith: How are you today?",
  "participants": [
    {"id": "p1", "speaker": 0, "name": "John Smith"},
    {"id": "p2", "speaker": 1, "name": "Jane Doe"}
  ],
  ...
}
```

### Text Timestamped Format (`text-timestamped`)

**Use case:** Transcript with timing information for navigation or reference.

**Format:** `[MM:SS]` timestamp prefix before each speaker and dialogue.

**Example:**
```
[00:00] John Smith: Hello everyone
[00:05] Jane Doe: Hi there
[00:12] John Smith: How are you today?
```

**Request:**
```bash
GET /v1/transcripts/{id}?transcript_format=text-timestamped
```

**Response:**
```json
{
  "id": "transcript_123",
  "name": "Meeting Recording",
  "transcript_format": "text-timestamped",
  "transcript": "[00:00] John Smith: Hello everyone\n[00:05] Jane Doe: Hi there\n[00:12] John Smith: How are you today?",
  "participants": [
    {"id": "p1", "speaker": 0, "name": "John Smith"},
    {"id": "p2", "speaker": 1, "name": "Jane Doe"}
  ],
  ...
}
```
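
A consumer that wants the timing back out of this format could parse each line as sketched below. This is a hedged example, not part of the API: the function name `parse_timestamped` is hypothetical, and the pattern assumes the `[MM:SS] Name: text` shape shown above, splitting at the first `": "`, so a speaker name that itself contains `": "` would be cut short.

```python
import re

# Matches lines such as "[00:05] Jane Doe: Hi there".
LINE_RE = re.compile(r"^\[(\d+):(\d{2})\] (.*?): (.*)$")


def parse_timestamped(transcript: str) -> list[tuple[int, str, str]]:
    """Return (start_seconds, speaker_name, text) for each well-formed line."""
    rows = []
    for line in transcript.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything that does not follow the documented shape
        minutes, seconds, speaker, text = match.groups()
        rows.append((int(minutes) * 60 + int(seconds), speaker, text))
    return rows


sample = "[00:00] John Smith: Hello everyone\n[00:05] Jane Doe: Hi there"
print(parse_timestamped(sample))
```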

### WebVTT Named Format (`webvtt-named`)

**Use case:** Subtitle files for video players, accessibility tools, or video editing.

**Format:** Standard WebVTT subtitle format with voice tags using participant names.

**Example:**
```
WEBVTT

00:00:00.000 --> 00:00:05.000
<v John Smith>Hello everyone

00:00:05.000 --> 00:00:12.000
<v Jane Doe>Hi there

00:00:12.000 --> 00:00:18.000
<v John Smith>How are you today?
```

Contributor

example of multiple speakers at the same time could be beneficial here

**Request:**
```bash
GET /v1/transcripts/{id}?transcript_format=webvtt-named
```

**Response:**
```json
{
  "id": "transcript_123",
  "name": "Meeting Recording",
  "transcript_format": "webvtt-named",
  "transcript": "WEBVTT\n\n00:00:00.000 --> 00:00:05.000\n<v John Smith>Hello everyone\n\n...",
  "participants": [
    {"id": "p1", "speaker": 0, "name": "John Smith"},
    {"id": "p2", "speaker": 1, "name": "Jane Doe"}
  ],
  ...
}
```
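
To show how the voice tags can be read back, the sketch below extracts speaker and text from the `transcript` string using only the standard library. The function name `parse_named_cues` is hypothetical, and the code assumes each cue has a single-line payload that starts with a `<v Name>` tag, as in the example above; a full WebVTT parser would handle more cases.

```python
import re

# Cue timing line, e.g. "00:00:00.000 --> 00:00:05.000".
TIMING = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})$")
# Voice tag followed by the cue text, e.g. "<v John Smith>Hello everyone".
VOICE = re.compile(r"^<v ([^>]+)>(.*)$")


def parse_named_cues(vtt: str) -> list[dict[str, str]]:
    """Return start, end, speaker, and text for each single-line voice cue."""
    lines = vtt.splitlines()
    cues = []
    # Pair every line with the line after it: timing line, then payload.
    for timing_line, payload in zip(lines, lines[1:]):
        timing = TIMING.match(timing_line.strip())
        voice = VOICE.match(payload.strip())
        if timing is None or voice is None:
            continue
        cues.append(
            {
                "start": timing.group(1),
                "end": timing.group(2),
                "speaker": voice.group(1),
                "text": voice.group(2),
            }
        )
    return cues


sample = "WEBVTT\n\n00:00:00.000 --> 00:00:05.000\n<v John Smith>Hello everyone\n"
print(parse_named_cues(sample))
```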

### JSON Format (`json`)

**Use case:** Programmatic access with full timing and speaker metadata.

**Format:** Array of segment objects with speaker information, text content, and precise timing.

**Example:**
```json
[
  {
    "speaker": 0,
    "speaker_name": "John Smith",
    "text": "Hello everyone",
    "start": 0.0,
    "end": 5.0
  },
  {
    "speaker": 1,
    "speaker_name": "Jane Doe",
    "text": "Hi there",
    "start": 5.0,
    "end": 12.0
  },
  {
    "speaker": 0,
    "speaker_name": "John Smith",
    "text": "How are you today?",
    "start": 12.0,
    "end": 18.0
  }
]
```

**Request:**
```bash
GET /v1/transcripts/{id}?transcript_format=json
```

**Response:**
```json
{
  "id": "transcript_123",
  "name": "Meeting Recording",
  "transcript_format": "json",
  "transcript": [
    {
      "speaker": 0,
      "speaker_name": "John Smith",
      "text": "Hello everyone",
      "start": 0.0,
      "end": 5.0
    },
    {
      "speaker": 1,
      "speaker_name": "Jane Doe",
      "text": "Hi there",
      "start": 5.0,
      "end": 12.0
    }
  ],
  "participants": [
    {"id": "p1", "speaker": 0, "name": "John Smith"},
    {"id": "p2", "speaker": 1, "name": "Jane Doe"}
  ],
  ...
}
```
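
One thing the structured segments make easy is aggregation. The following sketch sums speaking time per speaker from the `transcript` array. The function name `speaking_time` is an assumption for this example, and the input is the three-segment example from this section rather than a live response.

```python
from collections import defaultdict


def speaking_time(segments: list[dict]) -> dict[str, float]:
    """Sum (end - start) per speaker_name over a list of JSON segments."""
    totals: dict[str, float] = defaultdict(float)
    for segment in segments:
        totals[segment["speaker_name"]] += segment["end"] - segment["start"]
    return dict(totals)


# Segments copied from the example above.
segments = [
    {"speaker": 0, "speaker_name": "John Smith", "text": "Hello everyone", "start": 0.0, "end": 5.0},
    {"speaker": 1, "speaker_name": "Jane Doe", "text": "Hi there", "start": 5.0, "end": 12.0},
    {"speaker": 0, "speaker_name": "John Smith", "text": "How are you today?", "start": 12.0, "end": 18.0},
]
print(speaking_time(segments))  # {'John Smith': 11.0, 'Jane Doe': 7.0}
```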

## Response Structure

All formats return the same base transcript metadata with an additional `transcript_format` field and format-specific `transcript` field:

### Common Fields

- `id`: Transcript identifier
- `user_id`: Owner user ID (if authenticated)
- `name`: Transcript name
- `status`: Processing status
- `locked`: Whether transcript is locked for editing
- `duration`: Total duration in seconds
- `title`: Auto-generated or custom title
- `short_summary`: Brief summary
- `long_summary`: Detailed summary
- `created_at`: Creation timestamp
- `share_mode`: Access control setting
- `source_language`: Original audio language
- `target_language`: Translation target language
- `reviewed`: Whether transcript has been reviewed
- `meeting_id`: Associated meeting ID (if applicable)
- `source_kind`: Source type (live, file, room)
- `room_id`: Associated room ID (if applicable)
- `audio_deleted`: Whether audio has been deleted
- `participants`: Array of participant objects with speaker mappings

### Format-Specific Fields

- `transcript_format`: The format identifier (discriminator field)
- `transcript`: The formatted transcript content (string for text/webvtt formats, array for json format)

## Speaker Name Resolution

All formats resolve speaker IDs to participant names when available:

- If a participant exists for the speaker ID, their name is used
- If no participant exists, a default name like "Speaker 0" is generated
- Speaker IDs are integers (0, 1, 2, etc.) assigned during diarization
17 changes: 17 additions & 0 deletions server/reflector/schemas/transcript_formats.py
@@ -0,0 +1,17 @@
"""Schema definitions for transcript format types and segments."""

from typing import Literal

from pydantic import BaseModel

TranscriptFormat = Literal["text", "text-timestamped", "webvtt-named", "json"]


class TranscriptSegment(BaseModel):
    """A single transcript segment with speaker and timing information."""

    speaker: int
    speaker_name: str
    text: str
    start: float
    end: float
121 changes: 121 additions & 0 deletions server/reflector/utils/transcript_formats.py
@@ -0,0 +1,121 @@
"""Utilities for converting transcript data to various output formats."""

import webvtt

from reflector.db.transcripts import TranscriptParticipant, TranscriptTopic
from reflector.processors.types import Transcript as ProcessorTranscript
from reflector.processors.types import words_to_segments
from reflector.schemas.transcript_formats import TranscriptSegment
from reflector.utils.webvtt import _seconds_to_timestamp

Contributor

I wonder why I'd underscore it. Probably if it is used outside of the file we can remove _



def get_speaker_name(
    speaker: int, participants: list[TranscriptParticipant] | None

Contributor

  • `get_speaker_name` doesn't seem to make sense with arguments like `get_speaker_name(n, None)` - I'd filter out the `None` at the level of the callers - it's their problem

  • the `list[TranscriptParticipant]` data structure adds much ambiguity - what happens when the data `(1, [{speaker: 1, name: "Bob"}, {speaker: 1, name: "Evil Bob"}])` is passed into this function? Is the name Bob or Evil Bob?
    I suggest either

      • more difficult: change the participants format to a mapping `{participantIndex: {participant meta}}` (which could easily be an array, too) - to disambiguate potential double participants

      • easier: since it's already O(n) complexity, also check for dupes here and (option a: warn, option b: crash) // the presence of the check will serve as documentation by itself, answering the question "what do we do with broken data"

Member Author

Ok for the None. For the rest, there are no multiple speakers/participants: the whole point of a speaker id is that it is supposed to be unique per person/speaker. That's what diarization is for. So no, multiple speakers are never assigned to the same portion of words.

Member Author

And the point you raise about broken data applies PRIOR to the call: rendering is not the moment to start caring about broken data, let alone to start emitting a format that works around it.

Contributor

read in isolation, `list[TranscriptParticipant]` without any extra piece of info (either encoded in the type structure or semantics, or in runtime assertions) documents that duplicated participants are a valid state of the system

only the fact that we happen to remember that this is not the case makes it readable enough to us

take it or leave it, I'm not insisting, but that's what I see

) -> str:
    """Get participant name for speaker or default to 'Speaker N'."""
    if participants:
        for participant in participants:
            if participant.speaker == speaker:
                return participant.name
    return f"Speaker {speaker}"
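
For reference, the "easier" option from the review thread above (detect duplicate speaker ids and fail loudly) could look roughly like the sketch below. It is not the PR's implementation: `Participant` is a hypothetical stand-in for `TranscriptParticipant` carrying only the two fields used here, and `build_speaker_index` and `speaker_name` are names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical stand-in for TranscriptParticipant (only the fields used here)."""

    speaker: int
    name: str


def build_speaker_index(participants: list[Participant]) -> dict[int, str]:
    """Map speaker id to name, raising if two participants share a speaker id."""
    index: dict[int, str] = {}
    for participant in participants:
        if participant.speaker in index:
            raise ValueError(f"duplicate participant for speaker {participant.speaker}")
        index[participant.speaker] = participant.name
    return index


def speaker_name(speaker: int, index: dict[int, str]) -> str:
    """Resolve a speaker id, falling back to 'Speaker N' as the function above does."""
    return index.get(speaker, f"Speaker {speaker}")


index = build_speaker_index([Participant(0, "John Smith"), Participant(1, "Jane Doe")])
print(speaker_name(1, index), "|", speaker_name(7, index))
```

Building the index once per transcript also turns each lookup into a dictionary access instead of a scan of the participant list.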


def format_timestamp_mmss(seconds: float) -> str:

Contributor

nit: : int could look better than : float here because float part is ignored anyways (and why'd we also need float seconds)

    """Format seconds as MM:SS timestamp."""
    minutes = int(seconds // 60)
    secs = int(seconds % 60)
    return f"{minutes:02d}:{secs:02d}"
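
Two properties of this helper are easy to miss when reading the diff: the fractional part of `seconds` is truncated rather than rounded, and minutes are not wrapped into hours, so recordings longer than an hour produce values such as `62:05`. The snippet below repeats the function body from the diff so that it runs on its own and demonstrates both cases.

```python
def format_timestamp_mmss(seconds: float) -> str:
    """Format seconds as MM:SS timestamp (body copied from the diff above)."""
    minutes = int(seconds // 60)
    secs = int(seconds % 60)
    return f"{minutes:02d}:{secs:02d}"


print(format_timestamp_mmss(12.9))    # 00:12 (fraction truncated, not rounded)
print(format_timestamp_mmss(3725.0))  # 62:05 (no hours component)
```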


def transcript_to_text(
    topics: list[TranscriptTopic], participants: list[TranscriptParticipant] | None

Contributor

same question to list[TranscriptParticipant] | None as before - e.g., what's the semantic difference between None and empty list here - the type hints that there IS distinction

Member Author

And participants can be empty, per historical data. Not empty list, just None not filled. It was easier to do it there rather than changing all callers, and handle the special case of None and pass empty list.

Contributor

We are enabling the historical development process to leak into new code this way. This stuff compounds IMO

) -> str:
    """Convert transcript topics to plain text with speaker names."""
    lines = []
    for topic in topics:
        if not topic.words:
            continue

        transcript = ProcessorTranscript(words=topic.words)
        segments = transcript.as_segments()

        for segment in segments:
            speaker_name = get_speaker_name(segment.speaker, participants)
            text = segment.text.strip()
            lines.append(f"{speaker_name}: {text}")

    return "\n".join(lines)


def transcript_to_text_timestamped(
    topics: list[TranscriptTopic], participants: list[TranscriptParticipant] | None
) -> str:
    """Convert transcript topics to timestamped text with speaker names."""
    lines = []
    for topic in topics:
        if not topic.words:
            continue

        transcript = ProcessorTranscript(words=topic.words)
        segments = transcript.as_segments()

        for segment in segments:
            speaker_name = get_speaker_name(segment.speaker, participants)
            timestamp = format_timestamp_mmss(segment.start)
            text = segment.text.strip()
            lines.append(f"[{timestamp}] {speaker_name}: {text}")

    return "\n".join(lines)


def topics_to_webvtt_named(
    topics: list[TranscriptTopic], participants: list[TranscriptParticipant] | None
) -> str:
    """Convert transcript topics to WebVTT format with participant names."""
    vtt = webvtt.WebVTT()

    for topic in topics:
        if not topic.words:
            continue

        segments = words_to_segments(topic.words)

        for segment in segments:
            speaker_name = get_speaker_name(segment.speaker, participants)
            text = segment.text.strip()
            text = f"<v {speaker_name}>{text}"

Contributor

I'd leave a `// https://www.w3.org/TR/webvtt1/#introduction-caption` comment for documentation completeness; it's a standard and not going to change re: this code, and it's nice to reinforce the trust of the reader that we don't invent anything here (I as a reader may suspect so, seeing raw string concatenation)

Member Author

But yes, it's standard WebVTT, so why put it here? I mean, we don't do that for every part of the implementation (like the timestamp formatting or the `-->`), so why this tag specifically?

Contributor

I'd just make sure it's reasonably obvious throughout any place where raw string concatenation is done. It doesn't have to be specifically on this line. If no such contextually appropriate place exists, e.g. because the format is implemented across different files, I'd just add the comment everywhere the format is being implemented


            caption = webvtt.Caption(
                start=_seconds_to_timestamp(segment.start),
                end=_seconds_to_timestamp(segment.end),
                text=text,
            )
            vtt.captions.append(caption)

    return vtt.content


def transcript_to_json_segments(
    topics: list[TranscriptTopic], participants: list[TranscriptParticipant] | None

Contributor

to comments about list[TranscriptParticipant] | None above - maybe None is the case when "we can't infer participants", in that case it could benefit to have | None in some contextual-related code scopes, but I'd pose the question "why haven't we failed the pipeline by the point we can't get participants"

) -> list[TranscriptSegment]:
    """Convert transcript topics to a flat list of JSON segments."""
    segments = []

    for topic in topics:
        if not topic.words:
            continue

        transcript = ProcessorTranscript(words=topic.words)
        for segment in transcript.as_segments():
            speaker_name = get_speaker_name(segment.speaker, participants)
            segments.append(
                TranscriptSegment(
                    speaker=segment.speaker,
                    speaker_name=speaker_name,
                    text=segment.text.strip(),
                    start=segment.start,
                    end=segment.end,
                )
            )

    return segments