feat: add transcript format parameter to GET endpoint #709
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,241 @@ | ||
| # Transcript Formats | ||
|
|
||
| The Reflector API provides multiple output formats for transcript data through the `transcript_format` query parameter on the GET `/v1/transcripts/{id}` endpoint. | ||
|
|
||
| ## Overview | ||
|
|
||
| When retrieving a transcript, you can specify the desired format using the `transcript_format` query parameter. The API supports four formats optimized for different use cases: | ||
|
|
||
| - **text** - Plain text with speaker names (default) | ||
| - **text-timestamped** - Timestamped text with speaker names | ||
| - **webvtt-named** - WebVTT subtitle format with participant names | ||
| - **json** - Structured JSON segments with full metadata | ||
|
|
||
| All formats include participant information when available, resolving speaker IDs to actual names. | ||
|
|
||
| ## Query Parameter Usage | ||
|
|
||
| ``` | ||
| GET /v1/transcripts/{id}?transcript_format={format} | ||
| ``` | ||
|
|
||
| ### Parameters | ||
|
|
||
| - `transcript_format` (optional): The desired output format | ||
|   - Type: `"text" | "text-timestamped" | "webvtt-named" | "json"` | ||
|   - Default: `"text"` | ||
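
As a rough illustration of the parameter described above, the following sketch builds the request URL on the client side and rejects unknown format values before any request is made. The host name `reflector.example.com` and the helper name `build_transcript_url` are assumptions made for this example; only the path and the `transcript_format` parameter come from the documentation.

```python
from urllib.parse import quote, urlencode

# The four values documented above; "text" is the documented default.
TRANSCRIPT_FORMATS = ("text", "text-timestamped", "webvtt-named", "json")


def build_transcript_url(
    base_url: str, transcript_id: str, transcript_format: str = "text"
) -> str:
    """Build the GET URL for a transcript in the requested format.

    Hypothetical client-side helper, not part of the Reflector API.
    """
    if transcript_format not in TRANSCRIPT_FORMATS:
        raise ValueError(f"unsupported transcript_format: {transcript_format!r}")
    path = f"/v1/transcripts/{quote(transcript_id, safe='')}"
    query = urlencode({"transcript_format": transcript_format})
    return f"{base_url.rstrip('/')}{path}?{query}"


# The base URL is a placeholder for wherever the API is hosted.
print(build_transcript_url("https://reflector.example.com", "transcript_123", "webvtt-named"))
```

Validating on the client is optional; it only fails earlier than the server would.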
|
|
||
| ## Format Descriptions | ||
|
|
||
| ### Text Format (`text`) | ||
|
|
||
| **Use case:** Simple, human-readable transcript for display or export. | ||
|
|
||
| **Format:** Speaker names followed by their dialogue, one line per segment. | ||
|
|
||
| **Example:** | ||
| ``` | ||
| John Smith: Hello everyone | ||
| Jane Doe: Hi there | ||
| John Smith: How are you today? | ||
| ``` | ||
|
|
||
| **Request:** | ||
| ```bash | ||
| GET /v1/transcripts/{id}?transcript_format=text | ||
| ``` | ||
|
|
||
| **Response:** | ||
| ```json | ||
| { | ||
| "id": "transcript_123", | ||
| "name": "Meeting Recording", | ||
| "transcript_format": "text", | ||
| "transcript": "John Smith: Hello everyone\nJane Doe: Hi there\nJohn Smith: How are you today?", | ||
| "participants": [ | ||
| {"id": "p1", "speaker": 0, "name": "John Smith"}, | ||
| {"id": "p2", "speaker": 1, "name": "Jane Doe"} | ||
| ], | ||
| ... | ||
| } | ||
| ``` | ||
|
|
||
| ### Text Timestamped Format (`text-timestamped`) | ||
|
|
||
| **Use case:** Transcript with timing information for navigation or reference. | ||
|
|
||
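| **Format:** A `[MM:SS]` timestamp prefix before each speaker name and line of dialogue. Minutes are not wrapped at the hour, so a segment starting at 3661 seconds is rendered as `[61:01]`. | ||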
|
|
||
| **Example:** | ||
| ``` | ||
| [00:00] John Smith: Hello everyone | ||
| [00:05] Jane Doe: Hi there | ||
| [00:12] John Smith: How are you today? | ||
| ``` | ||
|
|
||
| **Request:** | ||
| ```bash | ||
| GET /v1/transcripts/{id}?transcript_format=text-timestamped | ||
| ``` | ||
|
|
||
| **Response:** | ||
| ```json | ||
| { | ||
| "id": "transcript_123", | ||
| "name": "Meeting Recording", | ||
| "transcript_format": "text-timestamped", | ||
| "transcript": "[00:00] John Smith: Hello everyone\n[00:05] Jane Doe: Hi there\n[00:12] John Smith: How are you today?", | ||
| "participants": [ | ||
| {"id": "p1", "speaker": 0, "name": "John Smith"}, | ||
| {"id": "p2", "speaker": 1, "name": "Jane Doe"} | ||
| ], | ||
| ... | ||
| } | ||
| ``` | ||
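
For consumers that want the timing back out of the `text-timestamped` string, a line can be split with a regular expression. This is a minimal sketch, assuming that speaker names never contain `": "` and that every line follows the `[MM:SS] Name: text` shape shown above; the helper name `parse_timestamped_line` is made up for this example.

```python
import re

# [minutes:seconds] speaker: text  (minutes may exceed two digits)
LINE_RE = re.compile(r"^\[(\d+):(\d{2})\] (.*?): (.*)$")


def parse_timestamped_line(line: str) -> tuple[int, str, str]:
    """Return (start_seconds, speaker_name, text) for one transcript line."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"not a timestamped transcript line: {line!r}")
    minutes, seconds, speaker, text = match.groups()
    return int(minutes) * 60 + int(seconds), speaker, text


transcript = "[00:00] John Smith: Hello everyone\n[00:05] Jane Doe: Hi there"
for line in transcript.splitlines():
    print(parse_timestamped_line(line))
```

When exact timing matters, the `json` format is likely the better choice, since this format only carries whole seconds.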
|
|
||
| ### WebVTT Named Format (`webvtt-named`) | ||
|
|
||
| **Use case:** Subtitle files for video players, accessibility tools, or video editing. | ||
|
|
||
| **Format:** Standard WebVTT subtitle format with voice tags using participant names. | ||
|
|
||
| **Example:** | ||
| ``` | ||
| WEBVTT | ||
|
|
||
| 00:00:00.000 --> 00:00:05.000 | ||
| <v John Smith>Hello everyone | ||
|
|
||
| 00:00:05.000 --> 00:00:12.000 | ||
| <v Jane Doe>Hi there | ||
|
|
||
| 00:00:12.000 --> 00:00:18.000 | ||
| <v John Smith>How are you today? | ||
| ``` | ||
|
|
||
| **Request:** | ||
| ```bash | ||
| GET /v1/transcripts/{id}?transcript_format=webvtt-named | ||
| ``` | ||
|
|
||
| **Response:** | ||
| ```json | ||
| { | ||
| "id": "transcript_123", | ||
| "name": "Meeting Recording", | ||
| "transcript_format": "webvtt-named", | ||
| "transcript": "WEBVTT\n\n00:00:00.000 --> 00:00:05.000\n<v John Smith>Hello everyone\n\n...", | ||
| "participants": [ | ||
| {"id": "p1", "speaker": 0, "name": "John Smith"}, | ||
| {"id": "p2", "speaker": 1, "name": "Jane Doe"} | ||
| ], | ||
| ... | ||
| } | ||
| ``` | ||
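
The `<v Name>` voice tags can be read back without a WebVTT library when the input is as regular as the example above. The sketch below assumes cues have no identifier lines and are separated by a single blank line, as in the sample; it is not a general WebVTT parser, and `parse_named_cues` is a hypothetical helper name.

```python
import re

# A voice tag at the start of the cue payload: <v Speaker Name>text
VOICE_TAG = re.compile(r"^<v ([^>]+)>(.*)$", re.DOTALL)


def parse_named_cues(vtt_text: str) -> list[dict]:
    """Extract start, end, speaker and text from each cue block."""
    cues = []
    for block in vtt_text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        # Skip the "WEBVTT" header and anything that is not a cue.
        if len(lines) < 2 or "-->" not in lines[0]:
            continue
        start, end = (part.strip() for part in lines[0].split("-->", 1))
        payload = "\n".join(lines[1:])
        match = VOICE_TAG.match(payload)
        speaker, text = match.groups() if match else (None, payload)
        cues.append({"start": start, "end": end, "speaker": speaker, "text": text})
    return cues


sample = (
    "WEBVTT\n\n"
    "00:00:00.000 --> 00:00:05.000\n<v John Smith>Hello everyone\n\n"
    "00:00:05.000 --> 00:00:12.000\n<v Jane Doe>Hi there\n"
)
for cue in parse_named_cues(sample):
    print(cue)
```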
|
|
||
| ### JSON Format (`json`) | ||
|
|
||
| **Use case:** Programmatic access with full timing and speaker metadata. | ||
|
|
||
| **Format:** Array of segment objects with speaker information, text content, and precise timing. | ||
|
|
||
| **Example:** | ||
| ```json | ||
| [ | ||
| { | ||
| "speaker": 0, | ||
| "speaker_name": "John Smith", | ||
| "text": "Hello everyone", | ||
| "start": 0.0, | ||
| "end": 5.0 | ||
| }, | ||
| { | ||
| "speaker": 1, | ||
| "speaker_name": "Jane Doe", | ||
| "text": "Hi there", | ||
| "start": 5.0, | ||
| "end": 12.0 | ||
| }, | ||
| { | ||
| "speaker": 0, | ||
| "speaker_name": "John Smith", | ||
| "text": "How are you today?", | ||
| "start": 12.0, | ||
| "end": 18.0 | ||
| } | ||
| ] | ||
| ``` | ||
|
|
||
| **Request:** | ||
| ```bash | ||
| GET /v1/transcripts/{id}?transcript_format=json | ||
| ``` | ||
|
|
||
| **Response:** | ||
| ```json | ||
| { | ||
| "id": "transcript_123", | ||
| "name": "Meeting Recording", | ||
| "transcript_format": "json", | ||
| "transcript": [ | ||
| { | ||
| "speaker": 0, | ||
| "speaker_name": "John Smith", | ||
| "text": "Hello everyone", | ||
| "start": 0.0, | ||
| "end": 5.0 | ||
| }, | ||
| { | ||
| "speaker": 1, | ||
| "speaker_name": "Jane Doe", | ||
| "text": "Hi there", | ||
| "start": 5.0, | ||
| "end": 12.0 | ||
| } | ||
| ], | ||
| "participants": [ | ||
| {"id": "p1", "speaker": 0, "name": "John Smith"}, | ||
| {"id": "p2", "speaker": 1, "name": "Jane Doe"} | ||
| ], | ||
| ... | ||
| } | ||
| ``` | ||
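
Because each `json` segment carries `start` and `end` in seconds, simple aggregates need no parsing. As an illustration, this sketch totals speaking time per `speaker_name` using the three segments from the example above; `speaking_time_by_speaker` is a name chosen for this example only.

```python
from collections import defaultdict


def speaking_time_by_speaker(segments: list[dict]) -> dict[str, float]:
    """Sum (end - start) per speaker_name across all segments."""
    totals: dict[str, float] = defaultdict(float)
    for segment in segments:
        totals[segment["speaker_name"]] += segment["end"] - segment["start"]
    return dict(totals)


# The segments from the documented example response.
segments = [
    {"speaker": 0, "speaker_name": "John Smith", "text": "Hello everyone", "start": 0.0, "end": 5.0},
    {"speaker": 1, "speaker_name": "Jane Doe", "text": "Hi there", "start": 5.0, "end": 12.0},
    {"speaker": 0, "speaker_name": "John Smith", "text": "How are you today?", "start": 12.0, "end": 18.0},
]
print(speaking_time_by_speaker(segments))  # {'John Smith': 11.0, 'Jane Doe': 7.0}
```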
|
|
||
| ## Response Structure | ||
|
|
||
| All formats return the same base transcript metadata, plus a `transcript_format` field and a format-specific `transcript` field: | ||
|
|
||
| ### Common Fields | ||
|
|
||
| - `id`: Transcript identifier | ||
| - `user_id`: Owner user ID (if authenticated) | ||
| - `name`: Transcript name | ||
| - `status`: Processing status | ||
| - `locked`: Whether transcript is locked for editing | ||
| - `duration`: Total duration in seconds | ||
| - `title`: Auto-generated or custom title | ||
| - `short_summary`: Brief summary | ||
| - `long_summary`: Detailed summary | ||
| - `created_at`: Creation timestamp | ||
| - `share_mode`: Access control setting | ||
| - `source_language`: Original audio language | ||
| - `target_language`: Translation target language | ||
| - `reviewed`: Whether transcript has been reviewed | ||
| - `meeting_id`: Associated meeting ID (if applicable) | ||
| - `source_kind`: Source type (live, file, room) | ||
| - `room_id`: Associated room ID (if applicable) | ||
| - `audio_deleted`: Whether audio has been deleted | ||
| - `participants`: Array of participant objects with speaker mappings | ||
|
|
||
| ### Format-Specific Fields | ||
|
|
||
| - `transcript_format`: The format identifier (discriminator field) | ||
| - `transcript`: The formatted transcript content (string for text/webvtt formats, array for json format) | ||
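
Since `transcript_format` acts as a discriminator, a client can branch on it before touching `transcript`. The following is one possible way to do that with plain dictionaries, assuming the response has already been decoded from JSON; `transcript_as_lines` is a hypothetical helper, not part of the API.

```python
def transcript_as_lines(response: dict) -> list[str]:
    """Return the transcript as a list of lines, whatever the format.

    Branches on the `transcript_format` discriminator: the `json` format
    carries a list of segment objects, the other three carry one string.
    """
    fmt = response["transcript_format"]
    body = response["transcript"]
    if fmt == "json":
        return [f'{seg["speaker_name"]}: {seg["text"]}' for seg in body]
    if fmt in ("text", "text-timestamped", "webvtt-named"):
        return body.splitlines()
    raise ValueError(f"unknown transcript_format: {fmt!r}")


text_response = {
    "transcript_format": "text",
    "transcript": "John Smith: Hello everyone\nJane Doe: Hi there",
}
json_response = {
    "transcript_format": "json",
    "transcript": [
        {"speaker": 0, "speaker_name": "John Smith", "text": "Hello everyone", "start": 0.0, "end": 5.0},
    ],
}
print(transcript_as_lines(text_response))
print(transcript_as_lines(json_response))
```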
|
|
||
| ## Speaker Name Resolution | ||
|
|
||
| All formats resolve speaker IDs to participant names when available: | ||
|
|
||
| - If a participant exists for the speaker ID, their name is used | ||
| - If no participant exists, a default name like "Speaker 0" is generated | ||
| - Speaker IDs are integers (0, 1, 2, etc.) assigned during diarization | ||
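
The resolution rules above can be sketched against the `participants` array as it appears in the response examples. This standalone version works on plain dictionaries rather than the server-side participant objects, and the name `resolve_speaker_name` is an assumption for this example.

```python
from typing import Optional


def resolve_speaker_name(speaker: int, participants: Optional[list]) -> str:
    """Return the participant name for a speaker ID, or 'Speaker N'.

    `participants` is a list of dicts shaped like the response examples,
    e.g. {"id": "p1", "speaker": 0, "name": "John Smith"}, or None.
    """
    for participant in participants or []:
        if participant["speaker"] == speaker:
            return participant["name"]
    # No participant is mapped to this speaker ID: fall back to a default.
    return f"Speaker {speaker}"


participants = [
    {"id": "p1", "speaker": 0, "name": "John Smith"},
    {"id": "p2", "speaker": 1, "name": "Jane Doe"},
]
print(resolve_speaker_name(1, participants))  # Jane Doe
print(resolve_speaker_name(2, participants))  # Speaker 2
print(resolve_speaker_name(0, None))          # Speaker 0
```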
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| """Schema definitions for transcript format types and segments.""" | ||
|
|
||
| from typing import Literal | ||
|
|
||
| from pydantic import BaseModel | ||
|
|
||
| TranscriptFormat = Literal["text", "text-timestamped", "webvtt-named", "json"] | ||
|
|
||
|
|
||
| class TranscriptSegment(BaseModel): | ||
|     """A single transcript segment with speaker and timing information.""" | ||
|
|
||
|     speaker: int | ||
|     speaker_name: str | ||
|     text: str | ||
|     start: float | ||
|     end: float |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,121 @@ | ||
| """Utilities for converting transcript data to various output formats.""" | ||
|
|
||
| import webvtt | ||
|
|
||
| from reflector.db.transcripts import TranscriptParticipant, TranscriptTopic | ||
| from reflector.processors.types import Transcript as ProcessorTranscript | ||
| from reflector.processors.types import words_to_segments | ||
| from reflector.schemas.transcript_formats import TranscriptSegment | ||
| from reflector.utils.webvtt import _seconds_to_timestamp | ||
|
I wonder why I'd underscore it. Probably if it is used outside of the file we can remove |
||
|
|
||
|
|
||
| def get_speaker_name( | ||
|     speaker: int, participants: list[TranscriptParticipant] | None | ||
|
Ok for the None. For the rest, there are no multiple speakers/participants: the whole point of a speaker ID is that it is supposed to be unique to a person/speaker. That's what diarization is for. So no, multiple speakers are never assigned to the same portion of words.
And you raise the point about broken data or such PRIOR to the call; it's not when you start rendering that you need to care about broken data, let alone start emitting a format that can work around broken data.
Read in isolation, it is readable enough only because we happen to remember that this is not the case. Take it or leave it, I'm not insisting, but that's what I see. |
||
| ) -> str: | ||
|     """Get participant name for speaker or default to 'Speaker N'.""" | ||
|     if participants: | ||
|         for participant in participants: | ||
|             if participant.speaker == speaker: | ||
|                 return participant.name | ||
|     return f"Speaker {speaker}" | ||
|
|
||
|
|
||
| def format_timestamp_mmss(seconds: float) -> str: | ||
|
nit: |
||
|     """Format seconds as MM:SS timestamp.""" | ||
|     minutes = int(seconds // 60) | ||
|     secs = int(seconds % 60) | ||
|     return f"{minutes:02d}:{secs:02d}" | ||
|
|
||
|
|
||
| def transcript_to_text( | ||
|     topics: list[TranscriptTopic], participants: list[TranscriptParticipant] | None | ||
|
same question to
And participants can be empty, per historical data. Not an empty list, just None, not filled. It was easier to handle it there than to change all callers to handle the special case of None and pass an empty list.
We are enabling the historical development process to leak into new code this way. This stuff compounds IMO |
||
| ) -> str: | ||
|     """Convert transcript topics to plain text with speaker names.""" | ||
|     lines = [] | ||
|     for topic in topics: | ||
|         if not topic.words: | ||
|             continue | ||
|
|
||
|         transcript = ProcessorTranscript(words=topic.words) | ||
|         segments = transcript.as_segments() | ||
|
|
||
|         for segment in segments: | ||
|             speaker_name = get_speaker_name(segment.speaker, participants) | ||
|             text = segment.text.strip() | ||
|             lines.append(f"{speaker_name}: {text}") | ||
|
|
||
|     return "\n".join(lines) | ||
|
|
||
|
|
||
| def transcript_to_text_timestamped( | ||
|     topics: list[TranscriptTopic], participants: list[TranscriptParticipant] | None | ||
| ) -> str: | ||
|     """Convert transcript topics to timestamped text with speaker names.""" | ||
|     lines = [] | ||
|     for topic in topics: | ||
|         if not topic.words: | ||
|             continue | ||
|
|
||
|         transcript = ProcessorTranscript(words=topic.words) | ||
|         segments = transcript.as_segments() | ||
|
|
||
|         for segment in segments: | ||
|             speaker_name = get_speaker_name(segment.speaker, participants) | ||
|             timestamp = format_timestamp_mmss(segment.start) | ||
|             text = segment.text.strip() | ||
|             lines.append(f"[{timestamp}] {speaker_name}: {text}") | ||
|
|
||
|     return "\n".join(lines) | ||
|
|
||
|
|
||
| def topics_to_webvtt_named( | ||
|     topics: list[TranscriptTopic], participants: list[TranscriptParticipant] | None | ||
| ) -> str: | ||
|     """Convert transcript topics to WebVTT format with participant names.""" | ||
|     vtt = webvtt.WebVTT() | ||
|
|
||
|     for topic in topics: | ||
|         if not topic.words: | ||
|             continue | ||
|
|
||
|         segments = words_to_segments(topic.words) | ||
|
|
||
|         for segment in segments: | ||
|             speaker_name = get_speaker_name(segment.speaker, participants) | ||
|             text = segment.text.strip() | ||
|             text = f"<v {speaker_name}>{text}" | ||
|
I'd leave a // https://www.w3.org/TR/webvtt1/#introduction-caption comment for documentation completeness; it's a standard and not going to change re: this code, and it's nice to reinforce the trust of the reader that we don't invent anything here (I as a reader may suspect so, seeing raw string concatenation).
But yes, it's standard WebVTT, so why put it here? I mean, we don't do that for every part of the implementation (like the timestamp formatting or the -->), so why this tag specifically?
I'd just make sure it's reasonably obvious throughout any place where raw string concatenation is done. It doesn't have to be specifically on this line. If no such contextually appropriate place exists, e.g. the format is implemented across different files, I'd just add the comment everywhere the format is being implemented. |
||
|
|
||
|             caption = webvtt.Caption( | ||
|                 start=_seconds_to_timestamp(segment.start), | ||
|                 end=_seconds_to_timestamp(segment.end), | ||
|                 text=text, | ||
|             ) | ||
|             vtt.captions.append(caption) | ||
|
|
||
|     return vtt.content | ||
|
|
||
|
|
||
| def transcript_to_json_segments( | ||
|     topics: list[TranscriptTopic], participants: list[TranscriptParticipant] | None | ||
|
to comments about |
||
| ) -> list[TranscriptSegment]: | ||
|     """Convert transcript topics to a flat list of JSON segments.""" | ||
|     segments = [] | ||
|
|
||
|     for topic in topics: | ||
|         if not topic.words: | ||
|             continue | ||
|
|
||
|         transcript = ProcessorTranscript(words=topic.words) | ||
|         for segment in transcript.as_segments(): | ||
|             speaker_name = get_speaker_name(segment.speaker, participants) | ||
|             segments.append( | ||
|                 TranscriptSegment( | ||
|                     speaker=segment.speaker, | ||
|                     speaker_name=speaker_name, | ||
|                     text=segment.text.strip(), | ||
|                     start=segment.start, | ||
|                     end=segment.end, | ||
|                 ) | ||
|             ) | ||
|
|
||
|     return segments
example of multiple speakers at the same time could be beneficial here
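
The `format_timestamp_mmss` helper from the diff is small enough to try in isolation. The copy below reproduces its body as shown in the PR, purely as a quick check of edge cases rather than a proposed change; the sample inputs are chosen for illustration.

```python
def format_timestamp_mmss(seconds: float) -> str:
    """Format seconds as MM:SS timestamp (copied from the diff)."""
    minutes = int(seconds // 60)
    secs = int(seconds % 60)
    return f"{minutes:02d}:{secs:02d}"


# Fractional seconds are truncated, not rounded.
print(format_timestamp_mmss(5.9))   # 00:05
# Minutes are not wrapped into hours, so long recordings exceed two digits.
print(format_timestamp_mmss(3661))  # 61:01
print(format_timestamp_mmss(6000))  # 100:00
```

Whether timestamps past the hour should switch to `HH:MM:SS` is a design question for the PR; the code as written keeps counting minutes.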