13 Transcript

The Transcript class is designed to handle speech-to-text output generated by machine learning models such as OpenAI's Whisper, including output with word-level timestamps.

Constructing a Transcript

You typically create a Transcript instance from JSON data. The JSON should adhere to the following structure:

type Captions = {
	token: string; 	// The spoken word
	start: number; 	// The start in milliseconds
	stop: number;	// The stop in milliseconds
}[][];

The JSON is a nested (two-dimensional) array: the outer array holds sentences, and each sentence is an array of word objects (tokens). This structure preserves the semantic grouping of words.
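For illustration, here is a small `Captions` value with two sentences. The helpers `countWords` and `isWellFormed` are hypothetical and not part of `@diffusionstudio/core`; the ordering check in `isWellFormed` is our own assumption about what "valid" data looks like, not a documented requirement.

```typescript
// Mirrors the Captions structure described above (times in milliseconds).
type Captions = {
  token: string; // the spoken word
  start: number; // start time in milliseconds
  stop: number;  // stop time in milliseconds
}[][];

// Two sentences, each a list of word objects.
const captions: Captions = [
  [
    { token: 'Hello', start: 0, stop: 300 },
    { token: 'World.', start: 320, stop: 600 },
  ],
  [
    { token: 'How', start: 900, stop: 1050 },
    { token: 'are', start: 1060, stop: 1200 },
    { token: 'you?', start: 1210, stop: 1500 },
  ],
];

// Hypothetical helper: total number of words across all sentences.
function countWords(data: Captions): number {
  return data.reduce((sum, sentence) => sum + sentence.length, 0);
}

// Hypothetical sanity check (assumption): every word has stop >= start,
// and words within a sentence do not overlap or go backwards in time.
function isWellFormed(data: Captions): boolean {
  return data.every((sentence) =>
    sentence.every(
      (word, i) =>
        word.stop >= word.start &&
        (i === 0 || word.start >= sentence[i - 1].stop),
    ),
  );
}

console.log(countWords(captions));   // 5
console.log(isWellFormed(captions)); // true
```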

To create a Transcript from this JSON, use the following:

import * as core from '@diffusionstudio/core';

// `captions` is of type Captions
const transcript = core.Transcript.fromJSON(captions);

// or, to load from a remote JSON file:
const remoteTranscript = await core.Transcript.from('https://.../captions.json');
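Whisper output usually has to be reshaped before it matches `Captions`. The sketch below assumes a flat list of words with `start` and `end` in seconds (the shape OpenAI's transcription API returns when word-level timestamps are requested); `whisperWordsToCaptions` is a hypothetical helper. It converts seconds to milliseconds and starts a new sentence after any token ending in `.`, `!` or `?`. If your words carry no punctuation, everything ends up in a single sentence, so adapt the splitting rule as needed. The result could then be passed to `core.Transcript.fromJSON`.

```typescript
// Assumed shape of one word in a Whisper word-level response (seconds).
interface WhisperWord {
  word: string;
  start: number;
  end: number;
}

// Captions format described above (milliseconds).
type Captions = { token: string; start: number; stop: number }[][];

// Hypothetical helper: seconds -> milliseconds, and split the flat word
// list into sentences at terminal punctuation (. ! ?).
function whisperWordsToCaptions(words: WhisperWord[]): Captions {
  const sentences: Captions = [];
  let current: Captions[number] = [];

  for (const w of words) {
    const token = w.word.trim();
    current.push({
      token,
      start: Math.round(w.start * 1000),
      stop: Math.round(w.end * 1000),
    });
    // Close the current sentence when the token ends with . ! or ?
    if (/[.!?]$/.test(token)) {
      sentences.push(current);
      current = [];
    }
  }
  // Keep trailing words that were not closed by punctuation.
  if (current.length > 0) sentences.push(current);
  return sentences;
}

const converted = whisperWordsToCaptions([
  { word: 'Hello', start: 0, end: 0.3 },
  { word: 'world.', start: 0.32, end: 0.6 },
  { word: 'Bye', start: 0.9, end: 1.2 },
]);
console.log(JSON.stringify(converted));
```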

Manual Construction

You can also manually create a Transcript instance:

import * as core from '@diffusionstudio/core';

const transcript = new core.Transcript([
  new core.WordGroup([
    // new core.Word(token, start, stop), times in milliseconds
    new core.Word('Hello', 0, 300),
    new core.Word('World', 320, 600),
  ]),
]);

Utility Methods

The Transcript class provides several utility methods:

transcript.optimize();
transcript.toSRT();
transcript.slice(20);

optimize(): Adjusts word timestamps to improve readability when the transcript is placed on a timeline.
toSRT(): Converts the transcript to an SRT-formatted Blob, which can be downloaded and imported into most video editing applications.
slice(wordCount: number): Returns a new Transcript containing only the specified number of words, which is useful for generating preview captions.
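To show what SRT output looks like, here is a minimal sketch that turns `Captions` data into SRT text, assuming one cue per sentence. `captionsToSrt` and `srtTime` are hypothetical helpers, not the library's `toSRT()` implementation, which may group words differently; it returns a string rather than a Blob.

```typescript
type Captions = { token: string; start: number; stop: number }[][];

// Format integer milliseconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(ms: number): string {
  const pad = (n: number, width: number) => String(n).padStart(width, '0');
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const rest = ms % 1000;
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(rest, 3)}`;
}

// Hypothetical helper: one numbered SRT cue per non-empty sentence,
// spanning from its first word's start to its last word's stop.
function captionsToSrt(data: Captions): string {
  return data
    .filter((sentence) => sentence.length > 0)
    .map((sentence, i) => {
      const start = sentence[0].start;
      const stop = sentence[sentence.length - 1].stop;
      const text = sentence.map((w) => w.token).join(' ');
      return `${i + 1}\n${srtTime(start)} --> ${srtTime(stop)}\n${text}\n`;
    })
    .join('\n'); // blank line between cues
}

console.log(
  captionsToSrt([
    [
      { token: 'Hello', start: 0, stop: 300 },
      { token: 'World', start: 320, stop: 600 },
    ],
  ]),
);
```

In a browser, such a string could be wrapped as `new Blob([srt], { type: 'text/plain' })` for download.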

Iterating Over Words

The Transcript class offers a flexible way to iterate over words via the iter method:

for (const group of transcript.iter({ count: [2] })) {
  // Each group will contain up to two words
}

The iter method groups words according to the options you pass. Each option takes one or two values: with one value the limit is fixed, and with two values a random number between them is chosen, introducing a degree of variation intended to improve captioning quality.

The following options are available:

count: iterate by word count
duration: iterate by group duration
length: iterate by the number of characters
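The grouping idea behind these options can be sketched as a generator. This is a hypothetical illustration, not the library's `iter` implementation: `iterGroups`, `pickLimit` and the exact limit rules are assumptions, the duration unit is assumed to be milliseconds, and groups are plain arrays rather than library objects. A fresh limit is drawn for every group, and `random` is injectable so the output can be made deterministic.

```typescript
interface Word {
  token: string;
  start: number; // ms
  stop: number;  // ms
}

type Range = number[]; // one value (fixed) or two values [min, max]

interface IterOptions {
  count?: Range;    // max words per group
  duration?: Range; // max group duration (assumed ms)
  length?: Range;   // max characters per group, spaces included
}

// The single value, or a random integer in [min, max].
function pickLimit(range: Range, random: () => number): number {
  const min = range[0];
  const max = range.length > 1 ? range[1] : min;
  return min + Math.floor(random() * (max - min + 1));
}

// Hypothetical sketch: yield consecutive word groups that respect all
// given limits; every group contains at least one word.
function* iterGroups(
  words: Word[],
  options: IterOptions,
  random: () => number = Math.random,
): Generator<Word[]> {
  let i = 0;
  while (i < words.length) {
    // Draw new limits per group so group sizes can vary.
    const count = options.count ? pickLimit(options.count, random) : Infinity;
    const duration = options.duration ? pickLimit(options.duration, random) : Infinity;
    const length = options.length ? pickLimit(options.length, random) : Infinity;

    const group: Word[] = [words[i++]];
    let chars = group[0].token.length;
    while (i < words.length) {
      const next = words[i];
      const fits =
        group.length + 1 <= count &&
        next.stop - group[0].start <= duration &&
        chars + 1 + next.token.length <= length; // +1 for the joining space
      if (!fits) break;
      group.push(next);
      chars += 1 + next.token.length;
      i++;
    }
    yield group;
  }
}

const words: Word[] = [
  { token: 'Hello', start: 0, stop: 300 },
  { token: 'World', start: 320, stop: 600 },
  { token: 'again', start: 650, stop: 900 },
];
for (const group of iterGroups(words, { count: [1, 2] })) {
  console.log(group.map((w) => w.token).join(' '));
}
```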
