
@mattjcly
Contributor

Summary

Enable computation of timestamps within parakeet_runner through a --timestamps flag (none|token|word|segment|all). The implementation follows the reference implementation in NVIDIA-NeMo/NeMo, which is cited on Hugging Face as the way to run Parakeet and compute timestamps for nvidia/parakeet-tdt-0.6b-v3.

Requires meta-pytorch/tokenizers#163 for the id_to_piece method on Tokenizers.
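
As a reader's sketch (not code from this PR): word-level timestamps need word boundaries, and in SentencePiece vocabularies those boundaries are typically carried by a `▁` prefix on the raw piece rather than by the decoded text, which is one plausible reason a piece lookup is needed. The Python below is illustrative only (the runner is C++). `Token`, `Word`, `group_words`, and the exact piece strings are assumptions; the timings are copied from the basketball.wav token output in the test plan.

```python
from typing import List, NamedTuple

WORD_START = "\u2581"  # SentencePiece word-boundary marker ("▁")


class Token(NamedTuple):
    piece: str   # raw vocabulary piece (assumed form, marker included)
    start: float
    end: float


class Word(NamedTuple):
    text: str
    start: float
    end: float


def group_words(tokens: List[Token]) -> List[Word]:
    """Merge consecutive tokens into words; a marker-prefixed piece opens a new word."""
    words: List[Word] = []
    for tok in tokens:
        starts_word = tok.piece.startswith(WORD_START)
        text = tok.piece[len(WORD_START):] if starts_word else tok.piece
        if starts_word or not words:
            words.append(Word(text, tok.start, tok.end))
        else:
            # Continuation piece or punctuation: extend the previous word.
            prev = words[-1]
            words[-1] = Word(prev.text + text, prev.start, tok.end)
    return words


# Hypothetical pieces; timings from the token-level output for basketball.wav.
TOKENS = [
    Token("\u2581All", 0.56, 0.72),
    Token("\u2581right", 0.72, 0.8),
    Token(",", 0.8, 0.8),
    Token("\u2581b", 0.96, 1.12),
    Token("all", 1.12, 1.28),
    Token("er", 1.28, 1.36),
    Token("ina", 1.6, 1.76),
]

for w in group_words(TOKENS):
    print(f"{w.start}s - {w.end}s : {w.text}")
```

With these inputs the grouped words line up with the first three word-level entries in the test plan (All, right, and ballerina with its 0.96s start and 1.76s end).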

Test plan

Outputs the exact same transcription/timestamps as NVIDIA-NeMo/NeMo for audio files basketball.wav and audio.wav

NeMo (basketball.wav)

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v3")

output = asr_model.transcribe(['basketball.wav'], timestamps=True)

# by default, timestamps are enabled for char, word and segment level
word_timestamps = output[0].timestamp['word'] # word level timestamps for first sample
segment_timestamps = output[0].timestamp['segment'] # segment level timestamps
char_timestamps = output[0].timestamp['char'] # char level timestamps

for stamp in segment_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")

for stamp in word_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['word']}")

for stamp in char_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['char']}")
Output:
Transcribing: 1it [00:01,  1.82s/it]
0.56s - 2.88s : All right, ballerina for the Nuggets and Mavs tonight.
2.96s - 3.6s : Late fourth quarter.
3.68s - 4.88s : Joker to MPJ.
4.96s - 5.68s : Back to you.
5.92s - 7.92s : Joker spins, fakes, scores.
8.08s - 10.24s : Was tied at 118 with two minutes to go.
10.32s - 11.52s : Then the Mavs by two.
11.6s - 12.48s : Joker drives.
12.56s - 14.64s : He'll miss but clean up his own mess.
14.8s - 16.080000000000002s : Tied at 120.
16.240000000000002s - 17.44s : Then time running out.
17.68s - 19.44s : Jamal to MPJ.
20.0s - 22.080000000000002s : He drives, stops, and pops.
22.16s - 23.04s : He had 17.
23.12s - 25.44s : The Nuggets had the lead with 6.5 to go.
25.6s - 26.8s : Mavs had a last chance.
26.88s - 31.28s : Kyrie Irving scored 43 points, but didn't hit the game winner.
31.44s - 32.24s : Joker got it.
32.32s - 34.56s : The Nuggets win at 122-120.
34.800000000000004s - 36.480000000000004s : Incredible stat line for Jokic.
36.72s - 37.36s : One of a kind.
37.6s - 40.88s : 37 points, 18 rebounds, 15 assists.
41.04s - 42.24s : He is simply the best.
42.4s - 45.92s : And the Nuggets have won four in a row and six of their last seven.
0.56s - 0.72s : All
0.72s - 0.8s : right,
0.96s - 1.76s : ballerina
1.76s - 1.84s : for
1.84s - 1.92s : the
1.92s - 2.24s : Nuggets
2.24s - 2.32s : and
2.32s - 2.64s : Mavs
2.64s - 2.88s : tonight.
2.96s - 3.2s : Late
3.2s - 3.44s : fourth
3.44s - 3.6s : quarter.
3.68s - 4.16s : Joker
4.16s - 4.32s : to
4.32s - 4.88s : MPJ.
4.96s - 5.2s : Back
5.2s - 5.44s : to
5.44s - 5.68s : you.
5.92s - 6.48s : Joker
6.48s - 6.88s : spins,
6.96s - 7.36s : fakes,
7.5200000000000005s - 7.92s : scores.
8.08s - 8.32s : Was
8.32s - 8.64s : tied
8.64s - 8.72s : at
8.72s - 9.36s : 118
9.36s - 9.52s : with
9.52s - 9.68s : two
9.68s - 9.92s : minutes
9.92s - 10.0s : to
10.0s - 10.24s : go.
10.32s - 10.56s : Then
10.56s - 10.64s : the
10.64s - 11.120000000000001s : Mavs
11.120000000000001s - 11.200000000000001s : by
11.28s - 11.52s : two.
11.6s - 12.16s : Joker
12.16s - 12.48s : drives.
12.56s - 12.96s : He'll
12.96s - 13.200000000000001s : miss
13.280000000000001s - 13.44s : but
13.44s - 13.68s : clean
13.84s - 13.92s : up
13.92s - 14.08s : his
14.08s - 14.32s : own
14.32s - 14.64s : mess.
14.8s - 15.200000000000001s : Tied
15.200000000000001s - 15.36s : at
15.36s - 16.080000000000002s : 120.
16.240000000000002s - 16.48s : Then
16.48s - 16.72s : time
16.88s - 17.28s : running
17.28s - 17.44s : out.
17.68s - 18.400000000000002s : Jamal
18.400000000000002s - 18.72s : to
18.72s - 19.44s : MPJ.
20.0s - 20.240000000000002s : He
20.240000000000002s - 20.72s : drives,
20.96s - 21.28s : stops,
21.36s - 21.6s : and
21.6s - 22.080000000000002s : pops.
22.16s - 22.32s : He
22.400000000000002s - 22.64s : had
22.64s - 23.04s : 17.
23.12s - 23.28s : The
23.28s - 23.68s : Nuggets
23.68s - 23.84s : had
23.84s - 23.92s : the
23.92s - 24.240000000000002s : lead
24.240000000000002s - 24.400000000000002s : with
24.400000000000002s - 25.12s : 6.5
25.12s - 25.28s : to
25.28s - 25.44s : go.
25.6s - 26.16s : Mavs
26.16s - 26.32s : had
26.32s - 26.400000000000002s : a
26.400000000000002s - 26.560000000000002s : last
26.560000000000002s - 26.8s : chance.
26.88s - 27.36s : Kyrie
27.52s - 28.0s : Irving
28.16s - 28.560000000000002s : scored
28.560000000000002s - 29.2s : 43
29.2s - 29.6s : points,
29.76s - 30.080000000000002s : but
30.080000000000002s - 30.48s : didn't
30.48s - 30.64s : hit
30.64s - 30.72s : the
30.72s - 30.96s : game
30.96s - 31.28s : winner.
31.44s - 32.0s : Joker
32.0s - 32.160000000000004s : got
32.160000000000004s - 32.24s : it.
32.32s - 32.480000000000004s : The
32.480000000000004s - 32.88s : Nuggets
32.88s - 33.12s : win
33.12s - 33.28s : at
33.28s - 34.56s : 122-120.
34.800000000000004s - 35.36s : Incredible
35.36s - 35.6s : stat
35.6s - 35.84s : line
35.84s - 36.0s : for
36.0s - 36.480000000000004s : Jokic.
36.72s - 36.88s : One
36.88s - 37.04s : of
37.04s - 37.12s : a
37.12s - 37.36s : kind.
37.6s - 38.160000000000004s : 37
38.160000000000004s - 38.56s : points,
38.72s - 39.2s : 18
39.2s - 39.68s : rebounds,
39.84s - 40.32s : 15
40.32s - 40.88s : assists.
41.04s - 41.2s : He
41.2s - 41.44s : is
41.44s - 41.76s : simply
41.76s - 42.0s : the
42.0s - 42.24s : best.
42.4s - 42.56s : And
42.56s - 42.64s : the
42.64s - 43.12s : Nuggets
43.12s - 43.28s : have
43.28s - 43.52s : won
43.52s - 43.76s : four
43.76s - 43.92s : in
43.92s - 44.0s : a
44.0s - 44.24s : row
44.24s - 44.56s : and
44.56s - 44.96s : six
44.96s - 45.2s : of
45.2s - 45.36s : their
45.36s - 45.6s : last
45.6s - 45.92s : seven.
0.56s - 0.72s : ['All']
0.72s - 0.8s : ['right']
0.8s - 0.8s : [',']
0.96s - 1.12s : ['b']
1.12s - 1.28s : ['all']
1.28s - 1.36s : ['er']
1.6s - 1.76s : ['ina']
1.76s - 1.84s : ['for']
1.84s - 1.92s : ['the']
1.92s - 2.0s : ['N']
2.0s - 2.08s : ['ug']
2.08s - 2.16s : ['g']
2.16s - 2.24s : ['ets']
2.24s - 2.32s : ['and']
2.32s - 2.4s : ['M']
2.4s - 2.48s : ['av']
2.48s - 2.64s : ['s']
2.64s - 2.72s : ['ton']
2.72s - 2.88s : ['ight']
2.88s - 2.88s : ['.']
2.96s - 3.04s : ['L']
3.04s - 3.2s : ['ate']
3.2s - 3.2800000000000002s : ['fo']
3.2800000000000002s - 3.36s : ['urt']
3.36s - 3.44s : ['h']
3.44s - 3.52s : ['quar']
3.52s - 3.6s : ['ter']
3.6s - 3.6s : ['.']
3.68s - 3.84s : ['J']
3.84s - 4.0s : ['ok']
4.0s - 4.16s : ['er']
4.16s - 4.32s : ['to']
4.32s - 4.48s : ['M']
4.48s - 4.64s : ['P']
4.64s - 4.88s : ['J']
4.88s - 4.88s : ['.']
4.96s - 5.04s : ['B']
5.04s - 5.2s : ['ack']
5.2s - 5.44s : ['to']
5.44s - 5.68s : ['you']
5.68s - 5.68s : ['.']
5.92s - 6.16s : ['J']
6.16s - 6.32s : ['ok']
6.32s - 6.48s : ['er']
6.48s - 6.72s : ['sp']
6.72s - 6.88s : ['ins']
6.88s - 6.88s : [',']
6.96s - 7.12s : ['fak']
7.28s - 7.36s : ['es']
7.36s - 7.36s : [',']
7.5200000000000005s - 7.76s : ['sc']
7.76s - 7.92s : ['ores']
7.92s - 7.92s : ['.']
8.08s - 8.32s : ['Was']
8.32s - 8.64s : ['tied']
8.64s - 8.72s : ['at']
8.72s - 8.8s : ['']
8.8s - 8.88s : ['1']
8.96s - 9.040000000000001s : ['1']
9.120000000000001s - 9.36s : ['8']
9.36s - 9.52s : ['with']
9.52s - 9.68s : ['two']
9.68s - 9.76s : ['minut']
9.76s - 9.92s : ['es']
9.92s - 10.0s : ['to']
10.0s - 10.24s : ['go']
10.24s - 10.24s : ['.']
10.32s - 10.4s : ['T']
10.4s - 10.56s : ['hen']
10.56s - 10.64s : ['the']
10.64s - 10.8s : ['M']
10.8s - 10.96s : ['av']
10.96s - 11.120000000000001s : ['s']
11.120000000000001s - 11.200000000000001s : ['by']
11.28s - 11.52s : ['two']
11.52s - 11.52s : ['.']
11.6s - 11.76s : ['J']
11.76s - 12.0s : ['ok']
12.0s - 12.16s : ['er']
12.16s - 12.24s : ['d']
12.24s - 12.32s : ['ri']
12.32s - 12.48s : ['ves']
12.48s - 12.48s : ['.']
12.56s - 12.72s : ['He']
12.72s - 12.72s : ["'"]
12.8s - 12.96s : ['ll']
12.96s - 13.200000000000001s : ['miss']
13.280000000000001s - 13.44s : ['but']
13.44s - 13.52s : ['c']
13.52s - 13.6s : ['le']
13.6s - 13.68s : ['an']
13.84s - 13.92s : ['up']
13.92s - 14.08s : ['his']
14.08s - 14.16s : ['o']
14.16s - 14.32s : ['wn']
14.32s - 14.48s : ['m']
14.48s - 14.64s : ['ess']
14.64s - 14.64s : ['.']
14.8s - 14.96s : ['T']
14.96s - 15.200000000000001s : ['ied']
15.200000000000001s - 15.36s : ['at']
15.36s - 15.44s : ['']
15.44s - 15.6s : ['1']
15.68s - 15.92s : ['2']
15.92s - 16.080000000000002s : ['0']
16.080000000000002s - 16.080000000000002s : ['.']
16.240000000000002s - 16.32s : ['T']
16.32s - 16.48s : ['hen']
16.48s - 16.72s : ['time']
16.88s - 17.04s : ['run']
17.04s - 17.28s : ['ning']
17.28s - 17.44s : ['out']
17.44s - 17.44s : ['.']
17.68s - 17.84s : ['J']
17.84s - 18.080000000000002s : ['am']
18.080000000000002s - 18.400000000000002s : ['al']
18.400000000000002s - 18.72s : ['to']
18.72s - 18.88s : ['M']
18.88s - 19.04s : ['P']
19.2s - 19.44s : ['J']
19.44s - 19.44s : ['.']
20.0s - 20.240000000000002s : ['He']
20.240000000000002s - 20.400000000000002s : ['d']
20.400000000000002s - 20.64s : ['ri']
20.64s - 20.72s : ['ves']
20.72s - 20.72s : [',']
20.96s - 21.12s : ['stop']
21.12s - 21.28s : ['s']
21.28s - 21.28s : [',']
21.36s - 21.6s : ['and']
21.6s - 21.84s : ['po']
21.84s - 22.080000000000002s : ['ps']
22.080000000000002s - 22.080000000000002s : ['.']
22.16s - 22.32s : ['He']
22.400000000000002s - 22.64s : ['had']
22.64s - 22.72s : ['']
22.72s - 22.88s : ['1']
22.88s - 23.04s : ['7']
23.04s - 23.04s : ['.']
23.12s - 23.28s : ['The']
23.28s - 23.36s : ['N']
23.36s - 23.44s : ['ug']
23.44s - 23.6s : ['g']
23.6s - 23.68s : ['ets']
23.68s - 23.84s : ['had']
23.84s - 23.92s : ['the']
23.92s - 24.080000000000002s : ['le']
24.080000000000002s - 24.240000000000002s : ['ad']
24.240000000000002s - 24.400000000000002s : ['with']
24.400000000000002s - 24.48s : ['']
24.48s - 24.72s : ['6']
24.72s - 24.72s : ['.']
24.88s - 25.12s : ['5']
25.12s - 25.28s : ['to']
25.28s - 25.44s : ['go']
25.44s - 25.44s : ['.']
25.6s - 25.76s : ['M']
25.76s - 26.0s : ['av']
26.0s - 26.16s : ['s']
26.16s - 26.32s : ['had']
26.32s - 26.400000000000002s : ['a']
26.400000000000002s - 26.560000000000002s : ['last']
26.560000000000002s - 26.64s : ['ch']
26.64s - 26.8s : ['ance']
26.8s - 26.8s : ['.']
26.88s - 27.04s : ['K']
27.04s - 27.28s : ['y']
27.28s - 27.36s : ['rie']
27.52s - 27.76s : ['Ir']
27.76s - 27.92s : ['v']
27.92s - 28.0s : ['ing']
28.16s - 28.32s : ['sc']
28.32s - 28.400000000000002s : ['or']
28.400000000000002s - 28.560000000000002s : ['ed']
28.560000000000002s - 28.72s : ['']
28.72s - 28.88s : ['4']
28.88s - 29.2s : ['3']
29.2s - 29.28s : ['po']
29.28s - 29.44s : ['in']
29.44s - 29.6s : ['ts']
29.6s - 29.6s : [',']
29.76s - 30.080000000000002s : ['but']
30.080000000000002s - 30.240000000000002s : ['did']
30.240000000000002s - 30.32s : ['n']
30.32s - 30.32s : ["'"]
30.48s - 30.48s : ['t']
30.48s - 30.64s : ['hit']
30.64s - 30.72s : ['the']
30.72s - 30.8s : ['g']
30.8s - 30.96s : ['ame']
30.96s - 31.04s : ['w']
31.04s - 31.2s : ['inn']
31.2s - 31.28s : ['er']
31.28s - 31.28s : ['.']
31.44s - 31.68s : ['J']
31.68s - 31.84s : ['ok']
31.84s - 32.0s : ['er']
32.0s - 32.160000000000004s : ['got']
32.160000000000004s - 32.24s : ['it']
32.24s - 32.24s : ['.']
32.32s - 32.480000000000004s : ['The']
32.480000000000004s - 32.56s : ['N']
32.56s - 32.64s : ['ug']
32.64s - 32.8s : ['g']
32.8s - 32.88s : ['ets']
32.88s - 32.96s : ['w']
32.96s - 33.12s : ['in']
33.12s - 33.28s : ['at']
33.28s - 33.36s : ['']
33.36s - 33.52s : ['1']
33.52s - 33.76s : ['2']
33.76s - 33.92s : ['2']
33.92s - 33.92s : ['-']
34.0s - 34.08s : ['1']
34.24s - 34.4s : ['2']
34.4s - 34.56s : ['0']
34.56s - 34.56s : ['.']
34.800000000000004s - 34.88s : ['In']
34.88s - 34.96s : ['c']
34.96s - 35.04s : ['re']
35.04s - 35.12s : ['di']
35.12s - 35.36s : ['ble']
35.36s - 35.6s : ['stat']
35.6s - 35.68s : ['l']
35.68s - 35.84s : ['ine']
35.84s - 36.0s : ['for']
36.0s - 36.160000000000004s : ['J']
36.160000000000004s - 36.32s : ['ok']
36.32s - 36.480000000000004s : ['ic']
36.480000000000004s - 36.480000000000004s : ['.']
36.72s - 36.800000000000004s : ['O']
36.800000000000004s - 36.88s : ['ne']
36.88s - 37.04s : ['of']
37.04s - 37.12s : ['a']
37.12s - 37.36s : ['kind']
37.36s - 37.36s : ['.']
37.6s - 37.68s : ['']
37.68s - 37.84s : ['3']
37.92s - 38.160000000000004s : ['7']
38.160000000000004s - 38.32s : ['po']
38.32s - 38.4s : ['in']
38.4s - 38.56s : ['ts']
38.56s - 38.56s : [',']
38.72s - 38.800000000000004s : ['']
38.800000000000004s - 38.96s : ['1']
38.96s - 39.2s : ['8']
39.2s - 39.36s : ['re']
39.36s - 39.44s : ['bo']
39.44s - 39.6s : ['und']
39.6s - 39.68s : ['s']
39.68s - 39.68s : [',']
39.84s - 39.92s : ['']
39.92s - 40.08s : ['1']
40.08s - 40.32s : ['5']
40.32s - 40.480000000000004s : ['ass']
40.480000000000004s - 40.64s : ['ist']
40.64s - 40.88s : ['s']
40.88s - 40.88s : ['.']
41.04s - 41.2s : ['He']
41.2s - 41.44s : ['is']
41.44s - 41.6s : ['simpl']
41.6s - 41.76s : ['y']
41.76s - 42.0s : ['the']
42.0s - 42.24s : ['best']
42.24s - 42.24s : ['.']
42.4s - 42.56s : ['And']
42.56s - 42.64s : ['the']
42.64s - 42.72s : ['N']
42.72s - 42.88s : ['ug']
42.88s - 43.04s : ['g']
43.04s - 43.12s : ['ets']
43.12s - 43.28s : ['have']
43.28s - 43.36s : ['w']
43.36s - 43.52s : ['on']
43.52s - 43.6s : ['fo']
43.6s - 43.76s : ['ur']
43.76s - 43.92s : ['in']
43.92s - 44.0s : ['a']
44.0s - 44.08s : ['ro']
44.08s - 44.24s : ['w']
44.24s - 44.56s : ['and']
44.56s - 44.72s : ['si']
44.72s - 44.96s : ['x']
44.96s - 45.2s : ['of']
45.2s - 45.36s : ['their']
45.36s - 45.6s : ['last']
45.6s - 45.76s : ['se']
45.76s - 45.92s : ['ven']
45.92s - 45.92s : ['.']

parakeet_runner (basketball.wav)

./cmake-out/examples/models/parakeet/parakeet_runner --model_path examples/models/parakeet/parakeet_tdt_exports/parakeet_tdt.pte --tokenizer_path /Users/matt/Workspace/executorch/examples/models/parakeet/parakeet_tdt_exports/tokenizer.model --audio_path basketball.wav --timestamps all
Output:
-> % ./cmake-out/examples/models/parakeet/parakeet_runner --model_path examples/models/parakeet/parakeet_tdt_exports/parakeet_tdt.pte --tokenizer_path /Users/matt/Workspace/executorch/examples/models/parakeet/parakeet_tdt_exports/tokenizer.model --audio_path /Users/matt/Documents/parakeet_test_audio/basketball.wav --timestamps all
I tokenizers:regex.cpp:27] Registering override fallback regex
I 00:00:00.005332 executorch:main.cpp:717] Loading model from: examples/models/parakeet/parakeet_tdt_exports/parakeet_tdt.pte
I 00:00:00.005729 executorch:main.cpp:733] Loading audio from: /Users/matt/Documents/parakeet_test_audio/basketball.wav
I 00:00:00.008822 executorch:wav_loader.h:98] WAV header detected, getting raw audio data.
I 00:00:00.008832 executorch:wav_loader.h:105] RIFF Header: RIFF
I 00:00:00.008835 executorch:wav_loader.h:106] Chunk Size: 1476488
I 00:00:00.008836 executorch:wav_loader.h:113] WAVE Header: WAVE
I 00:00:00.008840 executorch:wav_loader.h:120] Format Header: fmt 
I 00:00:00.008841 executorch:wav_loader.h:121] Format Chunk Size: 16
I 00:00:00.008843 executorch:wav_loader.h:122] Audio Format: 1
I 00:00:00.008845 executorch:wav_loader.h:123] Number of Channels: 1
I 00:00:00.008846 executorch:wav_loader.h:124] Sample Rate: 16000
I 00:00:00.008848 executorch:wav_loader.h:125] Byte Rate: 32000
I 00:00:00.008849 executorch:wav_loader.h:126] Block Align: 2
I 00:00:00.008851 executorch:wav_loader.h:127] Bits per Sample: 16
I 00:00:00.008852 executorch:wav_loader.h:132] Subchunk2Size: 1476418
I 00:00:00.009391 executorch:wav_loader.h:226] Loaded 738209 audio samples from WAV file: /Users/matt/Documents/parakeet_test_audio/basketball.wav
I 00:00:00.009403 executorch:main.cpp:736] Loaded 738209 audio samples
I 00:00:00.009432 executorch:main.cpp:747] Running preprocessor...
I 00:00:00.033216 executorch:cpuinfo_utils.cpp:71] Reading file /sys/devices/soc0/image_version
I 00:00:00.033243 executorch:cpuinfo_utils.cpp:87] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.065268 executorch:main.cpp:772] Mel spectrogram shape: [1, 128, 4614], mel_len: 4613
I 00:00:00.065280 executorch:main.cpp:775] Running encoder...
I 00:01:19.425587 executorch:main.cpp:792] Encoder output shape: [1, 1024, 577], len=577
I 00:01:19.425620 executorch:main.cpp:833] Model metadata: vocab_size=8192, blank_id=8192, num_rnn_layers=2, pred_hidden=640, sample_rate=16000, window_stride=0.010000, encoder_subsampling_factor=8
I 00:01:19.425627 executorch:main.cpp:835] Running TDT greedy decode...
I 00:01:32.149805 executorch:main.cpp:845] Decoded 289 tokens
I 00:01:32.149823 executorch:main.cpp:848] Loading tokenizer from: /Users/matt/Workspace/executorch/examples/models/parakeet/parakeet_tdt_exports/tokenizer.model
E tokenizers:hf_tokenizer.cpp:82] Error parsing json file: [json.exception.parse_error.101] parse error at line 2, column 1: syntax error while parsing value - invalid literal; last read: '<U+000A><U+000E>'
E tokenizers:tiktoken.cpp:59] invalid tiktoken line: 
I 00:01:32.154983 executorch:llm_runner_helper.cpp:77] Loaded Sentencepiece tokenizer
Transcribed text: All right, ballerina for the Nuggets and Mavs tonight. Late fourth quarter. Joker to MPJ. Back to you. Joker spins, fakes, scores. Was tied at 118 with two minutes to go. Then the Mavs by two. Joker drives. He'll miss but clean up his own mess. Tied at 120. Then time running out. Jamal to MPJ. He drives, stops, and pops. He had 17. The Nuggets had the lead with 6.5 to go. Mavs had a last chance. Kyrie Irving scored 43 points, but didn't hit the game winner. Joker got it. The Nuggets win at 122-120. Incredible stat line for Jokic. One of a kind. 37 points, 18 rebounds, 15 assists. He is simply the best. And the Nuggets have won four in a row and six of their last seven.
I 00:01:32.155043 executorch:main.cpp:867] Computing timestamps...
I 00:01:32.155828 executorch:main.cpp:873] Derived supported_punctuation size=11

Segment timestamps:
0.56s - 2.88s : All right, ballerina for the Nuggets and Mavs tonight.
2.96s - 3.6s : Late fourth quarter.
3.68s - 4.88s : Joker to MPJ.
4.96s - 5.68s : Back to you.
5.92s - 7.92s : Joker spins, fakes, scores.
8.08s - 10.24s : Was tied at 118 with two minutes to go.
10.32s - 11.52s : Then the Mavs by two.
11.6s - 12.48s : Joker drives.
12.56s - 14.64s : He'll miss but clean up his own mess.
14.8s - 16.08s : Tied at 120.
16.24s - 17.44s : Then time running out.
17.68s - 19.44s : Jamal to MPJ.
20s - 22.08s : He drives, stops, and pops.
22.16s - 23.04s : He had 17.
23.12s - 25.44s : The Nuggets had the lead with 6.5 to go.
25.6s - 26.8s : Mavs had a last chance.
26.88s - 31.28s : Kyrie Irving scored 43 points, but didn't hit the game winner.
31.44s - 32.24s : Joker got it.
32.32s - 34.56s : The Nuggets win at 122-120.
34.8s - 36.48s : Incredible stat line for Jokic.
36.72s - 37.36s : One of a kind.
37.6s - 40.88s : 37 points, 18 rebounds, 15 assists.
41.04s - 42.24s : He is simply the best.
42.4s - 45.92s : And the Nuggets have won four in a row and six of their last seven.

Word timestamps:
0.56s - 0.72s : All
0.72s - 0.8s : right,
0.96s - 1.76s : ballerina
1.76s - 1.84s : for
1.84s - 1.92s : the
1.92s - 2.24s : Nuggets
2.24s - 2.32s : and
2.32s - 2.64s : Mavs
2.64s - 2.88s : tonight.
2.96s - 3.2s : Late
3.2s - 3.44s : fourth
3.44s - 3.6s : quarter.
3.68s - 4.16s : Joker
4.16s - 4.32s : to
4.32s - 4.88s : MPJ.
4.96s - 5.2s : Back
5.2s - 5.44s : to
5.44s - 5.68s : you.
5.92s - 6.48s : Joker
6.48s - 6.88s : spins,
6.96s - 7.36s : fakes,
7.52s - 7.92s : scores.
8.08s - 8.32s : Was
8.32s - 8.64s : tied
8.64s - 8.72s : at
8.72s - 9.36s : 118
9.36s - 9.52s : with
9.52s - 9.68s : two
9.68s - 9.92s : minutes
9.92s - 10s : to
10s - 10.24s : go.
10.32s - 10.56s : Then
10.56s - 10.64s : the
10.64s - 11.12s : Mavs
11.12s - 11.2s : by
11.28s - 11.52s : two.
11.6s - 12.16s : Joker
12.16s - 12.48s : drives.
12.56s - 12.96s : He'll
12.96s - 13.2s : miss
13.28s - 13.44s : but
13.44s - 13.68s : clean
13.84s - 13.92s : up
13.92s - 14.08s : his
14.08s - 14.32s : own
14.32s - 14.64s : mess.
14.8s - 15.2s : Tied
15.2s - 15.36s : at
15.36s - 16.08s : 120.
16.24s - 16.48s : Then
16.48s - 16.72s : time
16.88s - 17.28s : running
17.28s - 17.44s : out.
17.68s - 18.4s : Jamal
18.4s - 18.72s : to
18.72s - 19.44s : MPJ.
20s - 20.24s : He
20.24s - 20.72s : drives,
20.96s - 21.28s : stops,
21.36s - 21.6s : and
21.6s - 22.08s : pops.
22.16s - 22.32s : He
22.4s - 22.64s : had
22.64s - 23.04s : 17.
23.12s - 23.28s : The
23.28s - 23.68s : Nuggets
23.68s - 23.84s : had
23.84s - 23.92s : the
23.92s - 24.24s : lead
24.24s - 24.4s : with
24.4s - 25.12s : 6.5
25.12s - 25.28s : to
25.28s - 25.44s : go.
25.6s - 26.16s : Mavs
26.16s - 26.32s : had
26.32s - 26.4s : a
26.4s - 26.56s : last
26.56s - 26.8s : chance.
26.88s - 27.36s : Kyrie
27.52s - 28s : Irving
28.16s - 28.56s : scored
28.56s - 29.2s : 43
29.2s - 29.6s : points,
29.76s - 30.08s : but
30.08s - 30.48s : didn't
30.48s - 30.64s : hit
30.64s - 30.72s : the
30.72s - 30.96s : game
30.96s - 31.28s : winner.
31.44s - 32s : Joker
32s - 32.16s : got
32.16s - 32.24s : it.
32.32s - 32.48s : The
32.48s - 32.88s : Nuggets
32.88s - 33.12s : win
33.12s - 33.28s : at
33.28s - 34.56s : 122-120.
34.8s - 35.36s : Incredible
35.36s - 35.6s : stat
35.6s - 35.84s : line
35.84s - 36s : for
36s - 36.48s : Jokic.
36.72s - 36.88s : One
36.88s - 37.04s : of
37.04s - 37.12s : a
37.12s - 37.36s : kind.
37.6s - 38.16s : 37
38.16s - 38.56s : points,
38.72s - 39.2s : 18
39.2s - 39.68s : rebounds,
39.84s - 40.32s : 15
40.32s - 40.88s : assists.
41.04s - 41.2s : He
41.2s - 41.44s : is
41.44s - 41.76s : simply
41.76s - 42s : the
42s - 42.24s : best.
42.4s - 42.56s : And
42.56s - 42.64s : the
42.64s - 43.12s : Nuggets
43.12s - 43.28s : have
43.28s - 43.52s : won
43.52s - 43.76s : four
43.76s - 43.92s : in
43.92s - 44s : a
44s - 44.24s : row
44.24s - 44.56s : and
44.56s - 44.96s : six
44.96s - 45.2s : of
45.2s - 45.36s : their
45.36s - 45.6s : last
45.6s - 45.92s : seven.

Token timestamps:
0.56s - 0.72s : All
0.72s - 0.8s : right
0.8s - 0.8s : ,
0.96s - 1.12s : b
1.12s - 1.28s : all
1.28s - 1.36s : er
1.6s - 1.76s : ina
1.76s - 1.84s : for
1.84s - 1.92s : the
1.92s - 2s : N
2s - 2.08s : ug
2.08s - 2.16s : g
2.16s - 2.24s : ets
2.24s - 2.32s : and
2.32s - 2.4s : M
2.4s - 2.48s : av
2.48s - 2.64s : s
2.64s - 2.72s : ton
2.72s - 2.88s : ight
2.88s - 2.88s : .
2.96s - 3.04s : L
3.04s - 3.2s : ate
3.2s - 3.28s : fo
3.28s - 3.36s : urt
3.36s - 3.44s : h
3.44s - 3.52s : quar
3.52s - 3.6s : ter
3.6s - 3.6s : .
3.68s - 3.84s : J
3.84s - 4s : ok
4s - 4.16s : er
4.16s - 4.32s : to
4.32s - 4.48s : M
4.48s - 4.64s : P
4.64s - 4.88s : J
4.88s - 4.88s : .
4.96s - 5.04s : B
5.04s - 5.2s : ack
5.2s - 5.44s : to
5.44s - 5.68s : you
5.68s - 5.68s : .
5.92s - 6.16s : J
6.16s - 6.32s : ok
6.32s - 6.48s : er
6.48s - 6.72s : sp
6.72s - 6.88s : ins
6.88s - 6.88s : ,
6.96s - 7.12s : fak
7.28s - 7.36s : es
7.36s - 7.36s : ,
7.52s - 7.76s : sc
7.76s - 7.92s : ores
7.92s - 7.92s : .
8.08s - 8.32s : Was
8.32s - 8.64s : tied
8.64s - 8.72s : at
8.72s - 8.8s : 
8.8s - 8.88s : 1
8.96s - 9.04s : 1
9.12s - 9.36s : 8
9.36s - 9.52s : with
9.52s - 9.68s : two
9.68s - 9.76s : minut
9.76s - 9.92s : es
9.92s - 10s : to
10s - 10.24s : go
10.24s - 10.24s : .
10.32s - 10.4s : T
10.4s - 10.56s : hen
10.56s - 10.64s : the
10.64s - 10.8s : M
10.8s - 10.96s : av
10.96s - 11.12s : s
11.12s - 11.2s : by
11.28s - 11.52s : two
11.52s - 11.52s : .
11.6s - 11.76s : J
11.76s - 12s : ok
12s - 12.16s : er
12.16s - 12.24s : d
12.24s - 12.32s : ri
12.32s - 12.48s : ves
12.48s - 12.48s : .
12.56s - 12.72s : He
12.72s - 12.72s : '
12.8s - 12.96s : ll
12.96s - 13.2s : miss
13.28s - 13.44s : but
13.44s - 13.52s : c
13.52s - 13.6s : le
13.6s - 13.68s : an
13.84s - 13.92s : up
13.92s - 14.08s : his
14.08s - 14.16s : o
14.16s - 14.32s : wn
14.32s - 14.48s : m
14.48s - 14.64s : ess
14.64s - 14.64s : .
14.8s - 14.96s : T
14.96s - 15.2s : ied
15.2s - 15.36s : at
15.36s - 15.44s : 
15.44s - 15.6s : 1
15.68s - 15.92s : 2
15.92s - 16.08s : 0
16.08s - 16.08s : .
16.24s - 16.32s : T
16.32s - 16.48s : hen
16.48s - 16.72s : time
16.88s - 17.04s : run
17.04s - 17.28s : ning
17.28s - 17.44s : out
17.44s - 17.44s : .
17.68s - 17.84s : J
17.84s - 18.08s : am
18.08s - 18.4s : al
18.4s - 18.72s : to
18.72s - 18.88s : M
18.88s - 19.04s : P
19.2s - 19.44s : J
19.44s - 19.44s : .
20s - 20.24s : He
20.24s - 20.4s : d
20.4s - 20.64s : ri
20.64s - 20.72s : ves
20.72s - 20.72s : ,
20.96s - 21.12s : stop
21.12s - 21.28s : s
21.28s - 21.28s : ,
21.36s - 21.6s : and
21.6s - 21.84s : po
21.84s - 22.08s : ps
22.08s - 22.08s : .
22.16s - 22.32s : He
22.4s - 22.64s : had
22.64s - 22.72s : 
22.72s - 22.88s : 1
22.88s - 23.04s : 7
23.04s - 23.04s : .
23.12s - 23.28s : The
23.28s - 23.36s : N
23.36s - 23.44s : ug
23.44s - 23.6s : g
23.6s - 23.68s : ets
23.68s - 23.84s : had
23.84s - 23.92s : the
23.92s - 24.08s : le
24.08s - 24.24s : ad
24.24s - 24.4s : with
24.4s - 24.48s : 
24.48s - 24.72s : 6
24.72s - 24.72s : .
24.88s - 25.12s : 5
25.12s - 25.28s : to
25.28s - 25.44s : go
25.44s - 25.44s : .
25.6s - 25.76s : M
25.76s - 26s : av
26s - 26.16s : s
26.16s - 26.32s : had
26.32s - 26.4s : a
26.4s - 26.56s : last
26.56s - 26.64s : ch
26.64s - 26.8s : ance
26.8s - 26.8s : .
26.88s - 27.04s : K
27.04s - 27.28s : y
27.28s - 27.36s : rie
27.52s - 27.76s : Ir
27.76s - 27.92s : v
27.92s - 28s : ing
28.16s - 28.32s : sc
28.32s - 28.4s : or
28.4s - 28.56s : ed
28.56s - 28.72s : 
28.72s - 28.88s : 4
28.88s - 29.2s : 3
29.2s - 29.28s : po
29.28s - 29.44s : in
29.44s - 29.6s : ts
29.6s - 29.6s : ,
29.76s - 30.08s : but
30.08s - 30.24s : did
30.24s - 30.32s : n
30.32s - 30.32s : '
30.48s - 30.48s : t
30.48s - 30.64s : hit
30.64s - 30.72s : the
30.72s - 30.8s : g
30.8s - 30.96s : ame
30.96s - 31.04s : w
31.04s - 31.2s : inn
31.2s - 31.28s : er
31.28s - 31.28s : .
31.44s - 31.68s : J
31.68s - 31.84s : ok
31.84s - 32s : er
32s - 32.16s : got
32.16s - 32.24s : it
32.24s - 32.24s : .
32.32s - 32.48s : The
32.48s - 32.56s : N
32.56s - 32.64s : ug
32.64s - 32.8s : g
32.8s - 32.88s : ets
32.88s - 32.96s : w
32.96s - 33.12s : in
33.12s - 33.28s : at
33.28s - 33.36s : 
33.36s - 33.52s : 1
33.52s - 33.76s : 2
33.76s - 33.92s : 2
33.92s - 33.92s : -
34s - 34.08s : 1
34.24s - 34.4s : 2
34.4s - 34.56s : 0
34.56s - 34.56s : .
34.8s - 34.88s : In
34.88s - 34.96s : c
34.96s - 35.04s : re
35.04s - 35.12s : di
35.12s - 35.36s : ble
35.36s - 35.6s : stat
35.6s - 35.68s : l
35.68s - 35.84s : ine
35.84s - 36s : for
36s - 36.16s : J
36.16s - 36.32s : ok
36.32s - 36.48s : ic
36.48s - 36.48s : .
36.72s - 36.8s : O
36.8s - 36.88s : ne
36.88s - 37.04s : of
37.04s - 37.12s : a
37.12s - 37.36s : kind
37.36s - 37.36s : .
37.6s - 37.68s : 
37.68s - 37.84s : 3
37.92s - 38.16s : 7
38.16s - 38.32s : po
38.32s - 38.4s : in
38.4s - 38.56s : ts
38.56s - 38.56s : ,
38.72s - 38.8s : 
38.8s - 38.96s : 1
38.96s - 39.2s : 8
39.2s - 39.36s : re
39.36s - 39.44s : bo
39.44s - 39.6s : und
39.6s - 39.68s : s
39.68s - 39.68s : ,
39.84s - 39.92s : 
39.92s - 40.08s : 1
40.08s - 40.32s : 5
40.32s - 40.48s : ass
40.48s - 40.64s : ist
40.64s - 40.88s : s
40.88s - 40.88s : .
41.04s - 41.2s : He
41.2s - 41.44s : is
41.44s - 41.6s : simpl
41.6s - 41.76s : y
41.76s - 42s : the
42s - 42.24s : best
42.24s - 42.24s : .
42.4s - 42.56s : And
42.56s - 42.64s : the
42.64s - 42.72s : N
42.72s - 42.88s : ug
42.88s - 43.04s : g
43.04s - 43.12s : ets
43.12s - 43.28s : have
43.28s - 43.36s : w
43.36s - 43.52s : on
43.52s - 43.6s : fo
43.6s - 43.76s : ur
43.76s - 43.92s : in
43.92s - 44s : a
44s - 44.08s : ro
44.08s - 44.24s : w
44.24s - 44.56s : and
44.56s - 44.72s : si
44.72s - 44.96s : x
44.96s - 45.2s : of
45.2s - 45.36s : their
45.36s - 45.6s : last
45.6s - 45.76s : se
45.76s - 45.92s : ven
45.92s - 45.92s : .
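
The two dumps differ in float formatting (for example, NeMo prints `16.080000000000002s` where the runner prints `16.08s`), so a plain textual diff would report mismatches even when the values agree. A minimal sketch of a tolerant comparison follows. It assumes each dump has been saved as lines of the form `<start>s - <end>s : <text>`; `parse_stamps` and `same_stamps` are hypothetical helper names, not part of either tool.

```python
import re
from typing import List, Tuple

Stamp = Tuple[float, float, str]

# Matches "14.8s - 16.08s : Tied at 120." and also integer starts such as "20s".
LINE_RE = re.compile(r"^(\d+(?:\.\d+)?)s - (\d+(?:\.\d+)?)s :\s?(.*)$")


def parse_stamps(lines: List[str]) -> List[Stamp]:
    """Extract (start, end, text) from timestamp lines, skipping anything else."""
    stamps: List[Stamp] = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m:
            stamps.append((float(m.group(1)), float(m.group(2)), m.group(3)))
    return stamps


def same_stamps(a: List[Stamp], b: List[Stamp], tol: float = 1e-6) -> bool:
    """True when both lists have equal text and times within `tol` seconds."""
    if len(a) != len(b):
        return False
    return all(
        abs(s1 - s2) <= tol and abs(e1 - e2) <= tol and t1 == t2
        for (s1, e1, t1), (s2, e2, t2) in zip(a, b)
    )


nemo = parse_stamps([
    "14.8s - 16.080000000000002s : Tied at 120.",
    "20.0s - 22.080000000000002s : He drives, stops, and pops.",
])
runner = parse_stamps([
    "14.8s - 16.08s : Tied at 120.",
    "20s - 22.08s : He drives, stops, and pops.",
])
print(same_stamps(nemo, runner))
```

This applies directly to the segment and word levels. NeMo's char-level lines wrap the text in a Python list (`['All']`), so those would need unwrapping before comparison with the runner's token lines.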

@pytorch-bot

pytorch-bot bot commented Jan 12, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16545

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 55 Pending

As of commit 7ada1eb with merge base dbf3c37 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 12, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@mergennachin
Contributor

@mattjcly please rebase on top of #16323

@mattjcly mattjcly force-pushed the matt/parakeet-timestamps-new-tokenizer-method branch from 831f714 to 0c9768d Compare January 13, 2026 14:57
  if (!num_rnn_layers_result.ok() || !pred_hidden_result.ok() ||
      !vocab_size_result.ok() || !blank_id_result.ok() ||
-     !sample_rate_result.ok()) {
+     !sample_rate_result.ok() || !window_stride_result.ok() ||
Contributor Author

Note that this will break compatibility with previously exported parakeet models. I chose to do this because we are early in development, and to avoid having to make a separate path that allows everything but timestamps if the new metadata isn't present.

Open to doing such a thing if reviewers feel strongly.

Contributor

Yeah, I don't mind breaking backward compatibility at this early stage.

And it's in examples anyway, not in a core directory (like /extensions).
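
For context on why the extra metadata matters: every timestamp in the test plan is a multiple of 0.08 s, which matches `window_stride` (0.01 s) times `encoder_subsampling_factor` (8) from the logged model metadata. A small Python sketch of that conversion follows, for illustration only. `seconds_per_frame` and `frame_to_seconds` are hypothetical helpers, and the frame indices are back-computed from the printed output rather than taken from the runner.

```python
def seconds_per_frame(window_stride: float, encoder_subsampling_factor: int) -> float:
    """Duration covered by one encoder output frame, in seconds."""
    return window_stride * encoder_subsampling_factor


def frame_to_seconds(
    frame_index: int, window_stride: float, encoder_subsampling_factor: int
) -> float:
    """Convert an encoder frame index to a time offset in seconds."""
    return frame_index * seconds_per_frame(window_stride, encoder_subsampling_factor)


# Values from the "Model metadata" log line in the test plan.
WINDOW_STRIDE = 0.01
ENCODER_SUBSAMPLING_FACTOR = 8

# Frame indices chosen to reproduce times seen in the output (assumed, not logged).
for frame in (7, 36, 574):
    t = frame_to_seconds(frame, WINDOW_STRIDE, ENCODER_SUBSAMPLING_FACTOR)
    print(f"frame {frame} -> {t:.2f}s")
```

Without both values in the exported model's metadata, a runner would have to hard-code the frame duration, which is presumably why the check above now requires them.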

Contributor

@mergennachin mergennachin left a comment

Thoughts on extracting out the abstraction for TimestampExtractor into its own file that could be used for other models in the future. The main runner is becoming pretty large now.


if (!word.empty() && (ends_with_delimiter || is_delimiter_word)) {
  segment_words.push_back(word);
  if (!segment_words.empty()) {
Contributor

This seems like a redundant check right after the push_back.
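
To illustrate the point: once a word has just been appended, the list cannot be empty, so the segment can be flushed unconditionally at that spot. The Python sketch below shows the grouping idea only; it is not the PR's C++ code. `group_segments`, the tuple layout, and the delimiter set are assumptions, and the `is_delimiter_word` case from the snippet is left out. Word timings come from the basketball.wav word output.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]      # (start, end, text)
Segment = Tuple[float, float, str]   # (start, end, text)

SEGMENT_DELIMITERS = (".", "?", "!")  # assumed set of sentence-ending marks


def _flush(words: List[Word]) -> Segment:
    """Build one segment spanning the first word's start to the last word's end."""
    return (words[0][0], words[-1][1], " ".join(w[2] for w in words))


def group_segments(words: List[Word]) -> List[Segment]:
    segments: List[Segment] = []
    current: List[Word] = []
    for word in words:
        current.append(word)
        if word[2].endswith(SEGMENT_DELIMITERS):
            # `current` holds at least `word`, so no emptiness check is needed here.
            segments.append(_flush(current))
            current = []
    if current:
        # Trailing words with no closing delimiter: here the check does matter.
        segments.append(_flush(current))
    return segments


WORDS = [
    (4.96, 5.2, "Back"), (5.2, 5.44, "to"), (5.44, 5.68, "you."),
    (5.92, 6.48, "Joker"), (6.48, 6.88, "spins,"),
    (6.96, 7.36, "fakes,"), (7.52, 7.92, "scores."),
]

for start, end, text in group_segments(WORDS):
    print(f"{start}s - {end}s : {text}")
```

The only place an emptiness check earns its keep in this sketch is the final flush after the loop, where trailing words may or may not exist.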

@mattjcly
Contributor Author

mattjcly commented Jan 13, 2026

> Thoughts on extracting out the abstraction for TimestampExtractor into its own file that could be used for other models in the future. The main runner is becoming pretty large now.

Made the following adjustments:

I decided against a TimestampExtractor class/abstraction for the time being and instead went for "modules of free functions" because I didn't see anything to require statefulness and I didn't want to make an over-abstraction/over-generalization for something that is only used by parakeet at the moment.

I think this walks an appropriate middle ground for the current point in development: modularized code that can be easily extended if/when use with other models arises, while avoiding too deep an abstraction that may not work for other models. This is also why these utils are in namespaces parakeet::timestamp_utils and parakeet::tokenizer_utils for now, but can easily be moved to a more common place as needed in the future.

This is also nice because the NeMo ported code remains almost trivially mappable to the original NeMo files/methods.

@mergennachin does this seem reasonable to you? Open to discussion if you still feel there is a better approach.

Contributor

@mergennachin mergennachin left a comment

See inline comment

| `--audio_path` | Path to input audio file (.wav) |
| `--tokenizer_path` | Path to tokenizer file (default: `tokenizer.json`) |
| `--data_path` | Path to data file (.ptd) for delegate data (optional, required for CUDA) |
| `--timestamps` | Timestamp output mode: `none\|token\|word\|segment\|all` |
Contributor

Can we default to one of the options?

Contributor Author

Sure, I defaulted to "segment" because the others can get pretty verbose in output.

Contributor

Can you say in README.md that segment is the default?

Contributor Author

Yes :)
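
A sketch of how such a mode string might map to the set of levels to print, with `segment` as the default agreed in this thread. This uses Python's `argparse` purely for illustration; the runner is C++ and its actual flag handling is not shown here. `LEVELS`, `MODES`, `levels_for`, `build_parser`, and `parse_levels` are hypothetical names.

```python
import argparse
from typing import List, Optional, Set

LEVELS = ("token", "word", "segment")
MODES = ("none",) + LEVELS + ("all",)


def levels_for(mode: str) -> Set[str]:
    """Map a --timestamps mode to the set of timestamp levels to emit."""
    if mode == "none":
        return set()
    if mode == "all":
        return set(LEVELS)
    if mode in LEVELS:
        return {mode}
    raise ValueError(f"unknown timestamps mode: {mode!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="parakeet_runner_sketch")
    parser.add_argument(
        "--timestamps",
        choices=MODES,
        default="segment",  # default discussed in the review thread
        help="Timestamp output mode: none|token|word|segment|all",
    )
    return parser


def parse_levels(argv: Optional[List[str]] = None) -> Set[str]:
    """Parse argv (or sys.argv when None) and return the levels to print."""
    args = build_parser().parse_args(argv)
    return levels_for(args.timestamps)


print(sorted(parse_levels([])))
print(sorted(parse_levels(["--timestamps", "all"])))
```

Restricting the flag to a fixed list of choices means a typo fails at parse time instead of silently producing no timestamps.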


namespace parakeet::tokenizer_utils {

std::unordered_set<std::string> derive_supported_punctuation(
Contributor

Thoughts on moving this to timestamp_utils?

It's exclusively called once in main.cpp right before timestamp computation and passed to timestamp functions.

Contributor Author

I personally think it fits a bit better in tokenizer_utils, since logically it only touches the tokenizer and does nothing timestamp related. Additionally, it is in tokenizer_utils.py in the NeMo reference implementation.

Contextually though, I agree it is only needed in the parakeet_runner when we compute timestamps. Sort of a growing pain of some level of abstraction/modularization before it is needed elsewhere?

I personally would leave it, but can move it if you feel it's important or a blocker. I don't feel strongly enough about it to block on it.
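
For readers unfamiliar with the helper under discussion, here is a rough Python sketch of deriving a punctuation set from a tokenizer vocabulary. It assumes a Unicode-category test applied to every character of every piece; the PR's C++ `derive_supported_punctuation` may differ in detail, and the toy vocabulary below is made up for the example.

```python
import unicodedata
from typing import Iterable, Set


def derive_supported_punctuation(pieces: Iterable[str]) -> Set[str]:
    """Collect every character whose Unicode general category is punctuation (P*)."""
    punctuation: Set[str] = set()
    for piece in pieces:
        for ch in piece:
            if unicodedata.category(ch).startswith("P"):
                punctuation.add(ch)
    return punctuation


# Made-up vocabulary; "\u2581" is the SentencePiece word-boundary marker,
# which is a symbol (category So), not punctuation.
TOY_VOCAB = ["\u2581All", "\u2581right", ",", ".", "'", "-", "\u2581b", "ina", "?"]

print(sorted(derive_supported_punctuation(TOY_VOCAB)))
```

Because the function only reads vocabulary pieces, it has no dependency on timestamp logic, which is the argument made above for keeping it next to the tokenizer helpers.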

@mergennachin mergennachin merged commit 9510334 into pytorch:main Jan 13, 2026
140 checks passed
@mattjcly mattjcly deleted the matt/parakeet-timestamps-new-tokenizer-method branch January 13, 2026 23:05