Some SRT segments are way too big - Silero VAD problem #470

ei23fxg · 2025-01-22T14:29:35Z

Which OS are you using?

Linux Debian, running the Docker Container with cuda
insanely fast whisper
default parameters
model: whisper-large-v3-turbo

When I create a timestamp file, I sometimes get very long segments that are more or less useless as subtitles.
I'm not entirely sure whether this has always been the case or whether it came with an update. I suspect it has something to do with the Silero VAD filter.
If the Silero VAD filter is disabled, the problem definitely occurs frequently.
I have already tested some settings for the Silero VAD filter. This changes the results to some extent, but I have not been able to achieve the desired result.
I'm sure I'm doing something wrong, but others will surely run into this issue. It would be useful to mention some best practices for this in the documentation.

large-v3 has the same issue.

Below is an SRT example for illustration purposes.

...
9
00:02:51,580 --> 00:07:05,080
Pellentesque bibendum mollis fames maecenas quis in sapien. Ridiculus placerat interdum rutrum auctor posuere aliquet viverra vivamus. Dis neque fermentum dapibus lobortis tempus ut. Ultricies phasellus posuere nisl magna suspendisse semper. Eleifend ipsum enim et orci; ridiculus in nam donec placerat. Dapibus integer sapien mi massa cubilia.
Malesuada fermentum venenatis, blandit justo magna condimentum at? Ac nibh aliquet sodales nullam praesent euismod. Montes class congue donec sodales, nullam eget sem. Facilisis sociosqu aptent volutpat habitant elementum finibus condimentum. Lacus sit praesent id ex accumsan sed in erat. Erat finibus egestas ex ullamcorper cras mauris nibh laoreet. Dapibus ut mollis dapibus nulla lacinia mattis. Viverra sapien cras netus maximus orci. Interdum metus et integer mauris pulvinar gravida.
Rutrum quis est ultrices faucibus semper nibh pellentesque scelerisque. Proin arcu parturient praesent ligula eros dis consequat primis felis. Vitae himenaeos dui at conubia praesent etiam. Auctor platea eleifend ante sed nec euismod dapibus ante dignissim. Fames velit pretium fames placerat felis turpis commodo! Senectus tempus torquent commodo nisl laoreet consequat. Montes est magna auctor; accumsan non pulvinar consequat. Erat tellus torquent porttitor duis laoreet volutpat dictumst. Porttitor dignissim at etiam lacus morbi.

10
00:07:05,560 --> 00:07:10,180
Lorem ipsum odor amet, consectetuer adipiscing elit. Pellentesque dictum

11
00:07:10,520 --> 00:07:15,040
Erat finibus egestas ex ullamcorper cras mauris nibh laoreet
...
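To make the symptom easier to spot in a generated file, oversized cues can be flagged automatically. The following is only a minimal sketch: the function name `find_long_segments` and the 10-second threshold are assumptions chosen for illustration, not part of Whisper-WebUI.

```python
import re
from datetime import timedelta

# Matches an SRT timing line such as "00:02:51,580 --> 00:07:05,080".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def find_long_segments(srt_text, max_seconds=10.0):
    """Return (timing_line, duration_seconds) for every cue longer than max_seconds.

    max_seconds is a hypothetical threshold; pick whatever suits your subtitles.
    """
    long_segments = []
    for match in TIMING.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
        start = timedelta(hours=h1, minutes=m1, seconds=s1, milliseconds=ms1)
        end = timedelta(hours=h2, minutes=m2, seconds=s2, milliseconds=ms2)
        duration = (end - start).total_seconds()
        if duration > max_seconds:
            long_segments.append((match.group(0), duration))
    return long_segments


# Timings taken from the example above; the cue text is shortened.
sample = """9
00:02:51,580 --> 00:07:05,080
(long text)

10
00:07:05,560 --> 00:07:10,180
(short text)
"""

for timing, duration in find_long_segments(sample):
    print(f"{timing} lasts {duration:.1f} s")
```

Running this over a whole SRT file gives a quick count of problematic cues, which makes it easier to compare results between settings or builds.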


jhj0517 · 2025-01-22T14:57:15Z

Hi. Could you try the VAD again with these settings in the "Voice Detection Filter" tab:

And see if it still happens?

ei23fxg · 2025-01-22T15:05:47Z

> Hi. Could you try the VAD again with these settings in the "Voice Detection Filter" tab:
>
> And see if it still happens?

Hi, thanks for the fast reply.

Yes, I tried that.
Activating or deactivating the VAD filter makes no difference.
Threshold and speech padding made some difference, but I guess that's to be expected.
Same behavior for large-v3 (but a different result).

ei23fxg · 2025-01-22T15:34:28Z

I found an old subtitle file that I had generated, I think with large-v3, back on 2024-11-25.

This is what it looked like back then:


1
00:00:00,000 --> 00:00:03,940
Hey everyone, ready to dive into something really mind-bending?

2
00:00:04,139 --> 00:00:05,240
Always up for a challenge.

3
00:00:06,320 --> 00:00:15,380
Today we're going to explore this wild idea that coherence might be like the secret ingredient

4
00:00:15,380 --> 00:00:20,839
to intelligence, consciousness, and maybe even like the universe itself.

5
00:00:20,899 --> 00:00:21,339
I love it.

6
00:00:21,399 --> 00:00:22,160
Pretty ambitious, right?

7
00:00:22,280 --> 00:00:22,480
Yeah.

8
00:00:22,600 --> 00:00:27,219
But we're not just theorizing here, we've got some seriously cool source material to work with.

9
00:00:27,280 --> 00:00:27,480
Yeah.

10
00:00:27,559 --> 00:00:31,859
So this researcher, David Shapiro, had this amazing conversation with, get this, Claude.

11
00:00:31,960 --> 00:00:34,979
One of Anthropik's, like, star large language models.

12
00:00:35,060 --> 00:00:35,420
Exactly.

13
00:00:35,619 --> 00:00:36,659
And they're back and forth.

And this is what it looks like today:

1
00:00:00,000 --> 00:02:06,340
Hey everyone, ready to dive into something really mind-bending? Always up for a challenge. Today we're going to explore this wild idea that coherence might be like the secret ingredient to intelligence, consciousness, and maybe even like the universe itself. I love it. Pretty ambitious, right? Yeah. But we're not just theorizing here, we've got some seriously cool source material to work with. Yeah. So this researcher, David Shapiro, had this amazing conversation with, get this, Claude. One of Anthropics, like, star large language models. Exactly. And they're back and forth. It's not just like techie jargon. No, no. It's like philosophy meets AI. Yeah. They go deep on what makes us, you know, us. What is the essence of being human? Exactly. And by the end of this deep dive, I guarantee you'll see things in a whole new way. Totally. Evolution, politics, even your own decisions, everything will look different. So where do we even start with this coherence thing? Like, what are we even talking about? Okay, good question. So Claude, the AI, describes coherence as this like fundamental drive towards stable, organized patterns. And this kind of aversion to anything that like doesn't fit. So it's like our brains are always trying to make sense of the world, find the order in the chaos. Right. OK, I can see that. But where does this actually show up? Like in real life? Well, think about evolution for a second. The species and adaptations that survive are the ones that, you know, fit their environment best, the ones that are most coherent with their surroundings. So it's like survival of the fittest, but fittest actually means most coherent. Exactly. And this coherence thing, it's not just biology. Claude argues it's like fundamental to how we think too. Really? How so? I'm intrigued. Okay. Think about political ideologies, religions, even your personal philosophy. They're all trying to build this like complete internally consistent model of reality. 
So even when ideologies clash, they're each striving for coherence within their own little bubble. Yeah, precisely. And this clash of coherent systems, it gets even bigger when you think about geopolitics.

2
00:02:06,920 --> 00:02:46,420
Nations forming blocks, vying for power, trying to create stability in their own spheres of influence. Wow. So from tiny organisms evolving to like global power struggles. Yeah. It's all about this drive for coherence. It's pretty mind blowing, right? Yeah, it is. And then get this, Claude suggests that even consciousness itself could come from this same drive. Whoa. Okay, elaborate on that. So picture consciousness as this high-level process that's constantly trying to create a unified model of you, of your experiences, your place in the world, always trying to make sense of everything. So my brain right now trying to wrap itself around this concept is actually demonstrating coherence in action.

3
00:02:46,960 --> 00:02:47,440
That's meta.

Same file.

jhj0517 · 2025-01-22T17:04:29Z

It seems that this is an issue with faster-whisper's VAD implementation, which is also tracked in other issues.

According to #396 (comment), whisperx's VAD implementation gives better results, so I'm planning to add it later.

Any PR or suggestions about the VAD would be welcome.

ei23fxg · 2025-01-22T17:11:38Z

All right, but I'm using insanely_fast_whisper.
So is it the same there?

jhj0517 · 2025-01-22T17:19:40Z

Yeah, it should be the same regardless of the implementation. The main difference between implementations is speed and VRAM efficiency; the result should be the same if you use the same model.

I hope using whisperX's VAD implementation will improve this kind of problem.

And if there is any hallucination, I would recommend using large-v2 over large-v3 because it tends to be more robust against noisy audio - #152 (comment)

ei23fxg · 2025-01-22T17:25:41Z

OK, but I just tried faster-whisper, and it looks better there with large-v3-turbo.
Close to the original... at the cost of a slower runtime.

1
00:00:00,000 --> 00:00:04,000
Hey, everyone. Ready to dive into something really mind-bending?

2
00:00:04,180 --> 00:00:05,300
Always up for a challenge.

3
00:00:06,400 --> 00:00:20,900
Today, we're going to explore this wild idea that coherence might be like the secret ingredient to intelligence, consciousness, and maybe even like the universe itself.

4
00:00:20,960 --> 00:00:21,400
I love it.

5
00:00:21,460 --> 00:00:22,220
Pretty ambitious, right?

6
00:00:22,340 --> 00:00:22,540
Yeah.

7
00:00:22,660 --> 00:00:27,280
But we're not just theorizing here. We've got some seriously cool source material to work with.

8
00:00:27,280 --> 00:00:31,960
Yeah. So this researcher, David Shapiro, had this amazing conversation with, get this, Claude.

9
00:00:32,020 --> 00:00:35,060
One of Anthropik's, like, star large language models.

10
00:00:35,140 --> 00:00:38,540
Exactly. And they're back and forth. It's not just like techie jargon.

11
00:00:38,680 --> 00:00:38,960
No, no.

12
00:00:39,000 --> 00:00:40,620
It's like philosophy meets AI.

13
00:00:41,040 --> 00:00:43,420
Yeah. They go deep on what makes us, you know, us.

jhj0517 · 2025-01-22T17:52:16Z

Hmm. A difference in parameters probably caused that. For example, openai/whisper uses beam_size 1 by default and faster-whisper uses 5 by default.

ei23fxg · 2025-01-22T19:16:27Z

No, that is not the case. I tried some beam size settings:
1-10: no change
15 as a sample: also no change

Of course, my Docker updates did not happen at the same time as the updates in the repository, but I can say that the problem became visible to me around 17 January 2025.

Since the Docker container has no versioning, I cannot identify an exact version.

jhj0517 · 2025-01-23T06:08:03Z

That was just an example. Since insanely-fast-whisper uses transformers' AutomaticSpeechRecognitionPipeline implementation, it uses some different parameters than faster-whisper.

If this suddenly started happening with insanely-fast-whisper, I guess there were some updates to transformers.
The latest version of transformers was released two days ago, so I should look into whether the transformers update causes this.

ei23fxg · 2025-01-23T08:24:36Z

It must have happened between 01/01/2025 and 17/01/2025.

ei23fxg · 2025-01-24T17:49:43Z

I just checked out the repository at ad418ca and built the Docker image myself.
At that point, I am very sure, the error was not present.
But rebuilding the older version of the image doesn't resolve the issue.

So there must be some kind of external dependency that causes the problem.

Maybe it would make sense to switch to GHCR and use tags / image versioning.
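One way to narrow down a drifting external dependency is to save the output of `pip freeze` from a working and a broken container and compare the two. The snippet below is a rough sketch of that comparison; `parse_freeze`, `diff_freeze`, and the package names and versions in the sample are made up for illustration.

```python
def parse_freeze(text):
    """Parse 'name==version' lines (as printed by `pip freeze`) into a dict."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip blanks, comments, and entries without an exact pin
        name, version = line.split("==", 1)
        versions[name.lower()] = version
    return versions


def diff_freeze(old_text, new_text):
    """Return {package: (old_version, new_version)} for every package that differs.

    A version of None means the package is missing from that snapshot.
    """
    old, new = parse_freeze(old_text), parse_freeze(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changes[name] = (old.get(name), new.get(name))
    return changes


# Illustrative snapshots only; these are not real package names or versions.
old_snapshot = "package-a==1.2.0\npackage-b==2.0.0\n"
new_snapshot = "package-a==1.3.0\npackage-b==2.0.0\n"

for name, (before, after) in diff_freeze(old_snapshot, new_snapshot).items():
    print(f"{name}: {before} -> {after}")
```

Any package that shows up in the diff without being pinned in the requirements file is a candidate for the regression.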

ei23fxg · 2025-01-26T00:36:34Z

> That was just an example. Since insanely-fast-whisper uses transformers' AutomaticSpeechRecognitionPipeline implementation, it uses some different parameters than faster-whisper.
>
> If this suddenly started happening with insanely-fast-whisper, I guess there were some updates to transformers. The latest version of transformers was released two days ago, so I should look into whether the transformers update causes this.

Yeah, that's it.
I changed the requirements to transformers==4.47.1 and built the Docker image.
That works for now with insanely-fast-whisper.
transformers==4.48.0 or above, however, does not.

Shall I open a PR for that, @jhj0517?
Maybe a better fix is needed, though.
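For anyone who wants to check an existing environment against the versions reported here, a small script along these lines could help. It is a sketch, not part of the project: the helper names are hypothetical, and the 4.48.0 boundary comes only from the observation above, so later transformers releases may behave differently.

```python
from importlib import metadata

# First transformers release reported as problematic in this thread (assumption:
# everything at or above it is treated as affected, per the observation above).
FIRST_AFFECTED = (4, 48, 0)


def parse_version(version):
    """Turn '4.47.1' into (4, 47, 1); suffixes such as '.dev0' or 'rc1' are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_affected(version):
    """True if the version is at or above the first release reported as broken."""
    return parse_version(version) >= FIRST_AFFECTED


def check_installed():
    """Describe the installed transformers version, if any."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    status = "may be affected" if is_affected(installed) else "is below 4.48.0"
    return f"transformers {installed} {status}"


print(check_installed())
```

The pin itself stays a single line in the requirements file (`transformers==4.47.1`); the script only reports whether the running environment matches it.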

jhj0517 · 2025-01-26T03:49:55Z

@ei23fxg Yeah I'd appreciate it if you do that.

Fix for issue with insanely-fast-whisper see jhj0517#470

ei23fxg · 2025-01-26T23:57:58Z

Fixed with #484

fixed transformers version for issue #470

ei23fxg added the bug Something isn't working label Jan 22, 2025

ei23fxg assigned jhj0517 Jan 22, 2025

jhj0517 added this to the vad milestone Jan 22, 2025

jhj0517 added the hallucination hallucination of the models label Jan 22, 2025

jhj0517 closed this as completed Jan 22, 2025

jhj0517 reopened this Jan 22, 2025

ei23fxg added a commit to ei23fxg/Whisper-WebUI that referenced this issue Jan 26, 2025

fixed transformers version

a071431

Fix for issue with insanely-fast-whisper see jhj0517#470

ei23fxg mentioned this issue Jan 26, 2025

fixed transformers version for issue #470 #484

Merged

ei23fxg closed this as completed Jan 26, 2025

jhj0517 added a commit that referenced this issue Jan 27, 2025

Merge pull request #484 from ei23fxg/master

2ea3ad3

fixed transformers version for issue #470
