
Some SRT segments are way too big - Silero VAD problem #470

Closed
ei23fxg opened this issue Jan 22, 2025 · 15 comments
Labels: bug (Something isn't working), hallucination (hallucination of the models)
Milestone: vad

Comments

@ei23fxg
Contributor

ei23fxg commented Jan 22, 2025

Which OS are you using?

  • Linux (Debian), running the Docker container with CUDA
  • insanely-fast-whisper
  • default parameters
  • model: whisper-large-v3-turbo

When I create a timestamp file, I sometimes get very large segments that are more or less useless as subtitles.
I'm not entirely sure if this has always been the case or if it came with an update. I suspect it has something to do with the Silero VAD filter.
If the Silero VAD filter is disabled, the problem definitely occurs frequently.
I have already tested some settings on the Silero VAD filter; they change the results to some extent, but I have not been able to achieve the desired result.
I'm sure I'm doing something wrong, but others will surely run into this issue. It would be useful to mention some best practices about this in the documentation.

large-v3 has the same issue.
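
For context, the same Silero VAD options the WebUI exposes can also be exercised directly via faster-whisper. A minimal sketch, assuming faster-whisper's vad_parameters interface (the values are illustrative, not a verified fix):

```python
# Minimal sketch: passing Silero VAD options to faster-whisper directly.
# Parameter values are illustrative, not a verified fix.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda")
segments, info = model.transcribe(
    "audio.mp3",
    vad_filter=True,                   # enable the Silero VAD filter
    vad_parameters=dict(
        threshold=0.5,                 # speech-probability threshold
        min_silence_duration_ms=2000,  # silence required before a split
        max_speech_duration_s=30,      # cap the length of one speech chunk
        speech_pad_ms=400,             # padding around detected speech
    ),
)
for seg in segments:
    print(f"[{seg.start:.2f} --> {seg.end:.2f}] {seg.text}")
```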

Below is an SRT example for illustration.

...
9
00:02:51,580 --> 00:07:05,080
Pellentesque bibendum mollis fames maecenas quis in sapien. Ridiculus placerat interdum rutrum auctor posuere aliquet viverra vivamus. Dis neque fermentum dapibus lobortis tempus ut. Ultricies phasellus posuere nisl magna suspendisse semper. Eleifend ipsum enim et orci; ridiculus in nam donec placerat. Dapibus integer sapien mi massa cubilia.
Malesuada fermentum venenatis, blandit justo magna condimentum at? Ac nibh aliquet sodales nullam praesent euismod. Montes class congue donec sodales, nullam eget sem. Facilisis sociosqu aptent volutpat habitant elementum finibus condimentum. Lacus sit praesent id ex accumsan sed in erat. Erat finibus egestas ex ullamcorper cras mauris nibh laoreet. Dapibus ut mollis dapibus nulla lacinia mattis. Viverra sapien cras netus maximus orci. Interdum metus et integer mauris pulvinar gravida.
Rutrum quis est ultrices faucibus semper nibh pellentesque scelerisque. Proin arcu parturient praesent ligula eros dis consequat primis felis. Vitae himenaeos dui at conubia praesent etiam. Auctor platea eleifend ante sed nec euismod dapibus ante dignissim. Fames velit pretium fames placerat felis turpis commodo! Senectus tempus torquent commodo nisl laoreet consequat. Montes est magna auctor; accumsan non pulvinar consequat. Erat tellus torquent porttitor duis laoreet volutpat dictumst. Porttitor dignissim at etiam lacus morbi.

10
00:07:05,560 --> 00:07:10,180
Lorem ipsum odor amet, consectetuer adipiscing elit. Pellentesque dictum

11
00:07:10,520 --> 00:07:15,040
Erat finibus egestas ex ullamcorper cras mauris nibh laoreet
...
@ei23fxg ei23fxg added the bug Something isn't working label Jan 22, 2025
@jhj0517 jhj0517 added this to the vad milestone Jan 22, 2025
@jhj0517 jhj0517 added the hallucination hallucination of the models label Jan 22, 2025
@jhj0517
Owner

jhj0517 commented Jan 22, 2025

Hi. Could you try the VAD again with these settings in the "Voice Detection Filter" tab:
[screenshot: suggested VAD settings]

And see if it still happens?

@ei23fxg
Contributor Author

ei23fxg commented Jan 22, 2025

> Hi. Could you try the VAD again with these settings in the "Voice Detection Filter" tab: [screenshot: suggested VAD settings]
>
> And see if it still happens?

Hi, thanks for the fast reply.

Yes, I tried that.
Activating or deactivating the VAD filter makes no difference.
Threshold and speech padding made some difference, but I guess that's to be expected.
Same behavior for large-v3 (but a different result).

@ei23fxg
Contributor Author

ei23fxg commented Jan 22, 2025

I found an old subtitle file that I had generated; I think it was large-v3, back on 2024-11-25.

This is what it looked like back then:


1
00:00:00,000 --> 00:00:03,940
Hey everyone, ready to dive into something really mind-bending?

2
00:00:04,139 --> 00:00:05,240
Always up for a challenge.

3
00:00:06,320 --> 00:00:15,380
Today we're going to explore this wild idea that coherence might be like the secret ingredient

4
00:00:15,380 --> 00:00:20,839
to intelligence, consciousness, and maybe even like the universe itself.

5
00:00:20,899 --> 00:00:21,339
I love it.

6
00:00:21,399 --> 00:00:22,160
Pretty ambitious, right?

7
00:00:22,280 --> 00:00:22,480
Yeah.

8
00:00:22,600 --> 00:00:27,219
But we're not just theorizing here, we've got some seriously cool source material to work with.

9
00:00:27,280 --> 00:00:27,480
Yeah.

10
00:00:27,559 --> 00:00:31,859
So this researcher, David Shapiro, had this amazing conversation with, get this, Claude.

11
00:00:31,960 --> 00:00:34,979
One of Anthropik's, like, star large language models.

12
00:00:35,060 --> 00:00:35,420
Exactly.

13
00:00:35,619 --> 00:00:36,659
And they're back and forth.

And that's today:

1
00:00:00,000 --> 00:02:06,340
Hey everyone, ready to dive into something really mind-bending? Always up for a challenge. Today we're going to explore this wild idea that coherence might be like the secret ingredient to intelligence, consciousness, and maybe even like the universe itself. I love it. Pretty ambitious, right? Yeah. But we're not just theorizing here, we've got some seriously cool source material to work with. Yeah. So this researcher, David Shapiro, had this amazing conversation with, get this, Claude. One of Anthropics, like, star large language models. Exactly. And they're back and forth. It's not just like techie jargon. No, no. It's like philosophy meets AI. Yeah. They go deep on what makes us, you know, us. What is the essence of being human? Exactly. And by the end of this deep dive, I guarantee you'll see things in a whole new way. Totally. Evolution, politics, even your own decisions, everything will look different. So where do we even start with this coherence thing? Like, what are we even talking about? Okay, good question. So Claude, the AI, describes coherence as this like fundamental drive towards stable, organized patterns. And this kind of aversion to anything that like doesn't fit. So it's like our brains are always trying to make sense of the world, find the order in the chaos. Right. OK, I can see that. But where does this actually show up? Like in real life? Well, think about evolution for a second. The species and adaptations that survive are the ones that, you know, fit their environment best, the ones that are most coherent with their surroundings. So it's like survival of the fittest, but fittest actually means most coherent. Exactly. And this coherence thing, it's not just biology. Claude argues it's like fundamental to how we think too. Really? How so? I'm intrigued. Okay. Think about political ideologies, religions, even your personal philosophy. They're all trying to build this like complete internally consistent model of reality. So even when ideologies clash, they're each striving for coherence within their own little bubble. Yeah, precisely. And this clash of coherent systems, it gets even bigger when you think about geopolitics.

2
00:02:06,920 --> 00:02:46,420
Nations forming blocks, vying for power, trying to create stability in their own spheres of influence. Wow. So from tiny organisms evolving to like global power struggles. Yeah. It's all about this drive for coherence. It's pretty mind blowing, right? Yeah, it is. And then get this, Claude suggests that even consciousness itself could come from this same drive. Whoa. Okay, elaborate on that. So picture consciousness as this high-level process that's constantly trying to create a unified model of you, of your experiences, your place in the world, always trying to make sense of everything. So my brain right now trying to wrap itself around this concept is actually demonstrating coherence in action.

3
00:02:46,960 --> 00:02:47,440
That's meta.

Same file.

@jhj0517
Owner

jhj0517 commented Jan 22, 2025

It seems like this is an issue with faster-whisper's VAD implementation, which is also tracked by other issues.

According to #396 (comment), whisperX's VAD implementation gives better results, so I'm planning to add it later.

Any PR or suggestions about the VAD would be welcome.

@jhj0517 jhj0517 closed this as completed Jan 22, 2025
@jhj0517 jhj0517 reopened this Jan 22, 2025
@ei23fxg
Contributor Author

ei23fxg commented Jan 22, 2025

Alright, but I'm using insanely_fast_whisper.
So is it the same there?

@jhj0517
Owner

jhj0517 commented Jan 22, 2025

Yeah, it should be the same regardless of the implementation (the main difference between implementations is speed and VRAM efficiency; the result should be the same if you use the same model).

I hope using whisperX's VAD implementation will improve this kind of problem.

And if there is any hallucination, I would recommend using large-v2 over large-v3 because it tends to be more robust against noisy audio - #152 (comment)

@ei23fxg
Contributor Author

ei23fxg commented Jan 22, 2025

OK, but I just tried faster-whisper, and it looks better there with large-v3-turbo.
Close to the original... at the cost of a slower runtime.

1
00:00:00,000 --> 00:00:04,000
Hey, everyone. Ready to dive into something really mind-bending?

2
00:00:04,180 --> 00:00:05,300
Always up for a challenge.

3
00:00:06,400 --> 00:00:20,900
Today, we're going to explore this wild idea that coherence might be like the secret ingredient to intelligence, consciousness, and maybe even like the universe itself.

4
00:00:20,960 --> 00:00:21,400
I love it.

5
00:00:21,460 --> 00:00:22,220
Pretty ambitious, right?

6
00:00:22,340 --> 00:00:22,540
Yeah.

7
00:00:22,660 --> 00:00:27,280
But we're not just theorizing here. We've got some seriously cool source material to work with.

8
00:00:27,280 --> 00:00:31,960
Yeah. So this researcher, David Shapiro, had this amazing conversation with, get this, Claude.

9
00:00:32,020 --> 00:00:35,060
One of Anthropik's, like, star large language models.

10
00:00:35,140 --> 00:00:38,540
Exactly. And they're back and forth. It's not just like techie jargon.

11
00:00:38,680 --> 00:00:38,960
No, no.

12
00:00:39,000 --> 00:00:40,620
It's like philosophy meets AI.

13
00:00:41,040 --> 00:00:43,420
Yeah. They go deep on what makes us, you know, us.

@jhj0517
Owner

jhj0517 commented Jan 22, 2025

Hmm. Probably a different use of parameters caused that. For example, openai/whisper uses beam_size 1 by default while faster-whisper uses 5 by default.
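
For example, you could pin beam_size explicitly in both backends to rule that out (a rough sketch; model names and the audio path are illustrative):

```python
# Sketch: pin beam_size explicitly so both backends decode the same way.
import whisper                           # openai/whisper (greedy by default)
from faster_whisper import WhisperModel  # faster-whisper (beam_size=5 by default)

oa_model = whisper.load_model("large-v3")
oa_result = oa_model.transcribe("audio.mp3", beam_size=5)

fw_model = WhisperModel("large-v3")
segments, _info = fw_model.transcribe("audio.mp3", beam_size=5)
```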

@ei23fxg
Contributor Author

ei23fxg commented Jan 22, 2025

No, that is not the case. I tried some beam size settings:
1-10: no change.
15 as a sample: also nothing...

Of course, my Docker updates did not happen at the same time as the updates in the repository, but I can say that the problem became visible to me around January 17, 2025.

Since the Docker container has no versioning, I cannot identify an exact version.

@jhj0517
Owner

jhj0517 commented Jan 23, 2025

I meant that just as an example. Since insanely-fast-whisper uses transformers' AutomaticSpeechRecognitionPipeline implementation, it uses some different parameters than faster-whisper.

If this suddenly happened when using insanely-fast-whisper, I guess there were some updates to transformers.
The latest version of transformers was released two days ago, so I should look into whether this is caused by a transformers update.
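
Roughly, that code path boils down to something like this (a sketch of typical AutomaticSpeechRecognitionPipeline usage; the exact arguments used in the WebUI may differ):

```python
# Sketch: transcription via transformers' AutomaticSpeechRecognitionPipeline,
# the code path insanely-fast-whisper builds on. Arguments are illustrative.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device="cuda:0",
)
out = asr(
    "audio.mp3",
    chunk_length_s=30,       # chunked long-form decoding
    batch_size=8,
    return_timestamps=True,  # needed to build SRT segment boundaries
)
print(out["chunks"])         # [{"timestamp": (start, end), "text": "..."}, ...]
```

If a transformers release changed how these chunk timestamps are merged, that could explain segments suddenly growing.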

@ei23fxg
Contributor Author

ei23fxg commented Jan 23, 2025

It must have happened between 01/01/2025 and 17/01/2025.

@ei23fxg
Contributor Author

ei23fxg commented Jan 24, 2025

I actually just did a repository checkout of ad418ca and built the Docker image myself.
At that point, I am very sure the error was not present.
But rebuilding that older version of the image does not resolve the issue.

So there must be some kind of external dependency that causes the problem.

Maybe it would make sense to switch to GHCR and use tags / image versioning.
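
For example (the image path and tags here are hypothetical, just to illustrate versioned pulls):

```
docker pull ghcr.io/jhj0517/whisper-webui:v1.2.3      # hypothetical versioned tag
docker pull ghcr.io/jhj0517/whisper-webui@sha256:...  # or pin by digest
```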

@ei23fxg
Contributor Author

ei23fxg commented Jan 26, 2025

> I meant that just as an example. Since insanely-fast-whisper uses transformers' AutomaticSpeechRecognitionPipeline implementation, it uses some different parameters than faster-whisper.
>
> If this suddenly happened when using insanely-fast-whisper, I guess there were some updates to transformers. The latest version of transformers was released two days ago, so I should look into whether this is caused by a transformers update.

Yeah, that's it.
I changed the requirements to transformers==4.47.1 and built the Docker image.
That works for now with insanely-fast-whisper; transformers==4.48.0 or above, however, does not.
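
In other words, in requirements.txt (the version bound just reflects what I observed here; a proper upstream fix would be better):

```
# Pin transformers below 4.48.0 until the segment/timestamp
# regression with insanely-fast-whisper is resolved.
transformers==4.47.1
```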

Shall I PR that, @jhj0517?
Maybe a better fix is needed, though.

@jhj0517
Owner

jhj0517 commented Jan 26, 2025

@ei23fxg Yeah, I'd appreciate it if you did that.

ei23fxg added a commit to ei23fxg/Whisper-WebUI that referenced this issue Jan 26, 2025
Fix for issue with insanely-fast-whisper, see jhj0517#470
@ei23fxg
Contributor Author

ei23fxg commented Jan 26, 2025

Fixed with #484

@ei23fxg ei23fxg closed this as completed Jan 26, 2025
jhj0517 added a commit that referenced this issue Jan 27, 2025
fixed transformers version for issue #470