Correct the synced lyrics heuristically #1
FYI @arsaboo I've now opened this issue to shift the focus onto solving this specific challenge in this repo. If we can improve the quality of the transcribed lyrics as I've described above, that will be a huge win for the automated-karaoke challenge! 😀

However, I should also mention that there's some very recent (published 2 months ago) internal research work at Spotify demonstrating an approach to this which is much more exciting to me than my own rough algorithm described above: https://arxiv.org/pdf/2306.07744.pdf

It sounds like they've successfully trained a model specifically for lyrics syncing, and they've published a super helpful repo of royalty-free music (broad mix of genres) with lyrics and synced timestamps here: https://github.com/f90/jamendolyrics

That'll probably be useful for testing the accuracy of pure Whisper vs. Whisper plus my primitive known-lyrics matching algorithm above. I'm kinda hoping someone who understands more of the words in this research paper will be able to reproduce their results and share the model training code and the actual trained model with the world as an open source project, potentially making this whole issue redundant! I'm not going to hold out hope for too long though, as Spotify most likely want to keep their model an internal secret to give them a competitive advantage (e.g. being able to provide accurate synced lyrics for any song in the Spotify app without needing to license them from Musixmatch any more)...

Just wanted to flag this in case it was of interest to you, or in case anyone else with more ML experience than I have reads this and gets excited about training a model for lyrics syncing which could change the approach here!

Oh, and huge thanks to Adam from Youka (@youkaclub) for telling me about this research paper - I hadn't come across it till last week and it's got me kinda excited about tackling this problem space again 😄
If we assume that the lyrics downloaded from the internet are 100% correct (which they likely are not, as @beveradb said), a better approach could be to directly sync those lyrics to the song. I have found this repo which has the goal of enhancing Whisper's capabilities. In this package, they've added a
@beveradb, @Turtle6665 another repo worth checking out is https://github.com/Japan7/yohane. At the command line, I was able to get a workable .ass file from it with `poetry run yohane -e None test.mp3 lyrics.txt` (`-e None` means no vocal separation; I chose this to speed things up so I could check it out). Using the .ass file that was created, I then ran `ffmpeg -i test.mp3 -c:v libx264 -c:a libmp3lame -vf ass=test.ass -shortest test.mp4` to create the mp4 for viewing, and the result for the song I selected was decent enough, although it did require some tweaking of the .ass file. You can do that tweaking with an app called Aegisub (there are YouTube videos on how to use it, and the learning curve isn't huge). Although fairly old, it still works pretty well for producing a polished .ass file that you can combine with an instrumental to create a decent karaoke track.
This issue is a follow-on from this short thread in a related side-project; nomadkaraoke/python-audio-separator#8 (comment)
Problem:
lyrics-transcriber currently transcribes the given audio file using whisper-timestamped and writes the detected words to a lyrics file directly, with no cleanup or modification. This results in highly variable accuracy for the lyrics output, as Whisper is far from perfect at correctly detecting lyrics from music audio.
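To make the current behaviour concrete, here's a minimal sketch of that output step: taking word-level timestamps in the shape whisper-timestamped returns (segments containing words with `text` and `start` keys) and writing naive LRC lines with no cleanup. The function and variable names here are illustrative, not the actual lyrics-transcriber code.

```python
# Illustrative sketch (not the real lyrics-transcriber code): dump
# whisper-timestamped style word results straight to LRC with no correction.

def result_to_lrc(result: dict) -> str:
    lines = []
    for segment in result["segments"]:
        # Timestamp the line from its first word's start time
        start = segment["words"][0]["start"]
        minutes, seconds = divmod(start, 60)
        text = " ".join(w["text"] for w in segment["words"])
        lines.append(f"[{int(minutes):02d}:{seconds:05.2f}]{text}")
    return "\n".join(lines)

# Hypothetical transcription result in whisper-timestamped's output shape
example = {
    "segments": [
        {"words": [{"text": "Hello", "start": 12.3}, {"text": "world", "start": 12.8}]},
    ]
}
print(result_to_lrc(example))  # → [00:12.30]Hello world
```

Whatever Whisper mis-hears ends up verbatim in the output file, which is exactly the problem this issue is about.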
For an example, compare these two synced lyrics videos:
Fortunately, for the majority of songs, as long as we know the artist and title, we can download lyrics from the internet and hopefully use this to correct the detected lyrics from Whisper.
I've already implemented the fetching of lyrics from both Genius and Spotify.
This issue is to track the implementation of the hard part - using those lyrics to correct the detected lyrics.
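One simple way to sketch this correction step (not necessarily the algorithm that will be implemented here) is a word-level sequence alignment between the transcribed words and the fetched reference lyrics, using Python's stdlib `difflib.SequenceMatcher`. All names below are illustrative:

```python
# Sketch only: align Whisper's (timestamped) words against reference lyrics
# fetched from the internet, and substitute mis-heard runs of words.
import difflib


def correct_words(transcribed: list[str], reference: list[str]) -> list[str]:
    """Replace runs of mis-heard transcribed words with the matching
    reference words; align on lowercased, punctuation-stripped forms."""
    t = [w.lower().strip(".,!?") for w in transcribed]
    r = [w.lower().strip(".,!?") for w in reference]
    out = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, t, r).get_opcodes():
        if op == "equal":
            out.extend(transcribed[i1:i2])  # keep timestamped originals
        elif op == "replace":
            out.extend(reference[j1:j2])    # trust the reference spelling
        elif op == "insert":
            pass  # reference words never transcribed: no timestamps, skip
        # "delete": transcribed words absent from the reference; drop them
    return out


transcribed = ["I", "scream", "you", "scream", "we", "all", "scream", "for", "ice", "cream"]
reference   = ["I", "scream", "you", "scream", "we", "all", "scream", "for", "ice-cream"]
print(correct_words(transcribed, reference))
# → ['I', 'scream', 'you', 'scream', 'we', 'all', 'scream', 'for', 'ice-cream']
```

The hard part this toy version ignores is what to do with timestamps when word counts differ between the two sides (interpolation, or spreading a replaced run across the original run's time window), and how to stay robust when the internet lyrics themselves are wrong, as discussed below.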
Before discussing ways to approach this, it's worth being aware of the biggest limitations first:
1 - Lyrics from the internet are often wrong in various ways
Common examples include:
2 - Whisper-timestamped transcriptions are almost always wrong in various places
So, given these challenges, I'm holding out hope for the following approach (roughly):
This is a super rough set of thoughts though, and I'm sure the realities of this approach will become apparent when I attempt to implement it ;)