Correct the synced lyrics heuristically #1
FYI @arsaboo I've now opened this issue to shift the focus onto solving this specific challenge in this repo. If we can improve the quality of the transcribed lyrics as I've described above, that will be a huge win for the automated-karaoke challenge! 😀

However, I should also mention that there's some very recent (published 2 months ago) internal research work at Spotify demonstrating an approach to this which is much more exciting to me than my own rough algorithm described above: https://arxiv.org/pdf/2306.07744.pdf

It sounds like they've successfully trained a model specifically for lyrics syncing, and they've published a super helpful repo of royalty-free music (broad mix of genres) with lyrics and synced timestamps here: https://github.com/f90/jamendolyrics

That'll probably be useful for testing the accuracy of pure Whisper vs. Whisper plus my primitive known-lyrics matching algorithm above. I'm kinda hoping someone who understands more of the words in this research paper will be able to reproduce their results and share the model training code and the actual trained model with the world as an open source project, potentially making this whole issue redundant! I'm not going to hold out hope for too long though, as Spotify most likely want to keep their model an internal secret to give them a competitive advantage (e.g. being able to provide accurate synced lyrics for any song in the Spotify app without needing to license them from Musixmatch any more)...

Just wanted to flag this in case it was of interest to you, or in case anyone else with more ML experience than I have reads this and gets excited about training a model for lyrics syncing which could change the approach here!

Oh, and huge thanks to Adam from Youka (@youkaclub) for telling me about this research paper - I hadn't come across it till last week and it's got me kinda excited about tackling this problem space again 😄
If we assume that the lyrics downloaded from the internet are 100% correct (which they likely are not, as @beveradb said), a better approach could be to directly sync those lyrics to the song. I have found this repo which has the goal of enhancing Whisper's capabilities. In this package, they've added a
@beveradb, @Turtle6665 another repo worth checking out is https://github.com/Japan7/yohane. At the command line, I was able to get a workable .ass file from it with `poetry run yohane -e None test.mp3 lyrics.txt` (`-e None` means no vocal separation; I chose this to speed things up so I could check it out). Using the .ass file that was created, I then ran `ffmpeg -i test.mp3 -c:v libx264 -c:a libmp3lame -vf ass=test.ass -shortest test.mp4` to create the mp4 for viewing, and the result for the song I selected was decent enough, although it did require some tweaking of the .ass file. You can do that tweaking with an app called Aegisub (there are YouTube videos on how to use it, and the learning curve isn't huge). Although fairly old, it still works pretty well for producing a polished .ass file that you can combine with an instrumental to create a decent karaoke track.
This issue is a follow-on from this short thread in a related side-project; nomadkaraoke/python-audio-separator#8 (comment)
Problem:
lyrics-transcriber currently transcribes the given audio file using whisper-timestamped and writes the detected words to a lyrics file directly, with no cleanup or modification. This results in highly variable accuracy for the lyrics output, as Whisper is far from perfect at correctly detecting lyrics from music audio.
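To make the current behaviour concrete, here's a minimal sketch of that output step: taking word-level timestamps in the shape whisper-timestamped returns (segments containing words with `text` and `start` keys) and writing naive LRC lines with no cleanup. The function and variable names here are illustrative, not the actual lyrics-transcriber code.

```python
# Illustrative sketch (not the real lyrics-transcriber code): dump
# whisper-timestamped style word results straight to LRC with no correction.

def result_to_lrc(result: dict) -> str:
    lines = []
    for segment in result["segments"]:
        # Timestamp the line from its first word's start time
        start = segment["words"][0]["start"]
        minutes, seconds = divmod(start, 60)
        text = " ".join(w["text"] for w in segment["words"])
        lines.append(f"[{int(minutes):02d}:{seconds:05.2f}]{text}")
    return "\n".join(lines)

# Hypothetical transcription result in whisper-timestamped's output shape
example = {
    "segments": [
        {"words": [{"text": "Hello", "start": 12.3}, {"text": "world", "start": 12.8}]},
    ]
}
print(result_to_lrc(example))  # → [00:12.30]Hello world
```

Whatever Whisper mis-hears ends up verbatim in the output file, which is exactly the problem this issue is about.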
For an example, compare these two synced lyrics videos:
Fortunately, for the majority of songs, as long as we know the artist and title, we can download lyrics from the internet and hopefully use this to correct the detected lyrics from Whisper.
I've already implemented the fetching of lyrics from both Genius and Spotify.
This issue is to track the implementation of the hard part - using those lyrics to correct the detected lyrics.
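One simple way to sketch this correction step (not necessarily the algorithm that will be implemented here) is a word-level sequence alignment between the transcribed words and the fetched reference lyrics, using Python's stdlib `difflib.SequenceMatcher`. All names below are illustrative:

```python
# Sketch only: align Whisper's (timestamped) words against reference lyrics
# fetched from the internet, and substitute mis-heard runs of words.
import difflib


def correct_words(transcribed: list[str], reference: list[str]) -> list[str]:
    """Replace runs of mis-heard transcribed words with the matching
    reference words; align on lowercased, punctuation-stripped forms."""
    t = [w.lower().strip(".,!?") for w in transcribed]
    r = [w.lower().strip(".,!?") for w in reference]
    out = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, t, r).get_opcodes():
        if op == "equal":
            out.extend(transcribed[i1:i2])  # keep timestamped originals
        elif op == "replace":
            out.extend(reference[j1:j2])    # trust the reference spelling
        elif op == "insert":
            pass  # reference words never transcribed: no timestamps, skip
        # "delete": transcribed words absent from the reference; drop them
    return out


transcribed = ["I", "scream", "you", "scream", "we", "all", "scream", "for", "ice", "cream"]
reference   = ["I", "scream", "you", "scream", "we", "all", "scream", "for", "ice-cream"]
print(correct_words(transcribed, reference))
# → ['I', 'scream', 'you', 'scream', 'we', 'all', 'scream', 'for', 'ice-cream']
```

The hard part this toy version ignores is what to do with timestamps when word counts differ between the two sides (interpolation, or spreading a replaced run across the original run's time window), and how to stay robust when the internet lyrics themselves are wrong, as discussed below.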
Before discussing ways to approach this, it's worth being aware of the biggest limitations first:
1 - Lyrics from the internet are often wrong in various ways
Common examples include:
2 - Whisper-timestamped transcriptions are almost always wrong in various places
So, given these challenges, I'm holding out hope for the following approach (roughly):
This is a super rough set of thoughts though, and I'm sure the realities of this approach will become apparent when I attempt to implement it ;)