audiobook: support reading from audio files in TTS #353

Open
wants to merge 35 commits into base: master

@teleshoes teleshoes commented Apr 12, 2023

NOTE: this is a working audiobook impl, but it is FAR from polished, and it is NOT plug-and-play. i am already using it all the time, though, so i figured i'd stick it here in case someone else is interested.

FEATURES

  • supports reading from mp3/ogg/flac/wav files, instead of android TTS engines, inside the TTS ReadAloud Module
  • supports sentence navigation, pause/play/next/prev, etc
  • moves the selected sentence as playback progresses, and plays from the selected sentence
  • does not assume audiobook and ebook are identical, makes best guesses and moves along
    • even large missing or additional sections generally work fine
      • if the audiobook plays a section missing from the ebook, the visual sentence selector waits for the audio to catch up
      • if the ebook has a section missing from the audiobook, the visual sentence selector skips a sentence every half second until it catches up.
  • does not skip over audiobook intros/music etc. the full audiobook is played, and the place in the book moves along while the content is read
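
a rough picture of the two catch-up cases above, as a polling loop. this is only a sketch in python, not the code in this PR: `advance_selection`, `simulate`, and `POLL_SECONDS` are made-up names, the half-second tick is taken from the description above, and it assumes every sentence already has a start time in a single audio file, with sentences that are missing from the audio ending up with start times at or before the current playback position.

```python
POLL_SECONDS = 0.5  # assumed poll interval, from the "every half second" note


def advance_selection(selected, position, starts):
    """Return the sentence index to select after one poll tick.

    starts[i] is the start time (seconds) of sentence i, in non-decreasing
    order. The selection moves forward by at most one sentence per tick.
    """
    nxt = selected + 1
    if nxt < len(starts) and position >= starts[nxt]:
        return nxt       # playback has reached the next sentence: advance
    return selected      # otherwise wait for the audio to catch up


def simulate(starts, duration):
    """Poll every POLL_SECONDS and record (position, selected sentence)."""
    selected = 0
    trace = []
    for tick in range(int(duration / POLL_SECONDS) + 1):
        position = tick * POLL_SECONDS
        selected = advance_selection(selected, position, starts)
        trace.append((position, selected))
    return trace


# sentences 1 to 3 share one start time (text missing from the audio),
# sentence 4 starts much later (extra audio that is not in the ebook)
for position, selected in simulate([0.0, 5.0, 5.0, 5.0, 30.0], 7.0):
    print(position, selected)
```

with these start times the selector steps through sentences 1, 2 and 3 at 5.0, 5.5 and 6.0 seconds, then stays on sentence 3 until the audio reaches 30 seconds.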

USAGE

  • requires creating a .wordtiming text file for each e-book OUTSIDE OF COOLREADER
  • once created, place the wordtiming file, ebook, and audiobook mp3/flac/etc files in the same directory
  • open the ebook as normal, and start TTS. if there is a wordtiming file, the audiobook is used

EXAMPLE

  • attached is A Christmas Carol, by Charles Dickens, from Project Gutenberg and Librivox (all works in the public domain)
  • download all files, and extract the zip file to any directory on the device
  • wordtiming file was generated with this script:
    ebook-audiobook-wordtiming \
      a_christmas_carol_charles_dickens_project_gutenberg_19337.txt \
      A_Christmas_Carol*.flac \
      -o a_christmas_carol_charles_dickens_project_gutenberg_19337.wordtiming
    

        }
    } catch(Exception e){
        //ignore
    }
Contributor

why not refactor to the try-with-resources paradigm?

Author

done

@teleshoes
Author

perl

  • P1) vosk-timing-data - run vosk-words-json on WAV files, get statistics on each word
    • output is a big JSON file
    • this is the only CPU-intensive/long-running step
    • the result is LZMA compressed and cached
    • the cache can be generated with or without an ebook, for processing later
    • (i plan on running this on all audiobooks as i acquire them)
  • P2) audio-word-timing - process vosk-timing-data into a CSV with three columns: AUDIO_WORD,START_TIME_SECONDS,AUDIO_FILE
    • this is the start time of each word in the AUDIOBOOK
  • P3) audio-word-list - make a copy of audio-word-timing and remove the START_TIME_SECONDS column
  • P4) ebook-word-list - process the EPUB/FB2/TXT file into a list of words (one word per line) with pandoc
    • this step tries to apply the same rules that coolreader will use later
  • P5) ebook-audio-diff - align audio-word-list and ebook-word-list
    • this is the ONLY STEP that compares the ebook to the audiobook
    • this step handles bad spelling, bad pronunciation, proper nouns, missing passages, extra words, skipped sentences, footnotes read in line instead of at the end of the chapter, EVERYTHING that is different between the audiobook and the ebook
    • it uses the Myers difference algorithm, finding the Longest-Common-Subsequence
    • i.e.: it's literally just diff -y
  • P6) ebook-word-timing - combine ebook-word-list, ebook-audio-diff, and audio-word-timing to get ebook timing
    • take ebook-word-list and add two columns, START_TIME_SECONDS and AUDIO_FILE
    • using ebook-audio-diff, find the highest index of the word in audio-word-list that is not part of the longest-common-subsequence of a later word
    • take that index and get the timing from audio-word-timing, and fill in START_TIME_SECONDS/AUDIO_FILE columns with it
    • after this point, the words in the audiobook are not used and never appear again
      • you will never see mispronounced words, spoken errors, etc. every word from here on out appears, in order, in the FB2/EPUB/TXT
    • this is the start time of each word in the EBOOK
    • this is the contents of *.wordtiming, and is the final output of the perl script
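
a minimal python sketch of the last two steps (ebook-audio-diff and ebook-word-timing), just to show the shape of the data. it is not the perl script: it aligns the two word lists with python's `difflib.SequenceMatcher` instead of the myers diff, and it uses a simpler fill rule than the one described above (an ebook word with no match simply inherits the time of the previous matched word). the function name and the sample words, times, and file name are invented for illustration.

```python
from difflib import SequenceMatcher


def ebook_word_timing(ebook_words, audio_timing):
    """Attach a start time and audio file to every ebook word.

    audio_timing is a list of (audio_word, start_seconds, audio_file) rows.
    Returns a list of (ebook_word, start_seconds, audio_file) rows.
    """
    if not audio_timing:
        raise ValueError("audio_timing must not be empty")
    audio_words = [word for word, _start, _file in audio_timing]

    # stand-in alignment (NOT myers): map ebook index -> audio index
    matcher = SequenceMatcher(None, ebook_words, audio_words, autojunk=False)
    aligned = {}
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            aligned[block.a + offset] = block.b + offset

    rows = []
    # assumption: words before the first match start at 0 in the first file
    last = (0.0, audio_timing[0][2])
    for index, word in enumerate(ebook_words):
        if index in aligned:
            _word, start, audio_file = audio_timing[aligned[index]]
            last = (start, audio_file)
        # simplified fill rule: unmatched words reuse the previous match
        rows.append((word, last[0], last[1]))
    return rows


ebook = ["marley", "was", "dead", "to", "begin", "with"]
audio = [
    ("chapter", 0.0, "01.flac"), ("one", 0.4, "01.flac"),
    ("marley", 1.2, "01.flac"), ("was", 1.6, "01.flac"),
    ("dead", 1.9, "01.flac"), ("begin", 2.6, "01.flac"),
    ("with", 2.9, "01.flac"),
]
for row in ebook_word_timing(ebook, audio):
    print(row)
```

in this sample the audio has two extra words at the start and is missing "to"; the extra words are dropped from the output, and "to" gets the same start time as "dead".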

coolreader

  • CR1) sentence-info - when audiobook-tts starts, navigate to each sentence and get info
    • jump to page 0
    • select the first sentence on the page
    • select the next sentence, repeatedly, until there are no sentences left in the book
    • after selecting each sentence, record the sentence-text and the dom-start-pos
    • this is done in CPP, invoked from a JNI file, and sent to java as a List<SentenceInfo>
    • this is turned into an (improper) CSV with two columns, START_POS and TEXT
      • start pos is a DOM id, from ldomXPointerEx->toString() (it never has any commas)
      • TEXT is allowed to contain commas, because it's the last column (hence, this is an improper CSV)
      • e.g.: /text/p[45].135, having little or no money in my pocket
    • java caches this sentence info in a file, *.sentenceinfo, if coolreader has write perms where the ebook is
    • this step is the only slow part in coolreader
    • it is purely the coolreader sentence structure parsing
    • it has NOTHING to do with audiobooks, or wordtimings, or anything
  • CR2) sentence-words - parse each sentence into a list of words, as close to step P4) as possible
    • it's never exactly the same, because pandoc is not coolreader, but the difference is SMALL
  • CR3) sentence-start-times - compare sentence-words to ebook-word-timing file
    • load *.wordtiming, parse into a list of word/start-time pairs
    • for each sentence, take each word in that sentence and try to match it to the next word in wordtiming
    • a sentence must match EVERY SINGLE WORD in sentence-words to a word in wordtiming
    • however, allow skipping up to 20 WORDTIMING words (this limit is arbitrary, but if it's very large, passages could be skipped)
    • if a sentence does not match every word (happens all the time), consume the words that did match, use the lowest start time matched, and continue to the next sentence
    • the output here is a start TIME and start PHYSICAL-POSITION for every sentence (this is the final pre-processing goal)
  • CR4) start-playback - play audiobook instead of TTS
    • when tts-start is invoked, get the DOM location of the start of the initially selected sentence
    • select the closest sentence in sentence-start-times
    • open the audio file, and seek to the start time
    • NOTE: if this is the first sentence in the audio file, ALWAYS seek to 0
      • this way, you hear the full audiobook
      • start music, copyright, "Start of CD Number Three", etc
  • CR5) continue-playback - select the next sentence as audiobook position continues
    • do not stop playback, ever, without user interaction (this way, you get to hear the end of each audiobook file)
    • when media playback time is after the next sentence in sentence-start-times, select the next sentence as if user clicked Next >>
    • when media playback STOPS, and the next sentence is the FIRST sentence of a new audio file, start the next audio file
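
a small python sketch of CR1 (parsing a *.sentenceinfo line), CR2 and CR3, under stated assumptions. the real code in this PR is java/CPP and will differ. assumptions: words are split with a simple regex (the real rules are not shown here), the 20-word skip limit is applied per word rather than per sentence, wordtiming is already loaded as (word, start_seconds) pairs for a single audio file, and a sentence with no matched word gets `None`. all function names and sample values are hypothetical.

```python
import re

MAX_SKIP = 20  # wordtiming words that may be skipped while looking for a match


def parse_sentence_info(line):
    """Split an 'improper CSV' line into (start_pos, text).

    Only the first comma is a separator, so text may contain commas.
    """
    start_pos, text = line.rstrip("\n").split(",", 1)
    return start_pos, text


def sentence_words(text):
    """Hypothetical word splitter, standing in for the real rules."""
    return re.findall(r"[a-z0-9']+", text.lower())


def sentence_start_times(sentences, word_timing, max_skip=MAX_SKIP):
    """Return a list of (start_pos, start_seconds or None), one per sentence."""
    cursor = 0  # index of the next unconsumed wordtiming entry
    result = []
    for start_pos, text in sentences:
        start_time = None
        for word in sentence_words(text):
            # look ahead, skipping at most max_skip wordtiming entries
            limit = min(len(word_timing), cursor + max_skip + 1)
            hit = next((i for i in range(cursor, limit)
                        if word_timing[i][0] == word), None)
            if hit is None:
                continue  # no match: keep the cursor, try the next word
            if start_time is None:
                # first match is the lowest time if times never decrease
                start_time = word_timing[hit][1]
            cursor = hit + 1  # consume everything up to the matched word
        result.append((start_pos, start_time))
    return result


lines = [
    "/text/p[1].0,marley was dead, to begin with",
    "/text/p[1].31,there is no doubt whatever about that",
]
timing = [
    ("marley", 1.2), ("was", 1.6), ("dead", 1.9), ("to", 1.9),
    ("begin", 2.6), ("with", 2.9), ("there", 3.5), ("is", 3.7),
    ("no", 3.9), ("doubt", 4.1),
]
starts = sentence_start_times([parse_sentence_info(l) for l in lines], timing)
print(starts)
```

the second sentence only matches its first four words, but it still gets the start time of "there". one weakness of this sketch: a common word that happens to appear within the skip window can pull the cursor forward too far, and nothing here guards against that.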

@teleshoes
Author

moved ebook-audiobook-wordtiming to its own repo:
https://github.com/teleshoes/ebook-audiobook-wordtiming
