audiobook: support reading from audio files in TTS #353

teleshoes · 2023-04-12T19:02:21Z

NOTE: this is a working audiobook impl, but it is FAR from polished, and it is NOT plug-and-play. i am already using it all the time, though, so i figured i'd stick it here in case someone else is interested.

FEATURES

supports reading from mp3/ogg/flac/wav files, instead of android TTS engines, inside the TTS ReadAloud Module
supports sentence navigation, pause/play/next/prev, etc
moves the selected sentence as playback progresses, and plays from the selected sentence
does not assume audiobook and ebook are identical, makes best guesses and moves along
- even large missing or additional sections generally work fine
  - if the audiobook plays a section missing from the ebook, the visual sentence selector waits for the audio to catch up
  - if the ebook has a section missing from the audiobook, the visual sentence selector skips a sentence every half second until it catches up.
does not skip over audiobook intros/music etc. the full audiobook is played, and the place in the book moves along while the content is read

USAGE

requires creating a .wordtiming text file for each e-book OUTSIDE OF COOLREADER
- this file can be generated from any speech-to-text system like vosk that supports per-word timings
- ebook-audiobook-wordtiming is a script for generating this using vosk, pandoc, gnu-diff, python, + perl
- see:
  - https://github.com/teleshoes/ebook-audiobook-wordtiming
    (for vosk-words-json and ebook-audiobook-wordtiming)
  - https://github.com/alphacep/vosk-api
  - https://pypi.org/project/vosk/
once created, place the wordtiming file, ebook, and audiobook mp3/flac/etc files in the same directory
open the ebook as normal, and start TTS. if there is a wordtiming file, the audiobook is used

EXAMPLE

attached is A Christmas Carol, by Charles Dickens, from Project Gutenberg and Librivox (all works in the public domain)
- https://drive.google.com/drive/folders/1abepyfOW9on94tiZpoEdN4QBYuud_jk4?usp=sharing
- includes: e-book (.txt), audiobook (.flac), and wordtiming file (*.wordtiming)
download all files, and extract the zip file to any directory on the device

wordtiming file was generated with this script:

ebook-audiobook-wordtiming \
  a_christmas_carol_charles_dickens_project_gutenberg_19337.txt \
  A_Christmas_Carol*.flac \
  -o a_christmas_carol_charles_dickens_project_gutenberg_19337.wordtiming

plotn · 2023-04-19T13:45:33Z

android/src/org/coolreader/crengine/WordTimingAudiobookMatcher.java

+				}
+			} catch(Exception e){
+				//ignore
+			}


why not refactor to the try-with-resources paradigm?

teleshoes · 2023-04-19T22:15:15Z

perl

P1) vosk-timing-data - run vosk-words-json on WAV files, get statistics on each word
- output is a big JSON file
- this is the only CPU-intensive/long-running step
- the result is LZMA compressed and cached
- the cache can be generated with or without an ebook, for processing later
- (i plan on running this on all audiobooks as i acquire them)
P2) audio-word-timing - process vosk-timing-data into a CSV with three columns: AUDIO_WORD,START_TIME_SECONDS,AUDIO_FILE
- this is the start time of each word in the AUDIOBOOK
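
a minimal sketch of the audio-word-timing step, assuming each vosk result is the usual JSON object with a "result" array of word/start/end/conf entries. the real layout of the cached vosk-timing-data file may differ, and the function names here are hypothetical, not taken from the script:

```python
import csv
import io
import json


def vosk_result_to_rows(result_json, audio_file):
    """Turn one vosk result (JSON text) into (word, start_seconds, audio_file) rows."""
    data = json.loads(result_json)
    # assumption: per-word entries live under "result" when word output is enabled
    return [(w["word"], float(w["start"]), audio_file) for w in data.get("result", [])]


def rows_to_csv(rows):
    """Render rows as AUDIO_WORD,START_TIME_SECONDS,AUDIO_FILE lines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for word, start, audio_file in rows:
        writer.writerow([word, f"{start:.3f}", audio_file])
    return out.getvalue()
```

dropping the middle column of these rows gives the audio-word-list used by the next step.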
P3) audio-word-list - make a copy of audio-word-timing and remove the START_TIME_SECONDS column
P4) ebook-word-list - process the EPUB/FB2/TXT file into a list of words (one word per line) with pandoc
- this step tries to apply the same rules that coolreader will use later
P5) ebook-audio-diff - align audio-word-list and ebook-word-list
- this is the ONLY STEP that compares the ebook to the audiobook
- this step handles bad spelling, bad pronunciation, proper nouns, missing passages, extra words, skipped sentences, footnotes read inline instead of at the end of the chapter, EVERYTHING that is different between the audiobook and the ebook
- it uses the Myers difference algorithm, finding the Longest-Common-Subsequence
- i.e.: it's literally just diff -y
P6) ebook-word-timing - combine ebook-word-list, ebook-audio-diff, and audio-word-timing to get ebook timing
- take ebook-word-list and add two columns, START_TIME_SECONDS and AUDIO_FILE
- using ebook-audio-diff, for each ebook word, find the highest index in audio-word-list that is not part of the longest-common-subsequence match of a later word
- take that index and get the timing from audio-word-timing, and fill in START_TIME_SECONDS/AUDIO_FILE columns with it
- after this point, the words in the audiobook are not used and never appear again
  - you will never see mispronounced words, spoken errors, etc. every word from here on out appears, in order, in the FB2/EPUB/TXT
- this is the start time of each word in the EBOOK
- this is the contents of *.wordtiming, and is the final output of the perl script
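
a rough sketch of the ebook-audio-diff and ebook-word-timing steps together. it swaps in python's difflib.SequenceMatcher for gnu diff (difflib is not the Myers algorithm, it only produces a comparable alignment). it also assumes a simplified rule for unmatched ebook words: they borrow the timing of the nearest earlier matched word. this may not be the exact rule the script uses. the function name is hypothetical:

```python
from difflib import SequenceMatcher


def ebook_word_timing(ebook_words, audio_timing):
    """audio_timing: list of (audio_word, start_seconds, audio_file).

    Returns one (ebook_word, start_seconds, audio_file) per ebook word.
    """
    if not audio_timing:
        raise ValueError("no audio words to align against")
    audio_words = [word for word, _, _ in audio_timing]
    matcher = SequenceMatcher(a=ebook_words, b=audio_words, autojunk=False)

    # ebook index -> audio index, for every word in a matching block
    matched = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            matched[block.a + k] = block.b + k

    result = []
    last_audio_index = 0  # assumption: words before the first match use the first audio word
    for i, word in enumerate(ebook_words):
        if i in matched:
            last_audio_index = matched[i]
        _, start, audio_file = audio_timing[last_audio_index]
        result.append((word, start, audio_file))
    return result
```

only ebook words appear in the output. audio-only words ("chapter one", misreadings, etc) just fall out of the alignment.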

coolreader

CR1) sentence-info - when audiobook-tts starts, navigate to each sentence and get info
- jump to page 0
- select the first sentence on the page
- select the next sentence, repeatedly, until there are no sentences left in the book
- after selecting each sentence, record the sentence-text and the dom-start-pos
- this is done in CPP, invoked from a JNI file, and sent to java as a List<SentenceInfo>
- this is turned into an (improper) CSV with two columns, START_POS and TEXT
  - start pos is a DOM id, from ldomXPointerEx->toString() (it never has any commas)
  - TEXT is allowed to contain commas, because it's the last column (hence, this is an improper CSV)
  - e.g.: /text/p[45].135, having little or no money in my pocket
- java caches this sentence info in a file, *.sentenceinfo, if coolreader has write perms where the ebook is
- this step is the only slow part in coolreader
- it is purely the coolreader sentence structure parsing
- it has NOTHING to do with audiobooks, or wordtimings, or anything
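
a small sketch of reading and writing the "improper CSV" described above: split on the FIRST comma only, so TEXT can keep its own commas. the function names are hypothetical, and whether the real file has whitespace after the separator is not assumed either way (the text is kept verbatim):

```python
def parse_sentence_info_line(line):
    """Split 'START_POS,TEXT' on the first comma only."""
    start_pos, sep, text = line.rstrip("\r\n").partition(",")
    if not sep:
        raise ValueError("expected START_POS,TEXT but found no comma")
    return start_pos, text


def format_sentence_info_line(start_pos, text):
    """Inverse of parse_sentence_info_line; START_POS must be comma-free."""
    if "," in start_pos:
        raise ValueError("START_POS must not contain a comma")
    return f"{start_pos},{text}"
```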
CR2) sentence-words - parse each sentence into a list of words, as close to step P4) as possible
- it's never exactly the same, because pandoc is not coolreader, but the difference is SMALL
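
for illustration only, a sketch of one way to split a sentence into comparable words. the actual splitting rules in the PR and in the perl script are not shown here, so the regex below is an assumption: lowercase, keep letters/digits from any script, keep an apostrophe inside a word, drop all other punctuation:

```python
import re

# [^\W_] is "any unicode word character except underscore"
_WORD = re.compile(r"[^\W_]+(?:'[^\W_]+)*")


def sentence_words(sentence):
    """Return the lowercase words of a sentence, punctuation removed."""
    return [m.group(0).lower() for m in _WORD.finditer(sentence)]
```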
CR3) sentence-start-times - compare sentence-words to ebook-word-timing file
- load *.wordtiming, parse into a list of word/start-time pairs
- for each sentence, take each word in that sentence and try to apply to the next word in wordtiming
- a sentence must match EVERY SINGLE WORD in sentence-words to a word in wordtiming
- however, allow skipping up to 20 WORDTIMING words (this is arbitrary, but if it's very long, passages could be skipped)
- if a sentence does not match every word (happens all the time), consume the words that did match, use the lowest start time matched, and continue to the next sentence
- the output here is a start TIME and start PHYSICAL-POSITION for every sentence (this is the final pre-processing goal)
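
a simplified sketch of the sentence-start-times matching described above, not the PR's java code. assumptions: the skip limit is applied per word, a word that is not found nearby is skipped without moving the wordtiming cursor, and the first matched word gives the sentence start time (which is the lowest one as long as times increase). names are hypothetical:

```python
def sentence_start_times(sentences, word_timings, max_skip=20):
    """sentences: list of word lists; word_timings: list of (word, start_seconds).

    Returns one start time per sentence, or None if no word matched.
    """
    pos = 0  # index of the next unconsumed wordtiming entry
    starts = []
    for words in sentences:
        first_time = None
        for word in words:
            # look at the next entry, plus up to max_skip entries after it
            limit = min(len(word_timings), pos + max_skip + 1)
            hit = next((k for k in range(pos, limit) if word_timings[k][0] == word), None)
            if hit is None:
                continue  # no match nearby: keep the cursor where it is
            if first_time is None:
                first_time = word_timings[hit][1]
            pos = hit + 1  # consume everything up to and including the match
        starts.append(first_time)
    return starts
```

a large max_skip lets a common word like "is" match far ahead and swallow a passage, which is the trade-off behind the arbitrary limit of 20.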
CR4) start-playback - play audiobook instead of TTS
- when tts-start is invoked, get the DOM location of the start of the initially selected sentence
- select the closest sentence in sentence-start-times
- open the audio file, and seek to the start time
- NOTE: if this is the first sentence in the audio file, ALWAYS seek to 0
  - this way, you hear the full audiobook
  - start music, copyright, "Start of CD Number Three", etc
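
a sketch of the start-playback decision, under simplifying assumptions: sentences are indexed in document order instead of compared by DOM position, "closest" means nearest index that has a timing, and ties go to the earlier sentence. the function name and data layout are hypothetical:

```python
def playback_start(timed, selected):
    """timed: per-sentence (audio_file, start_seconds), or None if unmatched.

    Returns (sentence_index, audio_file, seek_seconds), or None if nothing is timed.
    """
    candidates = [i for i, entry in enumerate(timed) if entry is not None]
    if not candidates:
        return None
    # nearest timed sentence; on a tie prefer the earlier one
    nearest = min(candidates, key=lambda i: (abs(i - selected), i))
    audio_file, start = timed[nearest]
    # first sentence of its audio file: seek to 0 so intros/music are heard
    first_in_file = all(e is None or e[0] != audio_file for e in timed[:nearest])
    return nearest, audio_file, (0.0 if first_in_file else start)
```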
CR5) continue-playback - select the next sentence as audiobook position continues
- do not stop playback, ever, without user interaction (this way, you get to hear the end of each audiobook file)
- when media playback time is after the next sentence in sentence-start-times, select the next sentence as if user clicked Next >>
- when media playback STOPS, and the next sentence is the FIRST sentence of a new audio file, start the next audio file
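
a sketch of the continue-playback rules as a polling tick, assuming every sentence has a timing and that the selection moves forward by at most one sentence per tick (which would match the selector catching up one sentence at a time). function names are hypothetical, and the real code reads the position from android's MediaPlayer:

```python
def next_selection(timed, current, audio_file, position):
    """One tick: return the sentence index to select (current or current + 1).

    timed: per-sentence (audio_file, start_seconds), in document order.
    """
    nxt = current + 1
    if nxt >= len(timed):
        return current
    next_file, next_start = timed[nxt]
    # only advance within the file being played; never stop playback here
    if next_file == audio_file and position >= next_start:
        return nxt
    return current


def next_file_on_stop(timed, current, audio_file):
    """When playback stops: (index, file) to start next, or None."""
    nxt = current + 1
    if nxt < len(timed) and timed[nxt][0] != audio_file:
        return nxt, timed[nxt][0]
    return None
```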

teleshoes · 2023-04-22T18:30:53Z

moved ebook-audiobook-wordtiming to its own repo:
https://github.com/teleshoes/ebook-audiobook-wordtiming

teleshoes added 15 commits April 7, 2023 19:22

lvdocview: add nextSentence(), for iterating over all sentences

6863053

jni[docview]: getAllSentences() to get coords and text of all sentences

1e096a0

tts: fetch all sentences when initializing TTS

51b3b25

audiobook: implement playing audiobooks in TTS, using vosk word timings

a499382

audiobook: add audiobook navigation in TTS for vosk *.wordtiming files

eb82727

audiobook: run parseWordTimingsFile() in a worker thread to prevent ANR

634b92a

audiobook: add option 'app.tts.use.audiobook' to settings in TTS ui

3dd172f

audiobook: replace '.split()' for performance (11s => 4s)

0fd8de8

audiobook: replace WORD_TIMING_REGEX for performance (4s => 1s)

0377970

audiobook: re-use the same File object across sentences

2636053

audiobook: start each audio file at 0.0s

a039fcb

audiobook: do not select next sentence multiple times for new audiofiles

8742cd1

audiobook: null-check MediaPlayer

bd844f0

tts: pull all TTS control buttons to class vars

bc749bf

audiobook: hide the top row of buttons while calculating word timings

ee12e38

teleshoes force-pushed the audiobook_in_tts branch from cd4d405 to 85bacd0 Compare April 13, 2023 03:56

teleshoes added 4 commits April 15, 2023 11:24

audiobook: use startPos doc position strings instead of (x,y) coords

86699c3

audiobook: cache sentence info next to the ebook

5254c6c

audiobook: close the word timings file handle

efae0b1

audiobook: add fuzzier word-matching for ebook words vs ebook sentences

7b58b20

this is unrelated to audiobook word matching, strictly about comparing ebook word splitting in coolreader to external

plotn reviewed Apr 19, 2023

audiobook: read wordtiming+sentencecache using try-with-resources

b251204

teleshoes added 2 commits April 21, 2023 12:05

sentenceinfo: implement exportSentenceInfo(infile, outfile) in lvdocview

e53e9ab

cr3qt: add -s CLI wrapper around lvdocview.exportSentenceInfo(inF,outF)

7bacec4

teleshoes force-pushed the audiobook_in_tts branch from 38ac2a6 to f6eb6a5 Compare April 21, 2023 17:02

audiobook: allow different file extensions for audio files vs wordtiming

c701b7a

teleshoes force-pushed the audiobook_in_tts branch from abcb4ad to c701b7a Compare April 25, 2023 14:40

sentenceinfo: fix bug where last sentence could be omitted

8813539

teleshoes added 11 commits April 25, 2023 15:58

sentenceinfo: remove unnecessary call to checkRender()

4e8a4be

sentenceinfo: do not call thisSentenceStart() twice when not necessary

01e97ff

audiobook: allow scripts other than latin for splitting sentences

e86325f

audiobook: allow scripts other than latin for comparing words

1c8b46a

audiobook: remove unused import android.util.Log

dc25705

audiobook: add a main() method for debugging wordtiming+sentenceinfo

a66052c

audiobook: add non-android implementation of L.java for debugging

6f73d5e

audiobook: add script to run WordTimingAudiobookMatcher java class

b5662a6

sentenceinfo: treat semi-colon as sentence break

590cdb9

- in addition to period, exclamation point, and question mark
- very large sentence-selections are frequent problems in many books
- semi-colon usually splits TTS chunks reasonably well

text: treat ':' and ';' like '.'+'!'+'?' when measuring text

d6fbfb8

wordtimings: do not store wordtiming CSV lines in RAM while processing

46f0d5e
