Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow downloading extra formats in the demo #617

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

roedoejet
Copy link
Member

PR Goal?

This PR allows us to download any of the available output formats through the demo (TextGrid, ReadAlong, Spectrogram, etc). We can also optional disable certain output formats

Fixes?

This adds to fixing #607

Feedback sought?

Test it out, confirm it works as expected.

Priority?

medium

Tests added?

We currently don't test the demo - but we're moving to do that in @joanise 's PR

How to test?

spin up a demo and try synthesizing other formats

Confidence?

medium-high (I've tested it and it works well on my Mac)

Version change?

Related PRs?

Copy link

semanticdiff-com bot commented Dec 20, 2024

Review changes with  SemanticDiff

Changed Files
File Status
  everyvoice/demo/app.py  3% smaller
  everyvoice/cli.py  0% smaller
  everyvoice/model/feature_prediction/FastSpeech2_lightning  0% smaller

Copy link

codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 10.52632% with 34 lines in your changes missing coverage. Please review.

Project coverage is 76.07%. Comparing base (1c2aae6) to head (aa365d0).

Files with missing lines Patch % Lines
everyvoice/demo/app.py 10.52% 34 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #617      +/-   ##
==========================================
- Coverage   76.75%   76.07%   -0.69%     
==========================================
  Files          46       46              
  Lines        3446     3481      +35     
  Branches      470      479       +9     
==========================================
+ Hits         2645     2648       +3     
- Misses        700      732      +32     
  Partials      101      101              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

CLI load time: 0:00.35
Pull Request HEAD: aa365d07cdc949f8eab4aeca0473feeec5d81188
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package
import time:      1054 |     103621 |     typer.main
import time:       291 |     122599 |   typer
import time:       244 |     101070 |       everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli.cli
import time:       179 |     101686 |     everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli
import time:        18 |     101704 |   everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli.preprocess
import time:      7947 |     258926 | everyvoice.cli

Copy link
Member

@joanise joanise left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, though I have a number of comments and suggestions.

In general, I find it not very intuitive to have to re-synthesize if I change the output format. I guess from how things work, it's probably unavoidable, but it's not ideal UX. Maybe the File Output box could have a hint indicating that Output Format is where you change what's available for download here?

When you keep just the wav output format, it's confusing that the File Output box is present but you can't interact with it. I wanted to download my audio file, and I tried to find it in the File Output box. It took me a little while to locate the download button in the Audio box above.

@@ -616,6 +616,12 @@ def demo(
"-s",
help="Specify speakers to be included in the demo. Example: everyvoice demo <path_to_text_to_spec_model> <path_to_spec_to_wav_model> --speaker speaker_1 --speaker Sue",
),
outputs: List[str] = typer.Option(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to enumerate the valid options in the help message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for --accelerator, while I'm thinking about it... And for --language and --speaker we should state that they have to be language(s) and speaker(s) known to the model.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this PR, please address listing valid values for --output-format, fixing the other help messages is gravy and could go into a separate PR or issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for --accelerator, while I'm thinking about it... And for --language and --speaker we should state that they have to be language(s) and speaker(s) known to the model.

I think they do get listed, don't they? Like if you type a speaker that doesn't exist, I thought the error message listed out all the possible speakers. The output formats are dependent on the version of everyvoice installed, so we could include that in the help message, but the language and speaker are model-dependent, so we wouldn't be able to include the lists of those in the help message, just in the error message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that the everyvoice demo -h message should say something like "valid values are the language(s) and speaker(s) the model was trained on", or something to that effect, maybe more concisely. As the documentation stands, if you're not familiar with things yet, it's a bit mysterious how you're supposed to know what values you can use there.
And I know if you're just trained things, it's going to be obvious, but the point the of the help message is to support you when the information is not already obvious to you.

Comment on lines +305 to +308
else:
print(
f"Attention: This model is not able to produce '{output}' as an output. The '{output}' option will not be available for selection. Please choose from the following possible outputs: {', '.join(possible_outputs)}"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be a fatal error with an immediate exit, and the message is misleading: it's not that the model can't produce the requested output, it's that the software has no implementation for it. Right now, everyvoice demo -O foo fs2.ckpt voc.ckpt prints this message about foo and then continues anyway and crashes with an exception a few lines later.

This is really CLI error checking, it should happen much earlier in this function, in particular before we load any checkpoint, so the error is dumped right away without having to wait 20 seconds or more for models to load first. You might get all this for nearly free if you define the list of valid values for outputs in cli.py's demo() function as I already suggested elsewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the RAS output specifiers are readalong_xml and readalong_html with an underscore instead of a hyphen like in everyvoice synthesize from-text. They should be unified, using hyphens here too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed!

wav_output = wav_writer.get_filename(basename, speaker, language)
file_writer = None
file_output = None
if output_format == SynthesizeOutputFormats.readalong_html.name:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and all the other callback constructor calls really ought to be able to be replaced by a single call to get_synthesis_output_callbacks, no?

I realize you need the wav writer to create the RAS_html writer, but that's already done too.

Refactoring suggestion: have get_synthesis_output_callbacks return a dict where the key are the output types, and the values are the writers. Then here you can use writers["wav"] and writers[output_format] to access the two writers you need, having passed ["wav", output_format] (or the proper Enum form if need be) as the output_type argument to get_synthesis_output_callbacks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, this is a good idea. I would definitely be in favour of this refactor, which I think would clean up some of the acrobatics we're currently doing in order to get the wav writer separately.

@roedoejet
Copy link
Member Author

In general, I find it not very intuitive to have to re-synthesize if I change the output format. I guess from how things work, it's probably unavoidable, but it's not ideal UX. Maybe the File Output box could have a hint indicating that Output Format is where you change what's available for download here?

I agree, but I'm not sure gradio provides a callback for when the output format gets changed. It's possible, but given how expensive it is to synthesize (depending on the underlying hardware), I'm not super opposed to this UX, despite agreeing that it's clunky, as you say.

When you keep just the wav output format, it's confusing that the File Output box is present but you can't interact with it. I wanted to download my audio file, and I tried to find it in the File Output box. It took me a little while to locate the download button in the Audio box above.

Yes, the problem is that I don't believe gradio allows you to dynamically change the interface once it's been rendered. So we might be stuck with that. Note that it doesn't get rendered if the only possible output format is "wav" (i.e. if the demo command disables all the other output formats).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants