modify intro texts
ARDiT-TTS committed May 31, 2024
1 parent c95fd7f commit d873a29
Showing 2 changed files with 10 additions and 6 deletions.
8 changes: 5 additions & 3 deletions index.html
@@ -1259,10 +1259,11 @@ <h3>Prompted Generation</h3>
<p class="lead">* please scroll horizontally to explore additional columns in the table.</p>
</div>
<div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded">
-<h3>Speech Inpainting</h3>
+<h3>Speech Editing</h3>
<p class="lead">
-In this task, we evaluate on test set C. We mask fragments of the waveforms, and ask the models to generate the full waveforms. The masked sections are highlighted within the text.
-All speakers are unseen for all systems during training.
+We evaluated the performance of text-based speech editing on the speech inpainting task.
+The models generate complete waveforms given complete texts and partially masked waveforms. The masked sections are highlighted within the text.
+All speakers were unseen by all systems during training. The following 20 test cases are from test set C (long).
</p>
<div class="table-responsive" style="overflow-x: scroll">
<table class="table table-sm">
@@ -2046,6 +2047,7 @@ <h3>Prompted Generation (Comparing with Proprietary Systems)</h3>
<p class="lead">
In this section, we compare our system with proprietary systems including NaturalSpeech 2/3, MegaTTS 2, UniAudio, CLaM-TTS, VoiceBox, and VALL-E. The source codes and model weights for these models are not available.
The following samples are obtained from their online demo pages. All waveforms are downsampled to 16kHz.
+Please note that ARDiT's performance is influenced by the fact that the prompt waveforms are in 16kHz, not 24kHz, and the prompt texts are not semantically coherent with the target texts.
</p>
<p class="lead">1~4 are obtained from
<a href="https://speechresearch.github.io/naturalspeech3/">NaturalSpeech 3</a> and 5~20 are obtained from
8 changes: 5 additions & 3 deletions index.py
@@ -49,11 +49,12 @@

with div(cls="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded"):
from inpaint import get_table
-h3("Speech Inpainting")
+h3("Speech Editing")
p(
"""
-In this task, we evaluate on test set C. We mask fragments of the waveforms, and ask the models to generate the full waveforms. The masked sections are highlighted within the text.
-All speakers are unseen for all systems during training.
+We evaluated the performance of text-based speech editing on the speech inpainting task.
+The models generate complete waveforms given complete texts and partially masked waveforms. The masked sections are highlighted within the text.
+All speakers were unseen by all systems during training. The following 20 test cases are from test set C (long).
""",
cls="lead"
)
@@ -67,6 +68,7 @@
"""
In this section, we compare our system with proprietary systems including NaturalSpeech 2/3, MegaTTS 2, UniAudio, CLaM-TTS, VoiceBox, and VALL-E. The source codes and model weights for these models are not available.
The following samples are obtained from their online demo pages. All waveforms are downsampled to 16kHz.
+Please note that ARDiT's performance is influenced by the fact that the prompt waveforms are in 16kHz, not 24kHz, and the prompt texts are not semantically coherent with the target texts.
""",
cls="lead"
)
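The inpainting setup described in the new intro text (the models receive the complete text and a partially masked waveform, and must generate the complete waveform) can be illustrated with a minimal Python sketch of building such a masked input. It assumes a mono waveform stored as a plain list of float samples at 16 kHz, with masked regions given as (start, end) times in seconds. `mask_waveform` and its parameters are hypothetical and are not taken from the ARDiT code. The masking actually used to build test set C is not described in this commit.

```python
from typing import List, Tuple

# Assumption: 16 kHz, the rate the demo page uses for its comparison samples.
SAMPLE_RATE = 16_000


def mask_waveform(
    waveform: List[float],
    spans_sec: List[Tuple[float, float]],
    sample_rate: int = SAMPLE_RATE,
    fill: float = 0.0,
) -> Tuple[List[float], List[bool]]:
    """Hypothetical helper: blank out time spans of a waveform.

    Returns (masked_waveform, mask), where mask[i] is True for samples
    that an inpainting model would have to regenerate.
    """
    mask = [False] * len(waveform)
    for start, end in spans_sec:
        if not 0 <= start < end:
            raise ValueError(f"invalid span: {(start, end)}")
        # Convert seconds to sample indices, clipping to the waveform length.
        lo = int(round(start * sample_rate))
        hi = min(int(round(end * sample_rate)), len(waveform))
        for i in range(lo, hi):
            mask[i] = True
    # Replace masked samples with a fill value; keep the rest unchanged.
    masked = [fill if m else x for x, m in zip(waveform, mask)]
    return masked, mask


# Usage: mask 0.5 s to 1.0 s of a 2-second dummy signal.
wave = [0.1] * (2 * SAMPLE_RATE)
masked, mask = mask_waveform(wave, [(0.5, 1.0)])
print(sum(mask))  # 8000
```

Returning the boolean mask alongside the masked samples mirrors the demo page's convention of highlighting the masked sections: the mask records which region was edited, independent of the fill value.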
