Modern gigaspeech recipe #2090

Open
yfyeung wants to merge 3 commits into k2-fsa:master from yfyeung:modern_gigaspeech

@yfyeung yfyeung commented Jun 23, 2026

Collaborator

Summary by CodeRabbit

  • New Features
    • Added an --on-the-fly fbank workflow to generate trimmed cut manifests without precomputing features.
    • Added --use-bf16 for mixed-precision training, plus a torchrun-ready cluster launcher.
    • Expanded decoding with ctc-greedy-search and ctc-prefix-beam-search.
  • Improvements
    • Made ASR preparation steps configurable, including conditional MUSAN processing and on-the-fly feature flags.
    • Updated dataloader defaults (workers/prefetch/persistence/pin-memory) and improved on-the-fly dataset filtering.
    • Refined distributed setup and adjusted evaluation decoding to use dev/test sets.

yfyeung and others added 2 commits June 21, 2026 19:59
- Add `--on-the-fly` (default true) to compute_fbank_gigaspeech.py and
  compute_fbank_gigaspeech_splits.py: skip storing fbank features and only
  produce the trimmed cut manifests, since zipformer extracts features
  on-the-fly during training.
- Parallelize the on-the-fly split trimming with a process pool (the GPU
  feature-compute path stays serial).
- Add `use_musan` (default false) to prepare.sh, skipping the musan
  download/manifest/fbank stages (0, 2, 7) unless `--use-musan true`.
- Default `--enable-musan` to false in zipformer/asr_datamodule.py to match.
- Tune train dataloader for on-the-fly: num_workers default 8, add
  `--prefetch-factor` (default 4, was hardcoded 16), guard num_workers=0.
- Fix stage log typo and the stage 4 subset description.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
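
The `num_workers=0` guard described in the commit message can be sketched as follows. This is a minimal illustration and not the code in `asr_datamodule.py`: the helper name `build_train_loader_kwargs` is hypothetical, and enabling `persistent_workers` is an assumption. It relies on the documented behaviour that `torch.utils.data.DataLoader` raises `ValueError` when `prefetch_factor` or `persistent_workers=True` is passed together with `num_workers=0`.

```python
from typing import Any, Dict


def build_train_loader_kwargs(
    num_workers: int = 8, prefetch_factor: int = 4
) -> Dict[str, Any]:
    """Return keyword arguments intended for torch.utils.data.DataLoader.

    Hypothetical helper; the defaults mirror the values named in the
    commit message (num_workers=8, prefetch_factor=4).
    """
    if num_workers < 0:
        raise ValueError("num_workers must be non-negative")

    kwargs: Dict[str, Any] = {"num_workers": num_workers}
    if num_workers > 0:
        # Worker-only options: DataLoader rejects them when batches are
        # loaded in the main process (num_workers == 0).
        kwargs["prefetch_factor"] = prefetch_factor
        kwargs["persistent_workers"] = True  # assumption, see lead-in
    return kwargs
```

The returned dictionary would be unpacked into the loader call, for example `DataLoader(dataset, **build_train_loader_kwargs(args.num_workers, args.prefetch_factor))`.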
gemini-code-assist[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as low quality.

coderabbitai[bot]

This comment was marked as duplicate.

update training

update

update

update

update

update

Update egs/gigaspeech/ASR/zipformer/train.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update ctc_decode.py

Update ctc_decode.py
@yfyeung force-pushed the modern_gigaspeech branch from c7895d5 to ab2f217 on June 24, 2026 10:01
@k2-fsa deleted a comment from coderabbitai Bot on Jun 24, 2026
coderabbitai Bot commented Jun 24, 2026

📝 Walkthrough

The PR adds on-the-fly fbank preparation for GigaSpeech, updates Zipformer training for bf16 and torchrun launches, changes dataloader defaults, adds a cluster training launcher, and extends CTC decoding with new search modes and result handling.

Changes

GigaSpeech On-the-fly Fbank Preparation

| Layer / File(s) | Summary |
| --- | --- |
| On-the-fly fbank script: `egs/gigaspeech/ASR/local/compute_fbank_gigaspeech.py` | Adds `--on-the-fly` parsing, updates `compute_fbank_gigaspeech(on_the_fly: bool = True)`, and skips feature extraction when the flag is enabled. |
| Split trimming and prepare.sh wiring: `egs/gigaspeech/ASR/local/compute_fbank_gigaspeech_splits.py`, `egs/gigaspeech/ASR/prepare.sh` | Adds parallel trimmed-manifest generation for on-the-fly splits and wires `prepare.sh` to pass `--on-the-fly` and conditionally run musan stages. |
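
A boolean `--on-the-fly` flag with a default of true could be parsed roughly as below. This is a sketch under assumptions: the local `str2bool` helper, `get_parser`, and `plan_steps` are written here for illustration, and the step names are labels rather than the functions the scripts actually call.

```python
import argparse
from typing import List


def str2bool(value: str) -> bool:
    """Parse textual booleans such as "true" / "false"."""
    lowered = value.lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def get_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--on-the-fly",
        type=str2bool,
        default=True,
        help="If true, only write trimmed cut manifests; do not store fbank.",
    )
    return parser


def plan_steps(on_the_fly: bool) -> List[str]:
    """List the illustrative steps applied to one split."""
    steps = ["trim_cuts"]
    if not on_the_fly:
        # Feature storage is only needed when training reads precomputed fbank.
        steps.append("compute_and_store_fbank")
    steps.append("save_manifest")
    return steps
```

With this layout, `--on-the-fly false` restores the precomputed-feature path, while omitting the flag produces manifests only.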

Zipformer Training, Launch, and Data Loading

| Layer / File(s) | Summary |
| --- | --- |
| bf16 AMP and torchrun training runtime: `egs/gigaspeech/ASR/zipformer/train.py`, `egs/gigaspeech/ASR/zipformer/run_train_cluster.sh` | Adds bf16/autocast support, torchrun-aware distributed setup, local-rank device placement, `remove_short_utt` duration fallback, and a new cluster launcher. |
| DataLoader defaults and prefetch behavior: `egs/gigaspeech/ASR/zipformer/asr_datamodule.py` | Flips on-the-fly and musan defaults, increases worker count, adds `--prefetch-factor`, and updates train/validation loader settings. |
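
A torchrun-aware setup generally starts by reading the environment variables that `torchrun` exports for each worker (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). The sketch below shows that detection step only; `DistEnv` and `read_dist_env` are hypothetical names and are not taken from `train.py`.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistEnv(NamedTuple):
    rank: int
    local_rank: int
    world_size: int
    launched_by_torchrun: bool


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read the rank layout from torchrun-style environment variables."""
    env = os.environ if env is None else env
    required = ("RANK", "LOCAL_RANK", "WORLD_SIZE")
    if not all(name in env for name in required):
        # Not started by torchrun: fall back to a single-process layout.
        return DistEnv(rank=0, local_rank=0, world_size=1, launched_by_torchrun=False)
    return DistEnv(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        launched_by_torchrun=True,
    )
```

The `local_rank` value is what a training script would typically use to pick its GPU, for example `torch.device("cuda", dist_env.local_rank)`.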

CTC Decoding Updates

| Layer / File(s) | Summary |
| --- | --- |
| CTC decode modes and result processing: `egs/gigaspeech/ASR/zipformer/ctc_decode.py` | Adds greedy and prefix-beam CTC decoding, post-processes transcripts, updates result filenames and evaluation cuts, and adjusts graph setup for the new modes. |
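
For readers unfamiliar with the greedy mode, CTC greedy search takes the best token per frame, merges consecutive repeats, and then drops blanks. The following is a pure-Python sketch of that general technique, not the implementation in `ctc_decode.py`; the function name and the assumption that the blank token has id 0 are illustrative.

```python
from itertools import groupby
from typing import List, Sequence


def ctc_greedy_decode(
    log_probs: Sequence[Sequence[float]], blank_id: int = 0
) -> List[int]:
    """Decode one utterance from per-frame scores of shape (frames, vocab)."""
    # 1. Pick the highest-scoring token in every frame.
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in log_probs
    ]
    # 2. Merge runs of the same token into a single occurrence.
    merged = [token for token, _ in groupby(best_path)]
    # 3. Remove blanks; a blank between two equal tokens keeps them distinct.
    return [token for token in merged if token != blank_id]
```

Prefix beam search differs in that it keeps several candidate prefixes per frame and sums the probability of all alignments that map to the same prefix, which is why it costs more per utterance.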

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • csukuangfj

Poem

🐇 I hop through manifests, light on the trail,
on-the-fly features now skip the stored bale.
bf16 sparkles, torchrun sets the pace,
and prefix-beam rabbits dash into place.
With prefetch-factor humming, the training stays spry —
this bunny gives a cheerful ear-flick: “whee!”

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title matches the main goal: modernizing the GigaSpeech recipe. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
egs/gigaspeech/ASR/zipformer/run_train_cluster.sh (1)

20-31: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Preserve caller-provided cluster environment settings.

These hard-coded exports override scheduler/site-tuned NCCL and OMP values, which can break launches on clusters without eth/mlx5 or with different transport settings. Prefer defaults that users can override.

Proposed fix
-export NCCL_IB_TC=136
-export NCCL_IB_SL=5
-export NCCL_IB_GID_INDEX=3
-export NCCL_SOCKET_IFNAME=eth
-export NCCL_DEBUG=WARN
-export NCCL_IB_HCA=mlx5
-export NCCL_IB_TIMEOUT=22
-export NCCL_IB_QPS_PER_CONNECTION=8
-export NCCL_MIN_NCHANNELS=4
-export NCCL_NET_PLUGIN=none
-export OMP_NUM_THREADS=4
-export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+export NCCL_IB_TC=${NCCL_IB_TC:-136}
+export NCCL_IB_SL=${NCCL_IB_SL:-5}
+export NCCL_IB_GID_INDEX=${NCCL_IB_GID_INDEX:-3}
+export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-eth}
+export NCCL_DEBUG=${NCCL_DEBUG:-WARN}
+export NCCL_IB_HCA=${NCCL_IB_HCA:-mlx5}
+export NCCL_IB_TIMEOUT=${NCCL_IB_TIMEOUT:-22}
+export NCCL_IB_QPS_PER_CONNECTION=${NCCL_IB_QPS_PER_CONNECTION:-8}
+export NCCL_MIN_NCHANNELS=${NCCL_MIN_NCHANNELS:-4}
+export NCCL_NET_PLUGIN=${NCCL_NET_PLUGIN:-none}
+export OMP_NUM_THREADS=${OMP_NUM_THREADS:-4}
+export PYTORCH_CUDA_ALLOC_CONF=${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@egs/gigaspeech/ASR/zipformer/run_train_cluster.sh` around lines 20 - 31,
These hard-coded environment exports in run_train_cluster.sh override caller- or
scheduler-provided NCCL and OMP settings, so update the script to only set safe
defaults when variables are unset and avoid forcing cluster-specific values like
NCCL_SOCKET_IFNAME and NCCL_IB_HCA. Use the existing export block near the top
of the script to switch to conditional assignments or parameterized defaults so
users can override transport, timeout, and threading settings without the script
clobbering them.
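
If the launcher logic were ever moved into Python, the same "default only when unset" behaviour could be sketched as below. The helper name `apply_env_defaults` is hypothetical, and only two of the variables from the proposed fix are shown. Like the shell form `${VAR:-default}`, it also applies the default when the variable is set but empty.

```python
import os
from typing import Dict, Mapping, MutableMapping, Optional

# Two of the defaults from the proposed fix, used here as sample data.
DEFAULTS = {"NCCL_DEBUG": "WARN", "OMP_NUM_THREADS": "4"}


def apply_env_defaults(
    defaults: Mapping[str, str],
    env: Optional[MutableMapping[str, str]] = None,
) -> Dict[str, str]:
    """Set each variable only if the caller left it unset or empty.

    Returns the subset of defaults that was actually applied.
    """
    env = os.environ if env is None else env
    applied: Dict[str, str] = {}
    for name, value in defaults.items():
        if not env.get(name):  # unset or empty string
            env[name] = value
            applied[name] = value
    return applied
```

Returning the applied subset makes it easy to log which values came from the script rather than from the scheduler.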

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f8a3986d-832c-4ad7-acde-fc5fa03d079b

📥 Commits

Reviewing files that changed from the base of the PR and between c7895d5 and ab2f217.

📒 Files selected for processing (4)
  • egs/gigaspeech/ASR/zipformer/asr_datamodule.py
  • egs/gigaspeech/ASR/zipformer/ctc_decode.py
  • egs/gigaspeech/ASR/zipformer/run_train_cluster.sh
  • egs/gigaspeech/ASR/zipformer/train.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • egs/gigaspeech/ASR/zipformer/asr_datamodule.py

Comment on lines +386 to +391
if params.decoding_method == "ctc-prefix-beam-search":
    token_ids = ctc_prefix_beam_search(
        ctc_output=ctc_output,
        encoder_out_lens=encoder_out_lens,
        beam=params.beam,
    )

🚀 Performance & Scalability | 🟠 Major | 🏗️ Heavy lift

Reuse the CTC prefix beam process pool across batches.

ctc_prefix_beam_search() creates and tears down a Pool() when process_pool is omitted; calling it once per batch can make prefix decoding spend most of its time forking workers. Thread a dataset-scoped pool through decode_dataset()/decode_one_batch() and close it once.
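
The pool lifecycle suggested here can be sketched as follows. This is an illustration under assumptions: the three functions are simplified stand-ins for the ones in `ctc_decode.py`, `decode_utterance` is a placeholder rather than a real prefix beam search, and `ThreadPool` is used in place of a process `Pool` only to keep the sketch portable (both expose the same `map` interface).

```python
from multiprocessing.pool import ThreadPool
from typing import Iterable, List


def decode_utterance(frames: List[int]) -> List[int]:
    """Placeholder for per-utterance decoding (not the real algorithm)."""
    return sorted(set(frames))


def decode_one_batch(batch: List[List[int]], pool: ThreadPool) -> List[List[int]]:
    # The batch-level function borrows the pool; it never creates or closes it.
    return pool.map(decode_utterance, batch)


def decode_dataset(
    batches: Iterable[List[List[int]]], num_workers: int = 2
) -> List[List[int]]:
    results: List[List[int]] = []
    # One pool for the whole dataset: workers start once and are shut down
    # when the with-block exits, after the last batch has been decoded.
    with ThreadPool(num_workers) as pool:
        for batch in batches:
            results.extend(decode_one_batch(batch, pool))
    return results
```

In the recipe, the equivalent change would pass the shared pool as the `process_pool` argument mentioned in the review comment.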

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@egs/gigaspeech/ASR/zipformer/ctc_decode.py` around lines 386 - 391, Reuse the
CTC prefix beam search process pool across batches instead of creating it inside
each call to ctc_prefix_beam_search(). Thread a shared dataset-scoped pool
through decode_dataset() into decode_one_batch(), and pass it as process_pool
when calling ctc_prefix_beam_search() so the same workers are reused for all
batches. Make sure the pool lifecycle is managed at the dataset level and closed
once after decoding completes.

Comment on lines 740 to 741
bpe_model = spm.SentencePieceProcessor()
bpe_model.load(str(params.lang_dir / "bpe.model"))

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Honor the --bpe-model argument.

The parser exposes --bpe-model, but this branch always loads params.lang_dir / "bpe.model", so a custom BPE model is silently ignored.

Proposed fix
         bpe_model = spm.SentencePieceProcessor()
-        bpe_model.load(str(params.lang_dir / "bpe.model"))
+        bpe_model.load(str(params.bpe_model))
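
If `--bpe-model` may be left unset, a fallback to the language directory keeps the previous behaviour for existing invocations. A minimal sketch of that resolution step, assuming the parsed value is `None` or empty when the flag is not given; `resolve_bpe_model_path` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def resolve_bpe_model_path(bpe_model: Optional[PathLike], lang_dir: PathLike) -> Path:
    """Prefer an explicit --bpe-model value, else use <lang_dir>/bpe.model."""
    if bpe_model:
        # A user-supplied model always wins over the default location.
        return Path(bpe_model)
    return Path(lang_dir) / "bpe.model"
```

The loading call would then become `bpe_model.load(str(resolve_bpe_model_path(params.bpe_model, params.lang_dir)))`, assuming both attributes exist on `params` as the review comment implies.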
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@egs/gigaspeech/ASR/zipformer/ctc_decode.py` around lines 740 - 741, Honor the
--bpe-model option in the SentencePiece loading path: the current branch in
ctc_decode.py hardcodes params.lang_dir / "bpe.model" inside the bpe_model load
logic, which ignores a user-supplied model. Update the code around the bpe_model
/ spm.SentencePieceProcessor() initialization to load from the parsed bpe-model
argument when provided, and only fall back to the default language-directory
model when no override is set.

1 participant