Modern gigaspeech recipe#2090
Conversation
- Add `--on-the-fly` (default true) to compute_fbank_gigaspeech.py and compute_fbank_gigaspeech_splits.py: skip storing fbank features and only produce the trimmed cut manifests, since zipformer extracts features on-the-fly during training. - Parallelize the on-the-fly split trimming with a process pool (the GPU feature-compute path stays serial). - Add `use_musan` (default false) to prepare.sh, skipping the musan download/manifest/fbank stages (0, 2, 7) unless `--use-musan true`. - Default `--enable-musan` to false in zipformer/asr_datamodule.py to match. - Tune train dataloader for on-the-fly: num_workers default 8, add `--prefetch-factor` (default 4, was hardcoded 16), guard num_workers=0. - Fix stage log typo and the stage 4 subset description. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
update training update update update update update Update egs/gigaspeech/ASR/zipformer/train.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Update ctc_decode.py Update ctc_decode.py
c7895d5 to
ab2f217
Compare
📝 WalkthroughWalkthroughThe PR adds on-the-fly fbank preparation for GigaSpeech, updates Zipformer training for bf16 and torchrun launches, changes dataloader defaults, adds a cluster training launcher, and extends CTC decoding with new search modes and result handling. ChangesGigaSpeech On-the-fly Fbank Preparation
Zipformer Training, Launch, and Data Loading
CTC Decoding Updates
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
egs/gigaspeech/ASR/zipformer/run_train_cluster.sh (1)
20-31: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winPreserve caller-provided cluster environment settings.
These hard-coded exports override scheduler/site-tuned NCCL and OMP values, which can break launches on clusters without
eth/mlx5or with different transport settings. Prefer defaults that users can override.Proposed fix
-export NCCL_IB_TC=136 -export NCCL_IB_SL=5 -export NCCL_IB_GID_INDEX=3 -export NCCL_SOCKET_IFNAME=eth -export NCCL_DEBUG=WARN -export NCCL_IB_HCA=mlx5 -export NCCL_IB_TIMEOUT=22 -export NCCL_IB_QPS_PER_CONNECTION=8 -export NCCL_MIN_NCHANNELS=4 -export NCCL_NET_PLUGIN=none -export OMP_NUM_THREADS=4 -export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True +export NCCL_IB_TC=${NCCL_IB_TC:-136} +export NCCL_IB_SL=${NCCL_IB_SL:-5} +export NCCL_IB_GID_INDEX=${NCCL_IB_GID_INDEX:-3} +export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-eth} +export NCCL_DEBUG=${NCCL_DEBUG:-WARN} +export NCCL_IB_HCA=${NCCL_IB_HCA:-mlx5} +export NCCL_IB_TIMEOUT=${NCCL_IB_TIMEOUT:-22} +export NCCL_IB_QPS_PER_CONNECTION=${NCCL_IB_QPS_PER_CONNECTION:-8} +export NCCL_MIN_NCHANNELS=${NCCL_MIN_NCHANNELS:-4} +export NCCL_NET_PLUGIN=${NCCL_NET_PLUGIN:-none} +export OMP_NUM_THREADS=${OMP_NUM_THREADS:-4} +export PYTORCH_CUDA_ALLOC_CONF=${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@egs/gigaspeech/ASR/zipformer/run_train_cluster.sh` around lines 20 - 31, These hard-coded environment exports in run_train_cluster.sh override caller- or scheduler-provided NCCL and OMP settings, so update the script to only set safe defaults when variables are unset and avoid forcing cluster-specific values like NCCL_SOCKET_IFNAME and NCCL_IB_HCA. Use the existing export block near the top of the script to switch to conditional assignments or parameterized defaults so users can override transport, timeout, and threading settings without the script clobbering them.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@egs/gigaspeech/ASR/zipformer/ctc_decode.py`:
- Around line 386-391: Reuse the CTC prefix beam search process pool across
batches instead of creating it inside each call to ctc_prefix_beam_search().
Thread a shared dataset-scoped pool through decode_dataset() into
decode_one_batch(), and pass it as process_pool when calling
ctc_prefix_beam_search() so the same workers are reused for all batches. Make
sure the pool lifecycle is managed at the dataset level and closed once after
decoding completes.
- Around line 740-741: Honor the --bpe-model option in the SentencePiece loading
path: the current branch in ctc_decode.py hardcodes params.lang_dir /
"bpe.model" inside the bpe_model load logic, which ignores a user-supplied
model. Update the code around the bpe_model / spm.SentencePieceProcessor()
initialization to load from the parsed bpe-model argument when provided, and
only fall back to the default language-directory model when no override is set.
---
Nitpick comments:
In `@egs/gigaspeech/ASR/zipformer/run_train_cluster.sh`:
- Around line 20-31: These hard-coded environment exports in
run_train_cluster.sh override caller- or scheduler-provided NCCL and OMP
settings, so update the script to only set safe defaults when variables are
unset and avoid forcing cluster-specific values like NCCL_SOCKET_IFNAME and
NCCL_IB_HCA. Use the existing export block near the top of the script to switch
to conditional assignments or parameterized defaults so users can override
transport, timeout, and threading settings without the script clobbering them.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f8a3986d-832c-4ad7-acde-fc5fa03d079b
📒 Files selected for processing (4)
egs/gigaspeech/ASR/zipformer/asr_datamodule.pyegs/gigaspeech/ASR/zipformer/ctc_decode.pyegs/gigaspeech/ASR/zipformer/run_train_cluster.shegs/gigaspeech/ASR/zipformer/train.py
🚧 Files skipped from review as they are similar to previous changes (1)
- egs/gigaspeech/ASR/zipformer/asr_datamodule.py
| if params.decoding_method == "ctc-prefix-beam-search": | ||
| token_ids = ctc_prefix_beam_search( | ||
| ctc_output=ctc_output, | ||
| encoder_out_lens=encoder_out_lens, | ||
| beam=params.beam, | ||
| ) |
There was a problem hiding this comment.
🚀 Performance & Scalability | 🟠 Major | 🏗️ Heavy lift
Reuse the CTC prefix beam process pool across batches.
ctc_prefix_beam_search() creates and tears down a Pool() when process_pool is omitted; calling it once per batch can make prefix decoding spend most of its time forking workers. Thread a dataset-scoped pool through decode_dataset()/decode_one_batch() and close it once.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@egs/gigaspeech/ASR/zipformer/ctc_decode.py` around lines 386 - 391, Reuse the
CTC prefix beam search process pool across batches instead of creating it inside
each call to ctc_prefix_beam_search(). Thread a shared dataset-scoped pool
through decode_dataset() into decode_one_batch(), and pass it as process_pool
when calling ctc_prefix_beam_search() so the same workers are reused for all
batches. Make sure the pool lifecycle is managed at the dataset level and closed
once after decoding completes.
| bpe_model = spm.SentencePieceProcessor() | ||
| bpe_model.load(str(params.lang_dir / "bpe.model")) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Honor the --bpe-model argument.
The parser exposes --bpe-model, but this branch always loads params.lang_dir / "bpe.model", so a custom BPE model is silently ignored.
Proposed fix
bpe_model = spm.SentencePieceProcessor()
- bpe_model.load(str(params.lang_dir / "bpe.model"))
+ bpe_model.load(str(params.bpe_model))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| bpe_model = spm.SentencePieceProcessor() | |
| bpe_model.load(str(params.lang_dir / "bpe.model")) | |
| bpe_model = spm.SentencePieceProcessor() | |
| bpe_model.load(str(params.bpe_model)) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@egs/gigaspeech/ASR/zipformer/ctc_decode.py` around lines 740 - 741, Honor the
--bpe-model option in the SentencePiece loading path: the current branch in
ctc_decode.py hardcodes params.lang_dir / "bpe.model" inside the bpe_model load
logic, which ignores a user-supplied model. Update the code around the bpe_model
/ spm.SentencePieceProcessor() initialization to load from the parsed bpe-model
argument when provided, and only fall back to the default language-directory
model when no override is set.
Summary by CodeRabbit
--on-the-flyfbank workflow to generate trimmed cut manifests without precomputing features.--use-bf16for mixed-precision training, plus atorchrun-ready cluster launcher.ctc-greedy-searchandctc-prefix-beam-search.dev/testsets.