Modern gigaspeech recipe by yfyeung · Pull Request #2090 · k2-fsa/icefall

yfyeung · 2026-06-23T08:34:12Z

Summary by CodeRabbit

New Features
- Added an --on-the-fly fbank workflow to generate trimmed cut manifests without precomputing features.
- Added --use-bf16 for mixed-precision training, plus a torchrun-ready cluster launcher.
- Expanded decoding with ctc-greedy-search and ctc-prefix-beam-search.
Improvements
- Made ASR preparation steps configurable, including conditional MUSAN processing and on-the-fly feature flags.
- Updated dataloader defaults (workers/prefetch/persistence/pin-memory) and improved on-the-fly dataset filtering.
- Refined distributed setup and adjusted evaluation decoding to use dev/test sets.

- Add `--on-the-fly` (default true) to compute_fbank_gigaspeech.py and compute_fbank_gigaspeech_splits.py: skip storing fbank features and only produce the trimmed cut manifests, since zipformer extracts features on-the-fly during training.
- Parallelize the on-the-fly split trimming with a process pool (the GPU feature-compute path stays serial).
- Add `use_musan` (default false) to prepare.sh, skipping the musan download/manifest/fbank stages (0, 2, 7) unless `--use-musan true`.
- Default `--enable-musan` to false in zipformer/asr_datamodule.py to match.
- Tune train dataloader for on-the-fly: num_workers default 8, add `--prefetch-factor` (default 4, was hardcoded 16), guard num_workers=0.
- Fix stage log typo and the stage 4 subset description.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
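
The `num_workers=0` guard in the description matters because PyTorch's `DataLoader` raises an error when `prefetch_factor` or `persistent_workers=True` is passed without worker processes. A minimal sketch of such a guard follows. The helper name `build_train_loader_kwargs` is hypothetical and is not the code in `asr_datamodule.py`, and enabling `persistent_workers` is an assumption made for illustration.

```python
from typing import Any, Dict


def build_train_loader_kwargs(
    num_workers: int = 8, prefetch_factor: int = 4
) -> Dict[str, Any]:
    """Build keyword arguments for torch.utils.data.DataLoader.

    Hypothetical helper; the defaults mirror the values quoted in the
    PR description (num_workers 8, prefetch factor 4).
    """
    if num_workers < 0:
        raise ValueError("num_workers must be non-negative")
    kwargs: Dict[str, Any] = {"num_workers": num_workers}
    if num_workers > 0:
        # These options are only accepted when worker processes exist.
        kwargs["prefetch_factor"] = prefetch_factor
        kwargs["persistent_workers"] = True  # assumption, not from the PR
    return kwargs
```

The returned dictionary would be unpacked into the loader call, for example `DataLoader(dataset, **build_train_loader_kwargs(0))`, so that single-process debugging runs do not fail on worker-only options.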

update training update update update update update Update egs/gigaspeech/ASR/zipformer/train.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Update ctc_decode.py Update ctc_decode.py

coderabbitai · 2026-06-24T10:05:03Z

📝 Walkthrough

Walkthrough

The PR adds on-the-fly fbank preparation for GigaSpeech, updates Zipformer training for bf16 and torchrun launches, changes dataloader defaults, adds a cluster training launcher, and extends CTC decoding with new search modes and result handling.

Changes

GigaSpeech On-the-fly Fbank Preparation

Layer / File(s)	Summary
On-the-fly fbank script `egs/gigaspeech/ASR/local/compute_fbank_gigaspeech.py`	Adds `--on-the-fly` parsing, updates `compute_fbank_gigaspeech(on_the_fly: bool = True)`, and skips feature extraction when the flag is enabled.
Split trimming and prepare.sh wiring `egs/gigaspeech/ASR/local/compute_fbank_gigaspeech_splits.py`, `egs/gigaspeech/ASR/prepare.sh`	Adds parallel trimmed-manifest generation for on-the-fly splits and wires `prepare.sh` to pass `--on-the-fly` and conditionally run musan stages.
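
As a rough illustration of the flag described in the table above, the sketch below parses a boolean `--on-the-fly` option that defaults to true and branches on it. The helpers `str2bool`, `get_parser`, and `plan_outputs` are written here for illustration, and the output labels are descriptive placeholders rather than real file names from the recipe.

```python
import argparse
from typing import List


def str2bool(value: str) -> bool:
    """Parse a textual boolean for argparse (illustrative helper)."""
    v = value.lower()
    if v in ("true", "t", "yes", "y", "1"):
        return True
    if v in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def get_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--on-the-fly",
        type=str2bool,
        default=True,
        help="Skip storing fbank features and only write trimmed cut manifests.",
    )
    return parser


def plan_outputs(on_the_fly: bool) -> List[str]:
    """Describe what a preparation run would write (placeholder labels)."""
    if on_the_fly:
        return ["trimmed cut manifest"]
    return ["cut manifest", "stored fbank features"]
```

With this shape, `get_parser().parse_args([]).on_the_fly` is true by default, and passing `--on-the-fly false` selects the branch that also stores features.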

Zipformer Training, Launch, and Data Loading

Layer / File(s)	Summary
bf16 AMP and torchrun training runtime `egs/gigaspeech/ASR/zipformer/train.py`, `egs/gigaspeech/ASR/zipformer/run_train_cluster.sh`	Adds bf16/autocast support, torchrun-aware distributed setup, local-rank device placement, `remove_short_utt` duration fallback, and a new cluster launcher.
DataLoader defaults and prefetch behavior `egs/gigaspeech/ASR/zipformer/asr_datamodule.py`	Flips on-the-fly and musan defaults, increases worker count, adds `--prefetch-factor`, and updates train/validation loader settings.

CTC Decoding Updates

Layer / File(s)	Summary
CTC decode modes and result processing `egs/gigaspeech/ASR/zipformer/ctc_decode.py`	Adds greedy and prefix-beam CTC decoding, post-processes transcripts, updates result filenames and evaluation cuts, and adjusts graph setup for the new modes.
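
For readers unfamiliar with the greedy mode named above, CTC greedy (best-path) search takes the highest-scoring token at each frame, merges consecutive repeats, and drops the blank symbol. The following is a plain-Python sketch of that general technique, not the implementation used in `ctc_decode.py`. The function name `greedy_ctc_collapse` and the choice of blank id 0 are assumptions.

```python
from typing import List, Optional, Sequence


def greedy_ctc_collapse(
    log_probs: Sequence[Sequence[float]], blank_id: int = 0
) -> List[int]:
    """Best-path CTC decoding for a single utterance.

    log_probs[t][v] is the score of token v at frame t.
    """
    hyp: List[int] = []
    prev: Optional[int] = None
    for frame in log_probs:
        # Index of the highest-scoring token at this frame.
        best = max(range(len(frame)), key=frame.__getitem__)
        # Emit only when the token changes and is not the blank.
        if best != prev and best != blank_id:
            hyp.append(best)
        prev = best
    return hyp
```

A blank between two identical tokens keeps both of them, which is how CTC represents genuinely repeated labels.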

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

csukuangfj

Poem

🐇 I hop through manifests, light on the trail,
on-the-fly features now skip the stored bale.
bf16 sparkles, torchrun sets the pace,
and prefix-beam rabbits dash into place.
With prefetch-factor humming, the training stays spry —
this bunny gives a cheerful ear-flick: “whee!”

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title matches the main goal: modernizing the GigaSpeech recipe.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

egs/gigaspeech/ASR/zipformer/run_train_cluster.sh (1)

20-31: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Preserve caller-provided cluster environment settings.

These hard-coded exports override scheduler/site-tuned NCCL and OMP values, which can break launches on clusters without eth/mlx5 or with different transport settings. Prefer defaults that users can override.

Proposed fix

-export NCCL_IB_TC=136
-export NCCL_IB_SL=5
-export NCCL_IB_GID_INDEX=3
-export NCCL_SOCKET_IFNAME=eth
-export NCCL_DEBUG=WARN
-export NCCL_IB_HCA=mlx5
-export NCCL_IB_TIMEOUT=22
-export NCCL_IB_QPS_PER_CONNECTION=8
-export NCCL_MIN_NCHANNELS=4
-export NCCL_NET_PLUGIN=none
-export OMP_NUM_THREADS=4
-export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+export NCCL_IB_TC=${NCCL_IB_TC:-136}
+export NCCL_IB_SL=${NCCL_IB_SL:-5}
+export NCCL_IB_GID_INDEX=${NCCL_IB_GID_INDEX:-3}
+export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-eth}
+export NCCL_DEBUG=${NCCL_DEBUG:-WARN}
+export NCCL_IB_HCA=${NCCL_IB_HCA:-mlx5}
+export NCCL_IB_TIMEOUT=${NCCL_IB_TIMEOUT:-22}
+export NCCL_IB_QPS_PER_CONNECTION=${NCCL_IB_QPS_PER_CONNECTION:-8}
+export NCCL_MIN_NCHANNELS=${NCCL_MIN_NCHANNELS:-4}
+export NCCL_NET_PLUGIN=${NCCL_NET_PLUGIN:-none}
+export OMP_NUM_THREADS=${OMP_NUM_THREADS:-4}
+export PYTORCH_CUDA_ALLOC_CONF=${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}
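
The same "default only when unset" idea can be sketched in Python, for cases where a launcher is written as a Python wrapper rather than a shell script. This is an illustration only: `apply_env_defaults` is a hypothetical helper, and the two entries in `DEFAULTS` are copied from the diff above as examples.

```python
import os
from typing import Mapping, MutableMapping, Optional

# Example defaults taken from the proposed fix; extend as needed.
DEFAULTS = {"NCCL_DEBUG": "WARN", "OMP_NUM_THREADS": "4"}


def apply_env_defaults(
    defaults: Mapping[str, str],
    env: Optional[MutableMapping[str, str]] = None,
) -> MutableMapping[str, str]:
    """Set each variable only if it is unset or empty.

    This mirrors the shell expansion ${VAR:-default}, which also treats
    an empty value as missing.
    """
    target = os.environ if env is None else env
    for name, value in defaults.items():
        if not target.get(name):
            target[name] = value
    return target
```

Calling `apply_env_defaults(DEFAULTS)` before spawning workers would leave any value already exported by the scheduler or the caller untouched.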

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@egs/gigaspeech/ASR/zipformer/run_train_cluster.sh` around lines 20 - 31,
These hard-coded environment exports in run_train_cluster.sh override caller- or
scheduler-provided NCCL and OMP settings, so update the script to only set safe
defaults when variables are unset and avoid forcing cluster-specific values like
NCCL_SOCKET_IFNAME and NCCL_IB_HCA. Use the existing export block near the top
of the script to switch to conditional assignments or parameterized defaults so
users can override transport, timeout, and threading settings without the script
clobbering them.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@egs/gigaspeech/ASR/zipformer/ctc_decode.py`:
- Around line 386-391: Reuse the CTC prefix beam search process pool across
batches instead of creating it inside each call to ctc_prefix_beam_search().
Thread a shared dataset-scoped pool through decode_dataset() into
decode_one_batch(), and pass it as process_pool when calling
ctc_prefix_beam_search() so the same workers are reused for all batches. Make
sure the pool lifecycle is managed at the dataset level and closed once after
decoding completes.
- Around line 740-741: Honor the --bpe-model option in the SentencePiece loading
path: the current branch in ctc_decode.py hardcodes params.lang_dir /
"bpe.model" inside the bpe_model load logic, which ignores a user-supplied
model. Update the code around the bpe_model / spm.SentencePieceProcessor()
initialization to load from the parsed bpe-model argument when provided, and
only fall back to the default language-directory model when no override is set.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f8a3986d-832c-4ad7-acde-fc5fa03d079b

📥 Commits

Reviewing files that changed from the base of the PR and between c7895d5 and ab2f217.

📒 Files selected for processing (4)

egs/gigaspeech/ASR/zipformer/asr_datamodule.py
egs/gigaspeech/ASR/zipformer/ctc_decode.py
egs/gigaspeech/ASR/zipformer/run_train_cluster.sh
egs/gigaspeech/ASR/zipformer/train.py

🚧 Files skipped from review as they are similar to previous changes (1)

egs/gigaspeech/ASR/zipformer/asr_datamodule.py

coderabbitai · 2026-06-24T10:07:02Z

+    if params.decoding_method == "ctc-prefix-beam-search":
+        token_ids = ctc_prefix_beam_search(
+            ctc_output=ctc_output,
+            encoder_out_lens=encoder_out_lens,
+            beam=params.beam,
+        )


🚀 Performance & Scalability | 🟠 Major | 🏗️ Heavy lift

Reuse the CTC prefix beam process pool across batches.

ctc_prefix_beam_search() creates and tears down a Pool() when process_pool is omitted; calling it once per batch can make prefix decoding spend most of its time forking workers. Thread a dataset-scoped pool through decode_dataset()/decode_one_batch() and close it once.
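
The suggested restructuring can be sketched as follows. This is a self-contained illustration of the pattern, not the icefall code: `decode_one_utt` is a toy stand-in that only drops blank id 0 instead of running a real prefix beam search, and the `pool_factory` parameter is a hypothetical hook added so the pool type can be swapped. The only detail taken from the review is that the pool is passed down as `process_pool`.

```python
from multiprocessing import Pool
from typing import Callable, List, Sequence


def decode_one_utt(frames: Sequence[int]) -> List[int]:
    """Toy stand-in for per-utterance search: drop blank id 0."""
    return [tok for tok in frames if tok != 0]


def decode_one_batch(
    batch: Sequence[Sequence[int]], process_pool
) -> List[List[int]]:
    # The pool is supplied by the caller, so no workers are created here.
    return process_pool.map(decode_one_utt, batch)


def decode_dataset(
    batches: Sequence[Sequence[Sequence[int]]],
    num_workers: int = 2,
    pool_factory: Callable = Pool,
) -> List[List[int]]:
    results: List[List[int]] = []
    # One pool for the whole dataset; the context manager tears it
    # down once, after the last batch has been decoded.
    with pool_factory(num_workers) as pool:
        for batch in batches:
            results.extend(decode_one_batch(batch, process_pool=pool))
    return results
```

The worker start-up cost is then paid once per dataset rather than once per batch, which is the overhead the comment is concerned about.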


coderabbitai · 2026-06-24T10:07:02Z

        bpe_model = spm.SentencePieceProcessor()
        bpe_model.load(str(params.lang_dir / "bpe.model"))


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Honor the --bpe-model argument.

The parser exposes --bpe-model, but this branch always loads params.lang_dir / "bpe.model", so a custom BPE model is silently ignored.

Proposed fix

 bpe_model = spm.SentencePieceProcessor()
-bpe_model.load(str(params.lang_dir / "bpe.model"))
+bpe_model.load(str(params.bpe_model))
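
If a fallback to the language directory is wanted when no override is given, the path selection could look like the sketch below. `resolve_bpe_model_path` is a hypothetical helper, and treating an empty or missing `bpe_model` value as "not provided" is an assumption about how the option is parsed.

```python
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def resolve_bpe_model_path(
    lang_dir: PathLike, bpe_model: Optional[PathLike] = None
) -> Path:
    """Prefer an explicit --bpe-model value, else use <lang_dir>/bpe.model."""
    if bpe_model:
        return Path(bpe_model)
    return Path(lang_dir) / "bpe.model"
```

The result would then be handed to the existing loader, for example `bpe_model.load(str(resolve_bpe_model_path(params.lang_dir, params.bpe_model)))`.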



yfyeung and others added 2 commits June 21, 2026 19:59

fix comment for gigaspeech recipe

4856ae0


support bf16 for gigaspeech recipe

ab2f217

update training update update update update update Update egs/gigaspeech/ASR/zipformer/train.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Update ctc_decode.py Update ctc_decode.py

yfyeung force-pushed the modern_gigaspeech branch from c7895d5 to ab2f217 Compare June 24, 2026 10:01

k2-fsa deleted a comment from coderabbitai Bot Jun 24, 2026

coderabbitai Bot reviewed Jun 24, 2026

yfyeung wants to merge 3 commits into k2-fsa:master from yfyeung:modern_gigaspeech
