Summary
EAGLE3 drafter throws an asynchronous CUDA assertion (indexSelectSmallIndex: srcIndex < srcSelectDimSize) during CUDA graph capture on a decode-only batch when accept_lengths[0] == 0. Root cause is an asymmetry in how the padded_gather_ids_offsets_buf "-1" baseline is compensated: the EXTEND branch adds + num_prefill_tokens (always ≥ 1), the DECODE-ONLY branch adds nothing. With accept_lengths[0] = 0 the resulting indices[0] = -1 → out-of-bounds.
Reproducer
Model: nvidia/Kimi-K2.5-NVFP4. Config attn_dp8_moe_ep8 from test/agentic_benchmark/tokenspeed/configs/attn_dp8_moe_ep8.sh (also reproduces with attn_dp8_moe_tp8; that is, anything with DP > 1, EAGLE3, and CUDA graphs). After all CuteDSL "Loading all available dialects" lines print, the engine fires:
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1515: indexSelectSmallIndex:
Assertion `srcIndex < srcSelectDimSize` failed.
The assertion fires en masse from many threads, the engine subprocess dies, peer ranks then see Gloo "Connection closed by peer", and the whole engine tears down. (Because CUDA reports errors asynchronously, the Python stack at the point of host-side detection is misleading; CUDA_LAUNCH_BLOCKING=1 would localize the failing launch.)
Validated: rerunning the exact same config with no EAGLE3 (omitting all --speculative-* flags) reaches SERVING and serves traffic. The bug is in the drafter, not in any model-side or attention-backend kernel.
Root cause
python/tokenspeed/runtime/execution/drafter/eagle.py:131 initializes the offsets buffer with a "-1" baseline:
# spec_num_tokens is typically 4 in EAGLE3 configs.
self.padded_gather_ids_offsets_buf = (
torch.arange(self.input_buffers.max_bs, dtype=torch.int64, device=self.device)
* spec_num_tokens
- 1
)
# Resulting tensor: [-1, 3, 7, 11, ...]
There are two consumers. The EXTEND branch cancels the "-1" via + num_prefill_tokens (≥ 1 by construction in EXTEND):
# eagle.py:194-198 — EXTEND branch
gather_ids = torch.cat([gather_ids,
self.padded_gather_ids_offsets_buf[:num_decodes]
+ draft_input.accept_lengths[num_extends:]
+ num_prefill_tokens, # cancels the -1
])
The DECODE-ONLY branch (the one CUDA graph capture exercises) does not:
# eagle.py:377-385 — DECODE-only branch
indices = (
self.padded_gather_ids_offsets_buf[:num_decodes]
+ draft_input.accept_lengths[num_extends:]
# no + num_prefill_tokens here
)
torch.index_select(draft_input.base_model_output, 0, indices, ...)
When CUDA graph capture is recording a decode-only dummy batch and accept_lengths[0] happens to be 0 (the verify backend can legitimately return 0 accepted draft tokens; nothing forces the dummy to be ≥ 1), indices[0] = -1. CUDA indexSelectSmallIndex fires the bounds assertion on every output thread.
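The index arithmetic of the two branches can be modeled without a GPU. The following is a minimal pure-Python sketch, not the drafter code: plain lists stand in for the tensors, the helper names are hypothetical, and spec_num_tokens = 4 is assumed from the comment in the excerpt above.

```python
SPEC_NUM_TOKENS = 4  # assumed value, per the "typically 4" comment in the excerpt


def offsets_buf(max_bs, spec_num_tokens=SPEC_NUM_TOKENS):
    """List stand-in for padded_gather_ids_offsets_buf (hypothetical helper)."""
    return [i * spec_num_tokens - 1 for i in range(max_bs)]


def extend_branch_ids(accept_lengths, num_prefill_tokens):
    """Decode-row gather ids as the EXTEND branch computes them."""
    buf = offsets_buf(len(accept_lengths))
    return [b + a + num_prefill_tokens for b, a in zip(buf, accept_lengths)]


def decode_only_ids(accept_lengths):
    """Gather ids as the DECODE-only branch computes them (no extra term)."""
    buf = offsets_buf(len(accept_lengths))
    return [b + a for b, a in zip(buf, accept_lengths)]


print(offsets_buf(4))                 # [-1, 3, 7, 11]
print(extend_branch_ids([0, 2], 5))   # [4, 10]
print(decode_only_ids([0, 2]))        # [-1, 5]
```

With a zero accept length in row 0, the EXTEND formula still yields a non-negative id because num_prefill_tokens is at least 1, while the DECODE-only formula yields -1.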
Workaround (verified)
Disable EAGLE3 by omitting all --speculative-* flags. (--speculative-algorithm none is not accepted — the choice set is {EAGLE3, MTP}.) This bypasses the drafter entirely, at the cost of losing the spec-decode throughput win.
Suggested fix
Pick one of:
- Remove the "-1" baseline. Replace * spec_num_tokens - 1 with just * spec_num_tokens, then in the EXTEND branch drop the + num_prefill_tokens (or keep it if it serves another purpose); nothing else changes in the DECODE-ONLY branch. Most invasive, but eliminates the asymmetry.
- Apply the same + num_prefill_tokens in the DECODE-ONLY branch. Under decode-only num_prefill_tokens is 0, so this is a no-op unless it is floored at 1 to cancel the "-1", i.e. + max(num_prefill_tokens, 1). Smallest diff, but a bit cryptic without a comment.
- Clamp accept_lengths to ≥ 1 in the verify dummy used during CUDA graph capture. This doesn't fix the latent buggy invariant but unblocks the immediate symptom.
The first option is preferred: the "-1" trick has no payoff once it has to be compensated in every consumer anyway.
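To compare the three options side by side, the decode-only indices each would produce can be modeled in pure Python. This is a sketch under assumptions: the function names are hypothetical, lists stand in for tensors, and spec_num_tokens is assumed to be 4.

```python
SPEC = 4  # assumed spec_num_tokens


def current_ids(accept_lengths):
    # Current DECODE-only formula: "-1" baseline, no compensation.
    return [i * SPEC - 1 + a for i, a in enumerate(accept_lengths)]


def option1_ids(accept_lengths):
    # Option 1: drop the "-1" from the baseline.
    return [i * SPEC + a for i, a in enumerate(accept_lengths)]


def option2_ids(accept_lengths, num_prefill_tokens=0):
    # Option 2: keep the "-1" baseline, add max(num_prefill_tokens, 1).
    bump = max(num_prefill_tokens, 1)
    return [i * SPEC - 1 + a + bump for i, a in enumerate(accept_lengths)]


def option3_ids(accept_lengths):
    # Option 3: clamp accept lengths to >= 1, formula unchanged.
    return current_ids([max(a, 1) for a in accept_lengths])


lengths = [0, 2, 1]
print(current_ids(lengths))  # [-1, 5, 8]
print(option1_ids(lengths))  # [0, 6, 9]
print(option2_ids(lengths))  # [0, 6, 9]
print(option3_ids(lengths))  # [0, 5, 8]
```

In this model the first two options give identical decode-only indices and move every row up by one relative to the current formula, including rows whose accept length is already ≥ 1. The third leaves those rows untouched. Which behavior is wanted depends on whether accept_lengths is a count or a zero-based offset, which this report does not settle.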
Why it's been latent
For DP=1 configs (e.g. attn_tp4_moe_tp4, the 4-GPU baseline) the CUDA graph capture sequence appears to never hit accept_lengths[0] == 0 in the decode-only branch — possibly because the dummy verify under DP=1 produces ≥ 1 accepted, or the capture sweep doesn't visit the offending shape. Under DP ≥ 2 the dummy-input path through the drafter differs (see if self.dp_size > 1: at eagle.py:294 setting global_num_tokens = [bs] * world_size) and the zero-accept-length capture iteration becomes reachable.
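To see which capture inputs would actually trip the assertion, the decode-only formula can be swept over a small batch. The sketch below is illustrative only: it assumes the gathered source has bs * spec_num_tokens rows in the decode-only case, that accept lengths range over 0..spec_num_tokens, and that each request owns a contiguous block of spec_num_tokens rows. None of this is confirmed by the code excerpts above.

```python
from itertools import product

SPEC = 4  # assumed spec_num_tokens


def decode_only_ids(accept_lengths):
    # Current DECODE-only formula with the "-1" baseline.
    return [i * SPEC - 1 + a for i, a in enumerate(accept_lengths)]


def classify(accept_lengths):
    """Return 'oob', 'aliased', or 'ok' for one decode-only batch."""
    ids = decode_only_ids(accept_lengths)
    num_rows = len(accept_lengths) * SPEC  # assumed size of the gathered dim
    if any(idx < 0 or idx >= num_rows for idx in ids):
        return "oob"  # would trip the srcIndex bounds assertion
    for req, idx in enumerate(ids):
        # In range overall, but outside this request's own block of rows.
        if not (req * SPEC <= idx < (req + 1) * SPEC):
            return "aliased"
    return "ok"


counts = {"oob": 0, "aliased": 0, "ok": 0}
for lengths in product(range(0, SPEC + 1), repeat=2):  # bs = 2
    counts[classify(lengths)] += 1
print(counts)  # {'oob': 5, 'aliased': 4, 'ok': 16}
```

Under these assumptions only a zero accept length in row 0 goes out of bounds. A zero in a later row stays in range and lands on the previous request's last row, so it would not raise. That is consistent with the failure depending on accept_lengths[0] specifically.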
Environment
- TokenSpeed main at the time of test (latest commit ~a9bc218, code path unchanged).
- Kimi K2.5-NVFP4, NVFP4 weights + fp8 KV cache + EAGLE3 spec-dec.
- 8 × NVIDIA B200, single host.