EAGLE3 drafter: indexSelectSmallIndex OOB during CUDA graph capture when accept_lengths[0]==0 (DP > 1) #298

@yuanqingz

Summary

The EAGLE3 drafter throws an asynchronous CUDA assertion (indexSelectSmallIndex: srcIndex < srcSelectDimSize) during CUDA graph capture on a decode-only batch when accept_lengths[0] == 0. The root cause is an asymmetry in how the "-1" baseline of padded_gather_ids_offsets_buf is compensated: the EXTEND branch adds + num_prefill_tokens (always ≥ 1), while the DECODE-ONLY branch adds nothing. With accept_lengths[0] = 0, the decode-only branch therefore computes indices[0] = -1 + 0 = -1, which is out of bounds.

Reproducer

Model: nvidia/Kimi-K2.5-NVFP4. Config: attn_dp8_moe_ep8 from test/agentic_benchmark/tokenspeed/configs/attn_dp8_moe_ep8.sh. It also reproduces with attn_dp8_moe_tp8; in general, anything combining DP > 1, EAGLE3, and CUDA graphs. After all CuteDSL "Loading all available dialects" lines print, the engine fires:

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1515: indexSelectSmallIndex:
  Assertion `srcIndex < srcSelectDimSize` failed.

The assertion fires en masse from many threads, the engine subprocess dies, peer ranks then see Gloo "Connection closed by peer", and the whole engine tears down. (Because CUDA reports errors asynchronously, the Python stack at the time the host detects the failure is misleading; running with CUDA_LAUNCH_BLOCKING=1 would localize it.)
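As a debugging aid, the failure can also be surfaced at the call site with a host-side bounds check placed just before the gather. The sketch below is a minimal illustration on plain Python sequences; check_gather_indices is a hypothetical helper, not something that exists in TokenSpeed. Applied to GPU tensors, the values would first have to be copied to the host, which forces a synchronization, so a check like this is only suitable for eager debugging runs and not for the captured path.

```python
def check_gather_indices(indices, src_len):
    """Hypothetical debug helper: fail loudly if any index is outside [0, src_len)."""
    bad = [(pos, idx) for pos, idx in enumerate(indices) if not 0 <= idx < src_len]
    if bad:
        raise IndexError(
            f"out-of-range gather indices (position, value): {bad}; src_len={src_len}"
        )
    return indices


# Illustrative values only: position 0 carries the -1 described in the Summary.
try:
    check_gather_indices([-1, 5, 8, 15], src_len=16)
except IndexError as exc:
    print(exc)
```

The error message names the offending position and value, which is the information the asynchronous device-side assertion does not give.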

Validated: rerunning the exact same config with no EAGLE3 (omitting all --speculative-* flags) reaches SERVING and serves traffic. The bug is in the drafter, not in any model-side or attention-backend kernel.

Root cause

python/tokenspeed/runtime/execution/drafter/eagle.py:131 initializes the offsets buffer with a "-1" baseline:

# spec_num_tokens is typically 4 in EAGLE3 configs.
self.padded_gather_ids_offsets_buf = (
    torch.arange(self.input_buffers.max_bs, dtype=torch.int64, device=self.device)
    * spec_num_tokens
    - 1
)
# Resulting tensor: [-1, 3, 7, 11, ...]

There are two consumers. The EXTEND branch compensates for the "-1" by adding num_prefill_tokens, which is ≥ 1 by construction in EXTEND, so the result cannot go negative:

# eagle.py:194-198  — EXTEND branch
gather_ids = torch.cat([gather_ids,
    self.padded_gather_ids_offsets_buf[:num_decodes]
    + draft_input.accept_lengths[num_extends:]
    + num_prefill_tokens,           # cancels the -1
])

The DECODE-ONLY branch (the one CUDA graph capture exercises) does not:

# eagle.py:377-385  — DECODE-only branch
indices = (
    self.padded_gather_ids_offsets_buf[:num_decodes]
    + draft_input.accept_lengths[num_extends:]
    # no + num_prefill_tokens here
)
torch.index_select(draft_input.base_model_output, 0, indices, ...)

When CUDA graph capture records a decode-only dummy batch and accept_lengths[0] happens to be 0 (the verify backend can legitimately return 0 accepted draft tokens; nothing forces the dummy to be ≥ 1), indices[0] = -1 + 0 = -1. The CUDA indexSelectSmallIndex kernel then fires the bounds assertion on every output thread.
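The index arithmetic of the two branches can be checked without a GPU. The following is a minimal sketch in plain Python, with lists standing in for the tensors. The function names and the sample accept_lengths are made up for illustration, and spec_num_tokens = 4 follows the "typically 4" comment above.

```python
SPEC_NUM_TOKENS = 4  # assumption: the "typical" EAGLE3 value quoted above


def make_offsets(max_bs, spec_num_tokens=SPEC_NUM_TOKENS):
    """Mirror of padded_gather_ids_offsets_buf: arange(max_bs) * spec_num_tokens - 1."""
    return [i * spec_num_tokens - 1 for i in range(max_bs)]


def extend_branch_ids(offsets, accept_lengths, num_prefill_tokens):
    """Decode part of the EXTEND-branch gather ids: offset + accept + prefill."""
    return [o + a + num_prefill_tokens for o, a in zip(offsets, accept_lengths)]


def decode_only_ids(offsets, accept_lengths):
    """DECODE-only gather ids: offset + accept, with no further term."""
    return [o + a for o, a in zip(offsets, accept_lengths)]


offsets = make_offsets(max_bs=4)
accept_lengths = [0, 2, 1, 4]  # hypothetical values; request 0 accepted nothing

print(offsets)                                        # [-1, 3, 7, 11]
print(extend_branch_ids(offsets, accept_lengths, 1))  # [0, 6, 9, 16]
print(decode_only_ids(offsets, accept_lengths))       # [-1, 5, 8, 15]
```

Only the decode-only result contains a negative entry, and only in the slot whose accept length is 0.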

Workaround (verified)

Disable EAGLE3 by omitting all --speculative-* flags. (--speculative-algorithm none is not accepted — the choice set is {EAGLE3, MTP}.) This bypasses the drafter entirely, at the cost of losing the spec-decode throughput win.

Suggested fix

Pick one of:

  1. Remove the "-1" baseline. Replace * spec_num_tokens - 1 with just * spec_num_tokens, then drop the + num_prefill_tokens in the EXTEND branch (or keep it if it serves another purpose). The DECODE-ONLY branch needs no further change. Most invasive, but it eliminates the asymmetry.
  2. Add a compensating term in the DECODE-ONLY branch. Adding the same + num_prefill_tokens is not enough on its own: under decode-only num_prefill_tokens is 0, so it would be a no-op. It has to be floored at 1 to cancel the "-1", i.e. + max(num_prefill_tokens, 1). Smallest diff, but a bit cryptic without a comment.
  3. Clamp accept_lengths to ≥ 1 in the verify dummy used during CUDA graph capture. This doesn't fix the latent buggy invariant but unblocks the immediate symptom.

(1) is preferred — the "-1" trick has no payoff once you have to compensate it in every consumer anyway.
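To compare what each option does to the decode-only indices, the three variants can be modelled side by side. This is a sketch under assumptions: it is plain Python rather than the tensor code in eagle.py, the decode_only_ids helper and its parameters are invented for illustration, and the sample accept_lengths are hypothetical.

```python
SPEC_NUM_TOKENS = 4  # assumption: the "typical" EAGLE3 value quoted above


def decode_only_ids(accept_lengths, baseline=-1, extra=0, min_accept=0):
    """Model of the DECODE-only index: i * spec_num_tokens + baseline + accept + extra."""
    ids = []
    for i, accept in enumerate(accept_lengths):
        accept = max(accept, min_accept)  # option 3 clamps here; default is a no-op
        ids.append(i * SPEC_NUM_TOKENS + baseline + accept + extra)
    return ids


accept_lengths = [0, 2, 1, 4]  # hypothetical; request 0 accepted nothing
num_prefill_tokens = 0         # decode-only batch

current = decode_only_ids(accept_lengths)
option1 = decode_only_ids(accept_lengths, baseline=0)
option2 = decode_only_ids(accept_lengths, extra=max(num_prefill_tokens, 1))
option3 = decode_only_ids(accept_lengths, min_accept=1)

print(current)  # [-1, 5, 8, 15]
print(option1)  # [0, 6, 9, 16]
print(option2)  # [0, 6, 9, 16]
print(option3)  # [0, 5, 8, 15]
```

In this model, options 1 and 2 produce identical indices and move every decode-only index up by one relative to the current code, not only the zero-accept slot. Option 3 changes only the slot whose accept length was 0. Whichever option is chosen, the rows with accept length ≥ 1 are worth re-checking against the position the gather is meant to select.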

Why it's been latent

For DP=1 configs (e.g. attn_tp4_moe_tp4, the 4-GPU baseline) the CUDA graph capture sequence appears never to hit accept_lengths[0] == 0 in the decode-only branch, possibly because the dummy verify under DP=1 produces ≥ 1 accepted token, or because the capture sweep doesn't visit the offending shape. Under DP ≥ 2 the dummy-input path through the drafter differs (see the if self.dp_size > 1: branch at eagle.py:294, which sets global_num_tokens = [bs] * world_size) and the zero-accept-length capture iteration becomes reachable.

Environment

  • TokenSpeed main at the time of test (latest commit ~a9bc218, code path unchanged).
  • nvidia/Kimi-K2.5-NVFP4: NVFP4 weights + fp8 KV cache + EAGLE3 spec-dec.
  • 8 × NVIDIA B200, single host.
