EAGLE3 drafter: indexSelectSmallIndex OOB during CUDA graph capture when accept_lengths[0]==0 (DP > 1) #298

## Summary

EAGLE3 drafter throws an asynchronous CUDA assertion (`indexSelectSmallIndex: srcIndex < srcSelectDimSize`) during CUDA graph capture on a decode-only batch when `accept_lengths[0] == 0`. Root cause is an asymmetry in how the `padded_gather_ids_offsets_buf` "-1" baseline is compensated: the EXTEND branch adds `+ num_prefill_tokens` (always ≥ 1), the DECODE-ONLY branch adds nothing. With `accept_lengths[0] = 0` the resulting `indices[0] = -1` → out-of-bounds.

## Reproducer

Model: `nvidia/Kimi-K2.5-NVFP4`. Config `attn_dp8_moe_ep8` from `test/agentic_benchmark/tokenspeed/configs/attn_dp8_moe_ep8.sh` (also reproduces with `attn_dp8_moe_tp8` — anything `DP > 1` + EAGLE3 + CUDA graphs). After all CuteDSL "Loading all available dialects" lines print, the engine fires:

```
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1515: indexSelectSmallIndex:
  Assertion `srcIndex < srcSelectDimSize` failed.
```

en masse from many threads. The engine subprocess dies, peer ranks then see Gloo `Connection closed by peer`, and the whole engine tears down. (CUDA errors are reported asynchronously, so the Python stack at the point where the host detects the failure is misleading; rerunning with `CUDA_LAUNCH_BLOCKING=1` would localize it.)

**Validated**: rerunning the exact same config with no EAGLE3 (omitting all `--speculative-*` flags) reaches SERVING and serves traffic. The bug is in the drafter, not in any model-side or attention-backend kernel.

## Root cause

`python/tokenspeed/runtime/execution/drafter/eagle.py:131` initializes the offsets buffer with a "-1" baseline:

```python
# spec_num_tokens is typically 4 in EAGLE3 configs.
self.padded_gather_ids_offsets_buf = (
    torch.arange(self.input_buffers.max_bs, dtype=torch.int64, device=self.device)
    * spec_num_tokens
    - 1
)
# Resulting tensor: [-1, 3, 7, 11, ...]
```

There are two consumers. The EXTEND branch cancels the "-1" via `+ num_prefill_tokens` (≥ 1 by construction in EXTEND):

```python
# eagle.py:194-198  — EXTEND branch
gather_ids = torch.cat([gather_ids,
    self.padded_gather_ids_offsets_buf[:num_decodes]
    + draft_input.accept_lengths[num_extends:]
    + num_prefill_tokens,           # cancels the -1
])
```

The DECODE-ONLY branch (the one CUDA graph capture exercises) does **not**:

```python
# eagle.py:377-385  — DECODE-only branch
indices = (
    self.padded_gather_ids_offsets_buf[:num_decodes]
    + draft_input.accept_lengths[num_extends:]
    # no + num_prefill_tokens here
)
torch.index_select(draft_input.base_model_output, 0, indices, ...)
```

When CUDA graph capture is recording a decode-only dummy batch and `accept_lengths[0]` happens to be 0 (the verify backend can legitimately return 0 accepted draft tokens; nothing forces the dummy to be ≥ 1), `indices[0] = -1`. CUDA `indexSelectSmallIndex` fires the bounds assertion on every output thread.
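The arithmetic can be checked without a GPU. The following is a minimal sketch in plain Python that models only the index computation described in this section; the helper functions are hypothetical and not part of `eagle.py`, and the row count of `base_model_output` is assumed to be `batch_size * spec_num_tokens`.

```python
# Plain-Python model of the decode-only index arithmetic (no PyTorch needed).
# Helper names are hypothetical; only the formulas come from the issue text.

def make_offsets(max_bs: int, spec_num_tokens: int) -> list:
    """Model of padded_gather_ids_offsets_buf: arange(max_bs) * spec_num_tokens - 1."""
    return [i * spec_num_tokens - 1 for i in range(max_bs)]


def decode_only_indices(offsets: list, accept_lengths: list) -> list:
    """Model of the decode-only branch: offsets[:num_decodes] + accept_lengths."""
    return [off + acc for off, acc in zip(offsets, accept_lengths)]


def out_of_bounds(indices: list, src_rows: int) -> list:
    """Batch positions whose gather index falls outside [0, src_rows)."""
    return [pos for pos, idx in enumerate(indices) if not 0 <= idx < src_rows]


spec_num_tokens = 4
offsets = make_offsets(max_bs=3, spec_num_tokens=spec_num_tokens)
src_rows = 3 * spec_num_tokens  # ASSUMPTION: rows available in base_model_output

ok = decode_only_indices(offsets, [1, 2, 4])   # every request accepted >= 1 token
bad = decode_only_indices(offsets, [0, 2, 4])  # first request accepted 0 tokens

print("offsets:", offsets)
print("ok :", ok, "out of bounds at", out_of_bounds(ok, src_rows))
print("bad:", bad, "out of bounds at", out_of_bounds(bad, src_rows))
```

Only the first batch position can go negative, because it is the only row whose baseline offset is below zero; later rows with a zero accept length would instead read the last row of the previous request.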

## Workaround (verified)

Disable EAGLE3 by omitting **all** `--speculative-*` flags. (`--speculative-algorithm none` is not accepted — the choice set is `{EAGLE3, MTP}`.) This bypasses the drafter entirely, at the cost of losing the spec-decode throughput win.

## Suggested fix

Pick one of:

1. **Remove the "-1" baseline.** Replace `* spec_num_tokens - 1` with just `* spec_num_tokens`, then drop the `+ num_prefill_tokens` in the EXTEND branch (or keep it if it serves another purpose); the DECODE-ONLY branch needs no further change. Most invasive, but it eliminates the asymmetry.
2. **Apply the same `+ num_prefill_tokens` in the DECODE-ONLY branch.** Under decode-only `num_prefill_tokens` is 0, so on its own this is a no-op; it only helps if the term is floored at 1 to cancel the "-1", i.e. `+ max(num_prefill_tokens, 1)`. Smallest diff, but a bit cryptic without a comment.
3. **Clamp `accept_lengths` to ≥ 1 in the verify dummy used during CUDA graph capture.** This doesn't fix the latent buggy invariant, but it unblocks the immediate symptom.

(1) is preferred — the "-1" trick has no payoff once you have to compensate it in every consumer anyway.
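To compare the options before touching the drafter, the decode-only index each one would produce can be tabulated for a small batch. This is a hedged sketch in plain Python: the function names are hypothetical, `SPEC = 4` follows the "typically 4" note above, and only the decode-only branch is modeled.

```python
# Hypothetical comparison of the decode-only gather index under each option.
SPEC = 4  # spec_num_tokens; assumed value taken from the issue's example


def current(i: int, acc: int) -> int:
    """Today's behavior: (i * SPEC - 1) + accept_length."""
    return i * SPEC - 1 + acc


def option1(i: int, acc: int) -> int:
    """Option 1: baseline without the "-1"."""
    return i * SPEC + acc


def option2(i: int, acc: int, num_prefill_tokens: int = 0) -> int:
    """Option 2: keep the "-1" and add max(num_prefill_tokens, 1)."""
    return i * SPEC - 1 + acc + max(num_prefill_tokens, 1)


def option3(i: int, acc: int) -> int:
    """Option 3: keep the "-1" and clamp the accept length to >= 1."""
    return i * SPEC - 1 + max(acc, 1)


accept_lengths = [0, 2, 3]
table = {
    fn.__name__: [fn(i, acc) for i, acc in enumerate(accept_lengths)]
    for fn in (current, option1, option2, option3)
}
for name, indices in table.items():
    print(f"{name:8s} {indices}")
```

In this model all three options remove the negative index. Options 1 and 2 also move every decode-only index up by one relative to today, including rows with a nonzero accept length, while option 3 changes only the zero-accept row. Which of those matches the row the drafter is meant to gather should be confirmed against the actual semantics of `accept_lengths` before picking one.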

## Why it's been latent

For `DP=1` configs (e.g. `attn_tp4_moe_tp4`, the 4-GPU baseline) the CUDA graph capture sequence appears to never hit `accept_lengths[0] == 0` in the decode-only branch — possibly because the dummy verify under DP=1 produces ≥ 1 accepted, or the capture sweep doesn't visit the offending shape. Under `DP ≥ 2` the dummy-input path through the drafter differs (see `if self.dp_size > 1:` at `eagle.py:294` setting `global_num_tokens = [bs] * world_size`) and the zero-accept-length capture iteration becomes reachable.

## Environment

- TokenSpeed `main` at the time of test (latest commit ~`a9bc218`, code path unchanged).
- Kimi K2.5-NVFP4, NVFP4 weights + fp8 KV cache + EAGLE3 spec-dec.
- 8 × NVIDIA B200, single host.

