DeepSeek-V4-Flash startup fails on Hopper: _fp8_act_quant_dequant IndexError on scale.transpose(0, 1) #246

@yuanqingz

Summary

tokenspeed serve deepseek-ai/DeepSeek-V4-Flash fails to start on Hopper
(SM 9.0) at `main` HEAD (2859f54). All 4 TP workers raise the same
IndexError inside the first forward pass during model init, before any
request is served.

Stack trace (verbatim, all 4 ranks identical)

File "/usr/local/lib/python3.12/dist-packages/tokenspeed/runtime/models/deepseek_v4.py", line 3271, in _project_q_kv
    qr_kv = _fp8_linear(
            ^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tokenspeed/runtime/models/deepseek_v4.py", line 238, in _fp8_linear
    _fp8_act_quant_dequant(x, DEEPSEEK_V4_FP8_BLOCK_SIZE)
File "/usr/local/lib/python3.12/dist-packages/tokenspeed/runtime/models/deepseek_v4.py", line 212, in _fp8_act_quant_dequant
    scale = scale.float().transpose(0, 1).contiguous()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Diagnosis

The CUDA fast path in _fp8_act_quant_dequant
(python/tokenspeed/runtime/models/deepseek_v4.py:205-216) calls

quantized, scale = trtllm_fp8_quantize_1x128(
    x_2d,                # shape (M, K), contiguous
    block_size,          # = DEEPSEEK_V4_FP8_BLOCK_SIZE = 128
    use_ue8m0=True,
)
scale = scale.float().transpose(0, 1).contiguous()

where trtllm_fp8_quantize_1x128 is an alias for
tokenspeed_kernel.thirdparty.trtllm.per_token_group_quant_8bit. The
caller assumes scale is at least 2-D, but on this build it comes back
1-D (scale.dim() == 1), so .transpose(0, 1) raises the IndexError above.

Either:

  • the kernel started returning a squeezed 1-D scale and the call site
    hasn't been adjusted, or
  • the kernel always returns 1-D when M == 1 and the call site needs to
    branch on shape.

I haven't determined which of the two it is. Happy to run a targeted
shape probe if useful.
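The shape probe mentioned above could be sketched roughly as follows. This is a pure-Python helper that only looks at shapes recorded from tuple(scale.shape) just before the transpose; it does not call the kernel. The names classify_scale_shape and which_hypothesis are hypothetical, and the candidate layouts (one scale per 1 x 128 group, either (M, K/128) or its transpose) are assumptions about what the call site expects, not confirmed kernel behavior.

```python
import math


def classify_scale_shape(scale_shape, m, k, block_size=128):
    """Label the layout of a returned scale tensor from its shape alone.

    scale_shape: tuple(scale.shape) as observed at the call site.
    m, k: shape of the 2-D activation passed to the quantizer.
    Pick M != K / block_size when probing, otherwise the two 2-D
    layouts have the same shape and cannot be told apart.
    """
    groups = math.ceil(k / block_size)  # assumed: one scale per 1 x block_size group
    shape = tuple(scale_shape)
    if len(shape) == 2:
        if shape == (m, groups):
            return "2d_rows_by_groups"  # (M, K/block)
        if shape == (groups, m):
            return "2d_groups_by_rows"  # (K/block, M)
        return "2d_unexpected"
    if len(shape) == 1:
        return "1d_flat" if shape[0] == m * groups else "1d_unexpected"
    return "unexpected_rank"


def which_hypothesis(observed, k, block_size=128):
    """observed maps M -> tuple(scale.shape) collected from several probe calls."""
    labels = {m: classify_scale_shape(s, m, k, block_size) for m, s in observed.items()}
    one_d = {m for m, label in labels.items() if label.startswith("1d")}
    if not one_d:
        return "no 1-D scale observed"
    if one_d == set(labels):
        return "kernel returns 1-D for every M probed"
    if one_d == {1}:
        return "kernel returns 1-D only when M == 1"
    return "mixed: 1-D for M in " + str(sorted(one_d))
```

Probing with M == 1 and at least one M > 1 is what separates the two explanations listed above; a probe with only M == 1 cannot tell them apart.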

Environment

  • tokenspeed: main @ 2859f54 (also reproduces at c5cc8c4, the merge commit
    of #242, "refactor(deepseek-v4): clean up attention metadata and cache
    helpers"). This fails even with a freshly built image, not a stale layer.
  • tokenspeed-kernel: 0.1.0.dev20260525+git00000000 (built today from
    the lightseekorg/tokenspeed-runner:cu130-torch-2.11.0 base, so the
    kernel matches the same date as the main Python checkout)
  • torch: 2.11.0+cu130
  • HW: NVIDIA H20-3e × 4 (SM 9.0, 141 GB HBM3e)
  • Image base: lightseekorg/tokenspeed-runner:cu130-torch-2.11.0
  • Build: docker build -f docker/Dockerfile -t v4flash-tokenspeed:smoke .
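For anyone comparing against their own setup, the installed versions can be read from inside the container with a small stdlib-only script. This is a minimal sketch; the distribution names passed in ("tokenspeed", "tokenspeed-kernel", "torch") are assumed to match how the packages are installed in the image.

```python
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages):
    """Return {distribution name: installed version, or 'not installed'}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    # Distribution names are assumptions; adjust to what `pip list` shows.
    names = ["tokenspeed", "tokenspeed-kernel", "torch"]
    for name, ver in collect_versions(names).items():
        print(f"{name}: {ver}")
```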

Repro

tokenspeed serve deepseek-ai/DeepSeek-V4-Flash \
    --host 127.0.0.1 --port 8000 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len 16384 \
    --max-num-seqs 16 \
    --max-total-tokens 262144 \
    --chunked-prefill-size 16384 \
    --enable-mixed-batch \
    --gpu-memory-utilization 0.90 \
    --disable-kvstore

The crash happens during weight load or the first forward pass, well
before any client request is served. The failure is the same with or
without --enforce-eager, and with --moe-backend mxfp4. It also reproduces
with a third-party humming_w4a8 MoE backend, but the MoE path is
unrelated: the crash is in the shared _fp8_linear.

Workaround tried

None that's clean. The path is reached during model init via
_project_q_kv so dropping into eager mode doesn't bypass it. The
simulated fallback in the else branch of _fp8_act_quant_dequant
(non-CUDA / non-DEEPSEEK_V4_FP8_BLOCK_SIZE) would work but isn't
reachable from this call site.

Notes

Surfaced this while doing perf validation for PR #238 (vectorize
read_deepseek_v4_indexer_fp8_cache) on current main. Happy to test
fixes on H20-3e.
