Summary
`tokenspeed serve deepseek-ai/DeepSeek-V4-Flash` fails to start on Hopper (SM 9.0) at `main` HEAD (2859f54). All 4 TP workers raise the same `IndexError` inside the first forward pass during model init, before any request is served.
Stack trace (verbatim, all 4 ranks identical)
File "/usr/local/lib/python3.12/dist-packages/tokenspeed/runtime/models/deepseek_v4.py", line 3271, in _project_q_kv
qr_kv = _fp8_linear(
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tokenspeed/runtime/models/deepseek_v4.py", line 238, in _fp8_linear
_fp8_act_quant_dequant(x, DEEPSEEK_V4_FP8_BLOCK_SIZE)
File "/usr/local/lib/python3.12/dist-packages/tokenspeed/runtime/models/deepseek_v4.py", line 212, in _fp8_act_quant_dequant
scale = scale.float().transpose(0, 1).contiguous()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
Diagnosis
The CUDA fast path in `_fp8_act_quant_dequant` (python/tokenspeed/runtime/models/deepseek_v4.py:205-216) calls `trtllm_fp8_quantize_1x128`, which is the alias for `tokenspeed_kernel.thirdparty.trtllm.per_token_group_quant_8bit`, and then transposes the returned scale. The caller assumes `scale` is at least 2-D, but on this build it comes back 1-D (`scale.dim() == 1`), so `.transpose(0, 1)` raises the `IndexError` shown in the stack trace.
Either:
- the kernel started returning a squeezed 1-D scale and the call site hasn't been adjusted, or
- the kernel always returns 1-D when M == 1 and the call site needs to branch on shape.
I haven't bisected which of the two it is — happy to do a targeted
shape probe if useful.
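One way such a shape probe could separate the two explanations is to record `tuple(scale.shape)` right after the quantize call for a few token counts M, then classify the result. The sketch below is only an illustration: `classify_scale_rank` is a hypothetical helper, not part of tokenspeed, and the shapes in the usage lines are made-up placeholders rather than measured values.

```python
def classify_scale_rank(shapes_by_m):
    """Classify probed scale shapes against the two explanations above.

    shapes_by_m maps the token count M to tuple(scale.shape) as observed
    right after the quantize call. Collecting the shapes is left to the
    caller; this helper (a hypothetical name) only interprets them.
    """
    ranks = {m: len(shape) for m, shape in shapes_by_m.items()}

    # Both M == 1 and some M > 1 are needed to tell the cases apart.
    has_m1 = 1 in ranks
    has_larger = any(m > 1 for m in ranks)
    if not (has_m1 and has_larger):
        return "inconclusive"

    one_d = {m for m, rank in ranks.items() if rank == 1}
    if not one_d:
        return "never_1d"    # neither explanation reproduced in this probe
    if one_d == set(ranks):
        return "always_1d"   # consistent with a kernel that now returns 1-D
    if one_d == {1}:
        return "only_m1"     # consistent with squeezing only when M == 1
    return "mixed"           # 1-D for some other subset of M; needs a closer look


# Placeholder shapes, for illustration only.
print(classify_scale_rank({1: (56,), 4: (56, 4)}))
print(classify_scale_rank({1: (56,), 4: (224,)}))
```

An `always_1d` result would point at the first explanation and `only_m1` at the second. Probing only M == 1 returns `inconclusive` by design, because a single 1-D observation fits both.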
Environment
- tokenspeed: `main` @ 2859f54 (also reproduces at c5cc8c4, the merge commit of #242, "refactor(deepseek-v4): clean up attention metadata and cache helpers"). Note this fails even with a freshly built image, not a stale layer.
- tokenspeed-kernel: 0.1.0.dev20260525+git00000000 (built today from the lightseekorg/tokenspeed-runner:cu130-torch-2.11.0 base, so the kernel matches the same date as the `main` Python checkout)
- torch: 2.11.0+cu130
- Base image: lightseekorg/tokenspeed-runner:cu130-torch-2.11.0
- Build: docker build -f docker/Dockerfile -t v4flash-tokenspeed:smoke .
Repro
    tokenspeed serve deepseek-ai/DeepSeek-V4-Flash \
      --host 127.0.0.1 --port 8000 \
      --trust-remote-code \
      --tensor-parallel-size 4 \
      --kv-cache-dtype fp8_e4m3 \
      --max-model-len 16384 \
      --max-num-seqs 16 \
      --max-total-tokens 262144 \
      --chunked-prefill-size 16384 \
      --enable-mixed-batch \
      --gpu-memory-utilization 0.90 \
      --disable-kvstore
Crash happens during weight load / first forward, well before any client request is served. Same failure with or without `--enforce-eager`, and with `--moe-backend mxfp4` (also reproduces with a third-party `humming_w4a8` MoE backend, but that path is unrelated: the crash is in shared `_fp8_linear`).
Workaround tried
None that's clean. The path is reached during model init via `_project_q_kv`, so dropping into eager mode doesn't bypass it. The simulated fallback in the `else` branch of `_fp8_act_quant_dequant` (non-CUDA / non-`DEEPSEEK_V4_FP8_BLOCK_SIZE`) would work but isn't reachable from this call site.
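This is not a workaround, but until the root cause is known a guard at the call site could at least replace the opaque `IndexError` with a message that names the offending shape. A minimal sketch, assuming only that the scale object exposes `dim()` and `shape` the way `torch.Tensor` does; `require_2d_scale` is a hypothetical helper name, not existing tokenspeed code.

```python
def require_2d_scale(scale, num_tokens):
    """Return scale unchanged if it is 2-D, otherwise raise a clear error.

    Only scale.dim() and scale.shape are used, both of which torch.Tensor
    provides. This does not repair the scale; it only fails fast with the
    observed shape so the log shows what the kernel returned.
    """
    if scale.dim() != 2:
        raise ValueError(
            "FP8 activation scale must be 2-D before transpose(0, 1), "
            f"got shape {tuple(scale.shape)} for M={num_tokens}"
        )
    return scale
```

It would be called on the kernel's returned scale just before the `.transpose(0, 1)` line that appears in the stack trace. Reshaping the 1-D scale back to 2-D is deliberately left out, since the correct layout depends on which of the two explanations in the diagnosis holds.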
Notes
Surfaced this while doing perf validation for PR #238 (vectorize `read_deepseek_v4_indexer_fp8_cache`) on current `main`. Happy to test fixes on H20-3e.