DeepSeek-V4-Flash startup fails on Hopper: `_fp8_act_quant_dequant` IndexError on `scale.transpose(0, 1)`

## Summary

`tokenspeed serve deepseek-ai/DeepSeek-V4-Flash` fails to start on Hopper
(SM 9.0) at `main` HEAD (2859f54). All 4 TP workers raise the same
`IndexError` inside the first forward pass during model init, before any
request is served.

## Stack trace (verbatim, all 4 ranks identical)

```
File "/usr/local/lib/python3.12/dist-packages/tokenspeed/runtime/models/deepseek_v4.py", line 3271, in _project_q_kv
    qr_kv = _fp8_linear(
            ^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tokenspeed/runtime/models/deepseek_v4.py", line 238, in _fp8_linear
    _fp8_act_quant_dequant(x, DEEPSEEK_V4_FP8_BLOCK_SIZE)
File "/usr/local/lib/python3.12/dist-packages/tokenspeed/runtime/models/deepseek_v4.py", line 212, in _fp8_act_quant_dequant
    scale = scale.float().transpose(0, 1).contiguous()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
```

## Diagnosis

The CUDA fast path in `_fp8_act_quant_dequant`
(`python/tokenspeed/runtime/models/deepseek_v4.py:205-216`) calls

```python
quantized, scale = trtllm_fp8_quantize_1x128(
    x_2d,                # shape (M, K), contiguous
    block_size,          # = DEEPSEEK_V4_FP8_BLOCK_SIZE = 128
    use_ue8m0=True,
)
scale = scale.float().transpose(0, 1).contiguous()
```

where `trtllm_fp8_quantize_1x128` is the alias
`tokenspeed_kernel.thirdparty.trtllm.per_token_group_quant_8bit`. The
caller assumes `scale` is at least 2-D, but on this build it comes back
1-D (`scale.dim() == 1`), so `.transpose(0, 1)` errors out.

Either:
- the kernel started returning a squeezed 1-D scale and the call site
  hasn't been adjusted, or
- the kernel always returns 1-D when M == 1 and the call site needs to
  branch on shape.

I haven't determined which of the two it is; happy to run a targeted
shape probe if useful.
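
A targeted shape probe could look like the sketch below. `probe_scale` is a hypothetical helper, not part of tokenspeed; it only assumes the usual `torch.Tensor` attributes (`shape`, `dtype`, `stride()`, `is_contiguous()`), so it could be called on `scale` right before the failing `transpose`. The `numel` check assumes one scale per (row, 128-wide group) with no padding, which may not hold for this kernel.

```python
def probe_scale(scale, m, k, block_size=128):
    """Summarize the scale tensor returned by the quantize kernel.

    Hypothetical debugging helper: `scale` is anything exposing the usual
    torch.Tensor attributes (shape, dtype, stride(), is_contiguous()).
    """
    shape = tuple(scale.shape)
    groups = -(-k // block_size)  # ceil(K / block_size) groups per row
    numel = 1
    for d in shape:
        numel *= d
    return {
        "m": m,
        "k": k,
        "groups_per_row": groups,
        "scale_shape": shape,
        "scale_ndim": len(shape),
        "scale_dtype": str(scale.dtype),
        "scale_stride": tuple(scale.stride()),
        "scale_contiguous": bool(scale.is_contiguous()),
        # True when the element count equals M * groups, i.e. the scale is
        # unpadded; a padded or packed layout would report False here.
        "numel_matches_m_x_groups": numel == m * groups,
    }
```

Logging `probe_scale(scale, x_2d.shape[0], x_2d.shape[1])` for a run with M == 1 and a run with M > 1 should be enough to tell the two hypotheses apart.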

## Environment

- tokenspeed: `main` @ 2859f54 (also reproduces at c5cc8c4, the #242
  merge commit). It fails with a freshly built image, so a stale layer
  is not the cause
- tokenspeed-kernel: `0.1.0.dev20260525+git00000000` (built today from
  the `lightseekorg/tokenspeed-runner:cu130-torch-2.11.0` base, so the
  kernel matches the same date as the `main` Python checkout)
- torch: `2.11.0+cu130`
- HW: NVIDIA H20-3e × 4 (SM 9.0, 141 GB HBM3e)
- Image base: `lightseekorg/tokenspeed-runner:cu130-torch-2.11.0`
- Build: `docker build -f docker/Dockerfile -t v4flash-tokenspeed:smoke .`

## Repro

```bash
tokenspeed serve deepseek-ai/DeepSeek-V4-Flash \
    --host 127.0.0.1 --port 8000 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len 16384 \
    --max-num-seqs 16 \
    --max-total-tokens 262144 \
    --chunked-prefill-size 16384 \
    --enable-mixed-batch \
    --gpu-memory-utilization 0.90 \
    --disable-kvstore
```

The crash happens in the first forward pass during model init, well
before any client request is served. It reproduces with or without
`--enforce-eager`, and with `--moe-backend mxfp4`. It also reproduces
with a third-party `humming_w4a8` MoE backend, but the MoE path is
unrelated: the crash is in the shared `_fp8_linear`.

## Workaround tried

None that's clean. The path is reached during model init via
`_project_q_kv` so dropping into eager mode doesn't bypass it. The
simulated fallback in the `else` branch of `_fp8_act_quant_dequant`
(non-CUDA / non-`DEEPSEEK_V4_FP8_BLOCK_SIZE`) would work but isn't
reachable from this call site.
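
This is not a workaround, but a fail-fast check at the call site would at least replace the opaque `IndexError` with a message that records the shape the kernel returned. A minimal sketch, assuming a hypothetical helper named `require_2d_scale` that does not exist in tokenspeed:

```python
def require_2d_scale(scale, m, k, block_size=128):
    """Raise a descriptive error if `scale` is not 2-D.

    Hypothetical helper, not part of tokenspeed. It does not fix the crash;
    it only makes the failure report the shape the kernel returned.
    """
    shape = tuple(scale.shape)
    if len(shape) != 2:
        raise ValueError(
            f"expected a 2-D scale from the FP8 quantize kernel, got shape "
            f"{shape} for input (M={m}, K={k}), block_size={block_size}"
        )
    return scale
```

It would wrap `scale` before `.transpose(0, 1)` and pass 2-D tensors through unchanged.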

## Notes

Surfaced this while doing perf validation for PR #238 (vectorize
`read_deepseek_v4_indexer_fp8_cache`) on current `main`. Happy to test
fixes on H20-3e.
