TP2 vLLM startup requires two kvcached fixes before the server becomes ready #339

## Summary

When kvcached is enabled on the latest main, vLLM startup with `tensor_parallel_size=2` (TP2) can hang before the API server becomes ready.

The process stays alive and model loading proceeds, but the server port never starts listening. Startup completes only after two separate TP startup defects are fixed.

This does not reproduce when running the same TP2 startup configuration without kvcached.

## Environment

- kvcached: latest main
  - base commit: 0d4d581dc5b17ea8f07a83a9f9cf5345ba24478b
- vLLM: 0.18.1
- GPU: NVIDIA GeForce RTX 4090
- OS: Linux
- Model: Qwen3-8B-FP8
- tensor_parallel_size: 2
- kv cache mode: kvcached autopatch enabled

## Reproduction Setup

This was reproduced on a direct TP2 startup path without relying on controller readiness logic.

Environment variables:

```bash
ENABLE_KVCACHED=true
KVCACHED_AUTOPATCH=1
KVCACHED_IPC_NAME=kvcached_hotfix_probe3
NCCL_CUMEM_HOST_ENABLE=0
VLLM_SERVER_DEV_MODE=1
CUDA_VISIBLE_DEVICES=0,1
```

vLLM startup configuration:

```text
model: /root/offload-lab/local-models/Qwen3-8B-FP8
port: 19113
gpu-memory-utilization: 0.35
kv-cache-memory-bytes: 536870912
max-model-len: 4096
kv-cache-dtype: fp8_e4m3
enable-sleep-mode: true
enable-prefix-caching: false
disable-log-stats: true
tensor-parallel-size: 2
max-num-seqs: 64
```

## Actual Behavior

With kvcached enabled, the TP2 process stays alive but `/health` remains unreachable until both of the following defects are fixed:

1. The coordinator path can initialize kvcached with `world_size=1` even though EngineCore already knows `tensor_parallel_size=2`.
2. After correcting that world size, TP startup can still hang when the background prealloc thread starts too early and races the first null-block allocation on the multi-process map path.

Observed progression during debugging:

- bare TP2 without kvcached starts and `/health` returns 200
- TP2 + kvcached on latest main hangs before port listen
- fixing the coordinator world size moves startup past the earlier `world_size=1` allocator setup point but does not yet make the server ready
- fixing the prealloc timing on top of that makes `/health` return 200 and the server starts normally
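
The "process alive but port not listening" state in the progression above can be told apart from "listening but unhealthy" with a small probe. The following is a minimal sketch using only the Python standard library; the function names are illustrative and are not part of vLLM or kvcached. Only the port `19113` and the `/health` path come from this report.

```python
import socket
import urllib.error
import urllib.request
from typing import Optional


def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def health_status(host: str, port: int, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status of GET /health, or None if the request fails."""
    url = f"http://{host}:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a non-2xx status.
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def classify_startup(host: str = "127.0.0.1", port: int = 19113) -> str:
    """Classify the server as not-listening, ready, or listening-but-unhealthy."""
    if not port_is_listening(host, port):
        return "not-listening"
    status = health_status(host, port)
    if status == 200:
        return "ready"
    return f"listening-but-unhealthy({status})"
```

In the hang described here, a probe like this would keep reporting `not-listening` while the vLLM process itself is still running.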

## Root Cause Breakdown

### Root cause 1: coordinator reads TP world size too early

The EngineCore patch records the correct TP size, but the `KVCacheCoordinator` path can still query vLLM parallel state at a point where it observes `1`.

That causes kvcached to initialize its coordinator-side KVCacheManager with the wrong world size even though the run is actually TP2.
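
One way to picture the ordering problem is a resolver that prefers the TP size already recorded at engine setup over a live parallel-state query that may still report `1`. This is a hypothetical sketch, not the code in PR 1: `resolve_tp_world_size`, `recorded_tp_size`, and `query_parallel_state` are names invented for illustration and do not correspond to actual kvcached or vLLM functions.

```python
from typing import Callable, Optional


def resolve_tp_world_size(
    recorded_tp_size: Optional[int],
    query_parallel_state: Callable[[], int],
) -> int:
    """Pick the world size for coordinator-side initialization.

    Hypothetical helper: trust the size recorded up front (for example by
    an EngineCore-level patch) and only fall back to the live query, which
    can observe 1 if it runs before parallel state is fully set up.
    """
    if recorded_tp_size is not None and recorded_tp_size > 0:
        return recorded_tp_size
    return query_parallel_state()


# A too-early query reports 1, but the recorded TP size wins.
too_early_query = lambda: 1
world_size = resolve_tp_world_size(2, too_early_query)  # 2
```

The design choice in this sketch is simply to make the recorded configuration the source of truth, so the result does not depend on when the coordinator happens to run.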

### Root cause 2: prealloc startup races the first TP null-block allocation

Once the coordinator world size is corrected, startup can still stall in the first real `KVCacheManager.alloc(1)` used by vLLM's null block.

The verified workaround is to defer starting the background prealloc thread until after that first alloc completes in the TP multi-process path. With that change in place, the same TP2 startup reaches:

- `Starting vLLM server on http://0.0.0.0:19113`
- `Application startup complete`
- `GET /health` -> 200
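
The deferral described above can be sketched with a toy manager that creates its background thread only once the first `alloc` has done its work. This is an assumption-laden illustration, not kvcached's `KVCacheManager`: the class `DeferredPreallocManager`, its `events` log, and the placeholder prealloc body are all hypothetical, and the real multi-process map path is not modeled.

```python
import threading
from typing import List, Optional


class DeferredPreallocManager:
    """Toy stand-in for a KV cache manager (not kvcached's real class)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._next_block = 0
        # The thread is NOT started here; startup is deferred to alloc().
        self._prealloc_thread: Optional[threading.Thread] = None
        self.events: List[str] = []

    def _prealloc_loop(self) -> None:
        # Placeholder for background reservation work; it needs the same
        # lock as alloc(), so it cannot run until alloc() releases it.
        with self._lock:
            self.events.append("prealloc-started")

    def alloc(self, n: int) -> List[int]:
        """Hand out n block ids and start prealloc after the first call."""
        with self._lock:
            blocks = list(range(self._next_block, self._next_block + n))
            self._next_block += n
            self.events.append(f"alloc({n})")
            if self._prealloc_thread is None:
                # First alloc has completed its bookkeeping: safe to start.
                self._prealloc_thread = threading.Thread(
                    target=self._prealloc_loop, daemon=True
                )
                self._prealloc_thread.start()
        return blocks

    def join_prealloc(self) -> None:
        """Wait for the background thread, if it was ever started."""
        if self._prealloc_thread is not None:
            self._prealloc_thread.join()


manager = DeferredPreallocManager()
null_block = manager.alloc(1)  # stands in for vLLM's null-block allocation
manager.join_prealloc()
```

Because the thread is created inside the first `alloc` call and must acquire the same lock, the background work in this sketch always runs after that allocation's bookkeeping rather than racing it.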

## Expected Behavior

TP2 startup with kvcached enabled should complete normally and begin serving `/health` and `/v1/models` without requiring any local workaround.

## Important Comparison

Running the same TP2 startup configuration without kvcached succeeds.

So this does not look like:

- a generic TP2 vLLM startup failure
- insufficient GPU memory
- controller-specific readiness logic
- unsupported TP usage being forced from outside

It is a kvcached TP startup-path issue.

## Validation Status

The problem has been reduced to two focused fixes:

- PR 1: coordinator world size fix
- PR 2: deferred prealloc startup for the TP multi-process null-block path

Applying both fixes together was validated on the direct TP2 reproduction above: the server started and `GET /health` returned 200.

## Related

- #334 documents the earlier single-GPU startup hang class
- #336 fixes the single-GPU prealloc-thread callback deadlock

