fix: reuse recorded tp world size in coordinator by shipiyouniao · Pull Request #340 · ovg-project/kvcached

shipiyouniao · 2026-05-21T10:00:41Z

Summary

Fix the coordinator startup path for TP2 (tensor-parallel size 2) runs by reusing the TP world size already recorded during EngineCore initialization instead of querying vLLM parallel state too early.

Related issue: #339

Root cause

In TP startup, EngineCore already knows the correct tensor_parallel_size, and kvcached records that value in interfaces._world_size.

However, KVCacheCoordinator was still calling get_tensor_model_parallel_world_size() at a point in startup where vLLM could still report 1.

That caused the coordinator-side KVCacheManager to be initialized with world_size=1 during a TP2 run, which is incorrect for the worker IPC / allocator setup that follows.
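The ordering problem described above can be illustrated with a small simulation. This is only a sketch: `FakeParallelState`, `init_manager`, and the "report 1 until initialized" behavior are hypothetical stand-ins for illustration, not vLLM or kvcached code.

```python
class FakeParallelState:
    """Hypothetical stand-in for a parallel-state module.

    Assumption for illustration: it reports 1 until the TP group is set up.
    """

    def __init__(self):
        self._tp_size = None  # TP group not initialized yet

    def initialize(self, tp_size):
        self._tp_size = tp_size

    def get_world_size(self):
        return 1 if self._tp_size is None else self._tp_size


def init_manager(world_size):
    # Hypothetical stand-in for constructing the coordinator-side manager.
    return {"world_size": world_size}


state = FakeParallelState()
configured_tp_size = 2

# Querying before the TP group exists yields the wrong value.
early_manager = init_manager(state.get_world_size())

# The same query after initialization yields the configured value.
state.initialize(configured_tp_size)
late_manager = init_manager(state.get_world_size())

print(early_manager, late_manager)
```

The early query is the failure mode: the manager is built with a value that only becomes correct later in startup.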

Changes

Stop querying TP world size from vLLM parallel_state in KVCacheCoordinatorPatch
Reuse kvcached.integration.vllm.interfaces._world_size, which was already recorded during EngineCore init
Add a regression test covering the timing case where parallel_state still reports 1 but the recorded EngineCore TP size is 2

Why this approach

The coordinator should not rediscover TP size from a startup-time API that is not yet stable for this path.

The EngineCore patch already computes and records the correct value earlier in the same startup sequence, so reusing that value is both narrower and more reliable.
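A minimal sketch of the "record once, reuse later" pattern follows. The `interfaces` namespace here is a local stand-in for `kvcached.integration.vllm.interfaces`; the helper names, the `None` initial value, and the validation step are assumptions made for illustration and are not taken from the PR.

```python
import types

# Local stand-in for kvcached.integration.vllm.interfaces (assumed shape).
interfaces = types.SimpleNamespace(_world_size=None)


def record_engine_core_world_size(tensor_parallel_size):
    # Hypothetical helper: what the EngineCore-side patch is described as doing.
    interfaces._world_size = tensor_parallel_size


def coordinator_world_size():
    # Hypothetical helper: read the recorded value instead of asking
    # parallel state. The validation below is an assumption, added so a
    # missing record fails loudly rather than silently defaulting.
    world_size = interfaces._world_size
    if not isinstance(world_size, int) or world_size < 1:
        raise RuntimeError(
            f"TP world size was not recorded before coordinator init: {world_size!r}"
        )
    return world_size


record_engine_core_world_size(2)
resolved = coordinator_world_size()
print(resolved)
```

Failing loudly when nothing has been recorded is one possible design choice; whether the actual patch falls back, raises, or does something else is not stated in this description.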

Validation

Added unit test: python -m pytest tests/test_tp_world_size_patch.py -q
Observed in remote TP2 reproduction that the allocator setup moved from world_size=1 to world_size=2 after this fix
This fix alone moved startup past the earlier incorrect-world-size point and exposed the next TP startup issue
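The timing case covered by the regression test might be expressed roughly as below. This is a sketch rather than the contents of tests/test_tp_world_size_patch.py: `patched_coordinator_init` is hypothetical, and `init_kvcached` is replaced by a mock whose only assumed parameter is `world_size`.

```python
import types
from unittest import mock


def test_coordinator_uses_recorded_world_size():
    # EngineCore has already recorded TP size 2 (stand-in namespace).
    interfaces = types.SimpleNamespace(_world_size=2)
    # Parallel state still reports 1 at this point in startup.
    parallel_state = types.SimpleNamespace(
        get_tensor_model_parallel_world_size=mock.Mock(return_value=1)
    )
    init_kvcached = mock.Mock()

    def patched_coordinator_init():
        # Hypothetical patched path: reuse the recorded value only.
        init_kvcached(world_size=interfaces._world_size)

    patched_coordinator_init()

    init_kvcached.assert_called_once_with(world_size=2)
    parallel_state.get_tensor_model_parallel_world_size.assert_not_called()
    return init_kvcached


recorded_call = test_coordinator_uses_recorded_world_size()
```

Asserting that the parallel-state getter is never called guards against a later change quietly reintroducing the early query.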

Notes

This is intentionally split from the prealloc timing fix so the two TP startup defects can be reviewed independently.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR updates the vLLM integration patch so KVCacheCoordinator uses the tensor-parallel (TP) world size captured during EngineCore initialization, and adds a regression test to verify the behavior.

Changes:

Switch TP world size detection from vllm.distributed.parallel_state to kvcached.integration.vllm.interfaces._world_size.
Add a new unit test that verifies init_kvcached(world_size=...) uses the EngineCore-recorded world size.
Add supporting module stubs/mocks for torch, vllm, and kvcached integration imports in the new test.
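One common way to provide such stubs is to place lightweight module objects in `sys.modules` for the duration of a test. The sketch below shows that general technique only; which attributes the real test stubs out is an assumption, and `make_stub` is a hypothetical helper.

```python
import sys
import types
from unittest import mock


def make_stub(name, **attrs):
    """Build an empty module object carrying only the given attributes."""
    module = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(module, key, value)
    return module


stubs = {
    "torch": make_stub("torch"),
    "vllm": make_stub("vllm"),
    "vllm.distributed": make_stub("vllm.distributed"),
    "vllm.distributed.parallel_state": make_stub(
        "vllm.distributed.parallel_state",
        # Simulate the early-startup observation from the PR description.
        get_tensor_model_parallel_world_size=lambda: 1,
    ),
}

# patch.dict restores the original sys.modules entries on exit.
with mock.patch.dict(sys.modules, stubs):
    from vllm.distributed.parallel_state import (
        get_tensor_model_parallel_world_size,
    )

    reported = get_tensor_model_parallel_world_size()

print(reported)
```

Scoping the stubs with `mock.patch.dict` keeps them from leaking into other tests that may import the real packages.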

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/test_tp_world_size_patch.py | Adds a regression test ensuring coordinator init uses the EngineCore-recorded TP world size. |
| kvcached/integration/vllm/patches.py | Uses `interfaces._world_size` for TP size during coordinator setup to avoid early-startup `world_size=1` observations. |

fix: reuse recorded tp world size in coordinator

9690829

Copilot AI review requested due to automatic review settings May 21, 2026 10:00

Copilot AI reviewed May 21, 2026

Comment thread tests/test_tp_world_size_patch.py Outdated

Comment thread tests/test_tp_world_size_patch.py Outdated

Comment thread kvcached/integration/vllm/patches.py Outdated

Comment thread kvcached/integration/vllm/patches.py Outdated

fix: harden coordinator world size follow-up

5fb92e3

shipiyouniao wants to merge 2 commits into ovg-project:main from shipiyouniao:fix/tp2-kvcached-world-size