
fix: reuse recorded tp world size in coordinator#340

Open
shipiyouniao wants to merge 2 commits into ovg-project:main from shipiyouniao:fix/tp2-kvcached-world-size


@shipiyouniao

Summary

Fix the TP2 coordinator startup path by reusing the TP world size already recorded during EngineCore initialization instead of querying vLLM parallel state too early.

Related issue: #339

Root cause

In TP startup, EngineCore already knows the correct tensor_parallel_size, and kvcached records that value in interfaces._world_size.

But KVCacheCoordinator was still calling get_tensor_model_parallel_world_size() at a point where vLLM could still report 1.

That caused the coordinator-side KVCacheManager to be initialized with world_size=1 during a TP2 run, which is incorrect for the worker IPC / allocator setup that follows.
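The timing problem can be illustrated with a small simulation. This is only a sketch: `FakeParallelState` and `coordinator_world_size_early` are hypothetical stand-ins, not vLLM or kvcached code, and the "report 1 until initialized" fallback is an assumption made to mirror the behavior described above.

```python
# Hypothetical stand-ins; none of these are the real vLLM or kvcached objects.


class FakeParallelState:
    """Mimics a parallel-state API that reports 1 until TP setup has run."""

    def __init__(self):
        self._tp_world_size = None  # TP groups not initialized yet

    def initialize(self, tp_size):
        self._tp_world_size = tp_size

    def get_tensor_model_parallel_world_size(self):
        # Assumed behavior: fall back to 1 when queried before initialization.
        return 1 if self._tp_world_size is None else self._tp_world_size


def coordinator_world_size_early(parallel_state):
    """Old path: ask the parallel state at coordinator construction time."""
    return parallel_state.get_tensor_model_parallel_world_size()


state = FakeParallelState()
early = coordinator_world_size_early(state)  # queried before TP setup
state.initialize(tp_size=2)
late = coordinator_world_size_early(state)   # same query, after TP setup

print(early, late)
```

The same call gives different answers depending on when it runs, which is why a value captured once at a known-good point is safer than a fresh query.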

Changes

  • Stop querying TP world size from vLLM parallel_state in KVCacheCoordinatorPatch
  • Reuse kvcached.integration.vllm.interfaces._world_size, which was already recorded during EngineCore init
  • Add a regression test covering the timing case where parallel_state still reports 1 but the recorded EngineCore TP size is 2
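The shape of the change and its regression test might look roughly like the sketch below. It is not the code in this PR: the `interfaces` class stands in for the `kvcached.integration.vllm.interfaces` module, and `record_engine_core_world_size`, `coordinator_init`, and `stale_parallel_state_query` are hypothetical names introduced for illustration.

```python
from unittest import mock


class interfaces:
    """Stand-in for the module that holds the recorded world size."""

    _world_size = 1


# Stand-in for the early parallel-state query, which still reports 1.
stale_parallel_state_query = mock.Mock(return_value=1)


def record_engine_core_world_size(tensor_parallel_size):
    """Hypothetical EngineCore-side hook: record the TP size once."""
    interfaces._world_size = tensor_parallel_size


def coordinator_init(init_kvcached):
    """Patched path: reuse the recorded value instead of re-querying."""
    init_kvcached(world_size=interfaces._world_size)


def test_coordinator_uses_recorded_world_size():
    record_engine_core_world_size(2)   # EngineCore init ran with TP=2
    init_kvcached = mock.Mock()

    coordinator_init(init_kvcached)

    # The coordinator must pass the recorded size, not the stale query result.
    init_kvcached.assert_called_once_with(world_size=2)
    stale_parallel_state_query.assert_not_called()


test_coordinator_uses_recorded_world_size()
```

Asserting that the stale query is never called pins down the intent of the fix: the coordinator no longer depends on when the parallel state becomes valid.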

Why this approach

The coordinator should not rediscover the TP size from an API whose return value is not yet stable at this point in startup.

The EngineCore patch already computes and records the correct value earlier in the same startup sequence, so reusing that value is both narrower and more reliable.

Validation

  • Added unit test: python -m pytest tests/test_tp_world_size_patch.py -q
  • Observed in remote TP2 reproduction that the allocator setup moved from world_size=1 to world_size=2 after this fix
  • This fix alone moved startup past the earlier incorrect-world-size point and exposed the next TP startup issue

Notes

This is intentionally split from the prealloc timing fix so the two TP startup defects can be reviewed independently.

Copilot AI review requested due to automatic review settings May 21, 2026 10:00
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR updates the vLLM integration patch so KVCacheCoordinator uses the tensor-parallel (TP) world size captured during EngineCore initialization, and adds a regression test to verify the behavior.

Changes:

  • Switch TP world size detection from vllm.distributed.parallel_state to kvcached.integration.vllm.interfaces._world_size.
  • Add a new unit test that verifies init_kvcached(world_size=...) uses the EngineCore-recorded world size.
  • Add supporting module stubs/mocks for torch, vllm, and kvcached integration imports in the new test.
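One common way to provide such stubs is to place placeholder modules in `sys.modules` for the duration of a test. The sketch below shows the general technique and is not the PR's test file: `make_stub_modules` is a hypothetical helper, and only the module path and function name are taken from the description above.

```python
import importlib
import sys
import types
from unittest import mock


def make_stub_modules(reported_tp_size):
    """Build placeholder modules so imports succeed without the real package."""
    vllm = types.ModuleType("vllm")
    distributed = types.ModuleType("vllm.distributed")
    parallel_state = types.ModuleType("vllm.distributed.parallel_state")

    # The stub always reports the size the test wants to simulate.
    parallel_state.get_tensor_model_parallel_world_size = lambda: reported_tp_size

    # Wire up attribute access so `vllm.distributed.parallel_state` resolves.
    vllm.distributed = distributed
    distributed.parallel_state = parallel_state

    return {
        "vllm": vllm,
        "vllm.distributed": distributed,
        "vllm.distributed.parallel_state": parallel_state,
    }


# patch.dict restores sys.modules on exit, so the stubs do not leak.
with mock.patch.dict(sys.modules, make_stub_modules(reported_tp_size=1)):
    ps = importlib.import_module("vllm.distributed.parallel_state")
    stubbed_value = ps.get_tensor_model_parallel_world_size()
```

Scoping the stubs with `mock.patch.dict` keeps them from affecting other tests that may import the real packages.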

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/test_tp_world_size_patch.py | Adds a regression test ensuring coordinator init uses the EngineCore-recorded TP world size. |
| kvcached/integration/vllm/patches.py | Uses interfaces._world_size for TP size during coordinator setup to avoid early-startup world_size=1 observations. |


Comment thread tests/test_tp_world_size_patch.py Outdated
Comment thread tests/test_tp_world_size_patch.py Outdated
Comment thread kvcached/integration/vllm/patches.py Outdated
Comment thread kvcached/integration/vllm/patches.py Outdated


2 participants