
upgrade vllm to 0.18.1 #112

Merged

ceci3 merged 9 commits into flagos-ai:main from ceci3:upgrade_plugin on Apr 6, 2026
Conversation

ceci3 (Collaborator) commented Mar 31, 2026

PR Category

Core

PR Type

Improvements

Description

upgrade vllm to 0.18.1

Related Issues

Changes

Testing

Checklist

  • I have run the existing tests and they pass
  • I have added tests for my changes (if applicable)
  • I have updated the documentation (if applicable)



```python
# Re-export CUDAGraphStat for compatibility
from vllm.compilation.cuda_graph import CUDAGraphStat  # noqa: F401, E402
```

Check notice (Code scanning / CodeQL): Unused import (Note). Import of 'CUDAGraphStat' is not used.
```python
try:
    from vllm.model_executor.offloader.base import get_offloader
    get_offloader().sync_prev_onload()
except (ImportError, RuntimeError):
    pass
```

Check notice (Code scanning / CodeQL): Empty except (Note). 'except' clause does nothing but pass and there is no explanatory comment.
```python
try:
    from vllm.model_executor.offloader.base import get_offloader
    get_offloader().join_after_forward()
except (ImportError, RuntimeError):
    pass
```

Check notice (Code scanning / CodeQL): Empty except (Note). 'except' clause does nothing but pass and there is no explanatory comment.
```python
try:
    from vllm.model_executor.offloader.base import get_offloader
    get_offloader().sync_prev_onload()
except (ImportError, RuntimeError):
    pass
```

Check notice (Code scanning / CodeQL): Empty except (Note). 'except' clause does nothing but pass and there is no explanatory comment.
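One way to satisfy the empty-except lint above is `contextlib.suppress`, which states the intent directly instead of relying on a bare `pass`. A minimal sketch, assuming the same vLLM offloader API shown in the flagged snippets (the import will simply be suppressed on builds where it does not exist):

```python
# Sketch: expressing the flagged try/except pattern with contextlib.suppress,
# so no empty `except` body is needed and the intent is self-documenting.
import contextlib

def sync_offloader_if_available():
    # vllm.model_executor.offloader.base may be absent on older vLLM builds,
    # and get_offloader() can raise RuntimeError when no offloader is set up;
    # both cases are deliberately treated as "nothing to sync".
    with contextlib.suppress(ImportError, RuntimeError):
        from vllm.model_executor.offloader.base import get_offloader
        get_offloader().sync_prev_onload()

sync_offloader_if_available()  # no-op when vLLM (or the offloader) is absent
```

Alternatively, keeping `pass` with an explanatory comment (as the CodeQL message hints) also clears the notice.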
CLAassistant commented Mar 31, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
4 out of 5 committers have signed the CLA.

✅ ceci3
✅ keennddyl
✅ xmhubj
✅ li199959
❌ cyberpioneer


cyberpioneer seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.


```python
def supports_fp8(cls) -> bool:
    if cls.vendor_name == "nvidia":
        return True
```

Check notice (Code scanning / CodeQL): First parameter of a method is not named 'self' (Note). Normal methods should have 'self', rather than 'cls', as their first parameter.
Copilot AI left a comment (Contributor)
Pull request overview

This PR updates the vllm-plugin-FL integration to be compatible with vLLM 0.18.x by migrating to the new vllm.v1.* attention APIs, adjusting worker/model loading flows, and removing several FL model/config implementations that are now upstream in vLLM.

Changes:

  • Update platform/worker/dispatch codepaths to match vLLM 0.18.x APIs (notably vllm.v1.attention.*, config scoping, profiling/compile return values).
  • Refresh fused MoE and vendor backend integrations to new module locations and kernel entrypoints.
  • Remove FL-local model/config files now provided upstream; update docs and dispatch configs accordingly.

Reviewed changes

Copilot reviewed 26 out of 40 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| vllm_fl/worker/worker.py | Align worker APIs with vLLM 0.18.x (config scoping, load_model signature, new helper methods). |
| vllm_fl/utils.py | Whitespace cleanup. |
| vllm_fl/platform.py | Migrate to vllm.v1.attention.* APIs; adjust device metadata and platform hooks. |
| vllm_fl/patches/glm_moe_dsa.py | Remove now-unneeded MLA recognition patch; keep remaining GLM-5 patches. |
| vllm_fl/ops/fused_moe/layer.py | Update fused-moe router imports and forward_native signature for vLLM 0.18.x. |
| vllm_fl/ops/fused_moe/fused_moe.py | Update fused-moe kernel dispatch API; normalize activation enum handling; keep chunking behavior. |
| vllm_fl/ops/custom_ops.py | Formatting/indent alignment. |
| vllm_fl/models/qwen3_next.py | Removed (now upstream in vLLM 0.18.1 per PR notes). |
| vllm_fl/models/qwen3_5.py | Removed (now upstream in vLLM 0.18.1 per PR notes). |
| vllm_fl/models/minicpmo.py | Removed (now upstream in vLLM 0.18.1 per PR notes). |
| vllm_fl/models/kimi_k25.py | Removed (now upstream in vLLM 0.18.1 per PR notes). |
| vllm_fl/models/glm_moe_dsa.py | Removed (wrapper no longer registered in plugin). |
| vllm_fl/dispatch/logger_manager.py | Change log format to include logger name instead of filename/lineno. |
| vllm_fl/dispatch/config/utils.py | Minor whitespace/format cleanup. |
| vllm_fl/dispatch/config/nvidia.yaml | Enable allow_vendors: [cuda] in NVIDIA platform config. |
| vllm_fl/dispatch/config/init.py | Import ordering tweak. |
| vllm_fl/dispatch/builtin_ops.py | Adjust vendor backend discovery/registration (plus new loop added). |
| vllm_fl/dispatch/backends/vendor/metax/metax.py | Switch registry imports to vllm.v1.attention.backends.registry. |
| vllm_fl/dispatch/backends/vendor/metax/impl/attention/utils/fa_utils.py | Replace upstream logger import with a local logger instantiation. |
| vllm_fl/dispatch/backends/vendor/metax/impl/attention/ops/merge_attn_states.py | Update import path to vllm.v1.attention.ops.triton_merge_attn_states. |
| vllm_fl/dispatch/backends/vendor/metax/impl/attention/mla/flashmla.py | Migrate to vLLM v1 attention backend APIs and batch-invariant flag usage. |
| vllm_fl/dispatch/backends/vendor/metax/impl/attention/mla/common.py | Migrate to vLLM v1 attention backend APIs and logging import changes. |
| vllm_fl/dispatch/backends/vendor/metax/impl/attention/flash_attn.py | Migrate to vLLM v1 attention backend APIs and batch-invariant flag usage. |
| vllm_fl/dispatch/backends/vendor/iluvatar/iluvatar.py | Update registry import to vllm.v1.attention.backends.registry. |
| vllm_fl/dispatch/backends/vendor/cuda/impl/activation.py | Call vLLM C++ ops via torch.ops._C.* instead of vllm._custom_ops. |
| vllm_fl/dispatch/backends/vendor/cuda/cuda.py | Update registry import to vllm.v1.attention.backends.registry. |
| vllm_fl/dispatch/backends/vendor/ascend/impl/mm_encoder_attention.py | Update MM encoder attention import path to vllm.model_executor.layers.attention.*. |
| vllm_fl/dispatch/backends/vendor/ascend/impl/causal_conv1d.py | Update PAD_SLOT_ID import path to vllm.v1.attention.backends.utils. |
| vllm_fl/dispatch/backends/vendor/ascend/impl/attention.py | Migrate ascend attention backend to vLLM v1 attention backend APIs. |
| vllm_fl/dispatch/backends/reference/reference.py | Update registry import to vllm.v1.attention.backends.registry. |
| vllm_fl/dispatch/backends/flaggems/impl/mla.py | Update MLA/common imports to vLLM v1 locations. |
| vllm_fl/dispatch/backends/flaggems/impl/custom_attention.py | Update registry import to vllm.v1.attention.backends.registry. |
| vllm_fl/dispatch/backends/flaggems/impl/attention.py | Migrate FlagGems attention backend to vLLM v1 attention backend APIs. |
| vllm_fl/dispatch/backends/flaggems/flaggems.py | Update registry import to vllm.v1.attention.backends.registry. |
| vllm_fl/configs/qwen3_5_moe.py | Removed (now upstream in vLLM 0.18.1 per PR notes). |
| vllm_fl/compilation/graph.py | Introduce platform-agnostic graph capture wrapper; add weak-set tracking and offloader sync hooks. |
| vllm_fl/attention/utils.py | Update MM encoder attention patching imports to new vLLM locations. |
| vllm_fl/init.py | Stop registering models/configs that are upstream; keep GLM-5 config registration + patches. |
| README.md | Update install docs to vLLM v0.18.1 and pin FlagGems checkout. |


Comment on lines +77 to +79

```python
for vendor_name in os.listdir(_VENDOR_BACKENDS_DIR):
    vendor_path = os.path.join(_VENDOR_BACKENDS_DIR, vendor_name)
```

Copilot AI commented Mar 31, 2026:

The loop over vendor backends (vendor_name/vendor_path) has no side effects and is immediately followed by a set comprehension that re-lists the same directory. This is dead code and adds unnecessary work; remove the loop or incorporate its intended logic (e.g., filtering/validation) into available_vendor_dirs construction.

Suggested change (delete the dead loop):

```diff
-for vendor_name in os.listdir(_VENDOR_BACKENDS_DIR):
-    vendor_path = os.path.join(_VENDOR_BACKENDS_DIR, vendor_name)
```
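The reviewer's alternative, folding any intended filtering into the construction of `available_vendor_dirs` itself, can be sketched as follows. The directory layout here is fabricated for illustration; the real `_VENDOR_BACKENDS_DIR` lives inside the plugin package:

```python
# Sketch: replace a side-effect-free listdir loop with one comprehension that
# lists the directory once and filters as it goes.
import os
import tempfile

# Assumed layout for illustration: one subdirectory per vendor plus a stray file.
_VENDOR_BACKENDS_DIR = tempfile.mkdtemp()
os.makedirs(os.path.join(_VENDOR_BACKENDS_DIR, "metax"))
open(os.path.join(_VENDOR_BACKENDS_DIR, "README.md"), "w").close()

# Keep only subdirectories; non-directory entries are skipped in the same pass,
# so no separate validation loop is needed.
available_vendor_dirs = {
    name
    for name in os.listdir(_VENDOR_BACKENDS_DIR)
    if os.path.isdir(os.path.join(_VENDOR_BACKENDS_DIR, name))
}
```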
Comment on lines 687 to +691

```python
# Reset the seed to ensure that the random state is not affected by
# the model initialization and profiling.
set_random_seed(self.model_config.seed)

return self.compilation_config.compilation_time
```

Copilot AI commented Mar 31, 2026:

compile_or_warm_up_model returns self.compilation_config.compilation_time, but WorkerFL never defines self.compilation_config (only local compilation_config variables are used earlier). This will raise AttributeError at runtime. Return vllm_config.compilation_config.compilation_time (or another existing attribute) instead.
Comment on lines +341 to +345

```python
def supports_fp8(cls) -> bool:
    if cls.vendor_name == "nvidia":
        return True
    return False
```

Copilot AI commented Mar 31, 2026:

supports_fp8 is defined without @classmethod, but it uses a cls parameter and matches other PlatformFL class APIs. Calling PlatformFL.supports_fp8() will raise a missing-argument TypeError. Add @classmethod (or change it to an instance method and update callers accordingly).
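The fix the reviewer describes can be sketched as below. The class body is reduced to what the PR snippet shows; `vendor_name` as a plain class attribute is an assumption for the sketch:

```python
# Sketch of the suggested fix: with @classmethod, Python passes the class as
# `cls` automatically, so PlatformFL.supports_fp8() works on the class itself.
class PlatformFL:
    vendor_name = "nvidia"  # assumed class attribute, per the PR snippet

    @classmethod
    def supports_fp8(cls) -> bool:
        return cls.vendor_name == "nvidia"

# Without the decorator, this call would raise TypeError because no argument
# is supplied for `cls`; with it, both class- and instance-level calls work.
assert PlatformFL.supports_fp8() is True
```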
Comment on lines 181 to 185

```python
with ExitStack() as stack:
    # during every model forward for piecewise graph
    # mode, we will capture many pieces of graphs
    # (roughly one per layer). running gc again and again
    # across layers will make the graph capture very slow.
    # therefore, we only run gc for the first graph,
    # and disable gc for the rest of the graphs.
    if self.cudagraph_options.gc_disable:
        stack.enter_context(patch("gc.collect", lambda: None))
    # FL-specific: patch our platform's empty_cache
    stack.enter_context(
```

Copilot AI commented Mar 31, 2026:

The ExitStack scope ends before graph capture starts, so the gc_disable patches (gc.collect / PlatformFL.empty_cache) are not active during capture. Move the capture setup inside the ExitStack so the patches apply for the whole capture.
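The scoping issue is easy to demonstrate in isolation: patches entered on an ExitStack revert the moment the `with` block exits, so anything that must see them has to run inside it. A minimal sketch, with `fake_capture` standing in for the real graph-capture step:

```python
# Sketch of the reviewer's point: ExitStack-managed patches are only active
# while the `with` block is open, so capture must happen inside it.
import gc
from contextlib import ExitStack
from unittest.mock import patch

calls = []

def fake_capture():
    # Record which function is currently bound to gc.collect.
    calls.append(gc.collect.__name__)

with ExitStack() as stack:
    stack.enter_context(patch("gc.collect", lambda: None))
    fake_capture()   # inside the stack: gc.collect is the no-op lambda

fake_capture()       # outside: the real gc.collect has been restored
```

Running this records `"<lambda>"` for the call inside the stack and `"collect"` for the one after it, which is exactly why capture setup placed after the `with` block never sees the patches.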
```python
    stack.enter_context(
        patch("vllm_fl.platform.PlatformFL.empty_cache",
              lambda: None))
```

Copilot AI commented Mar 31, 2026:

The patched replacement for PlatformFL.empty_cache is lambda: None. Since empty_cache is normally a @classmethod and may be invoked via the current_platform instance, the patched function can receive an implicit argument and raise TypeError. Use a replacement that accepts *args/**kwargs (or wrap with classmethod) to keep the call signature compatible.

Suggested change:

```diff
-              lambda: None)
+              lambda *args, **kwargs: None)
```
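The signature hazard is reproducible with a toy class: when the replacement is a plain function, an instance call still passes the instance implicitly, so a zero-argument lambda blows up while `*args/**kwargs` absorbs it. A minimal sketch (the `Platform` class is fabricated for illustration):

```python
# Sketch of the suggested fix: the stand-in for a method must tolerate the
# implicit first argument that instance calls pass.
from unittest.mock import patch

class Platform:
    @classmethod
    def empty_cache(cls):
        raise RuntimeError("real cache flush")

# With `lambda: None`, Platform().empty_cache() would raise TypeError because
# the instance is forwarded as a positional argument. Accepting *args/**kwargs
# keeps both call forms working.
with patch.object(Platform, "empty_cache", lambda *args, **kwargs: None):
    assert Platform.empty_cache() is None    # class-level call
    assert Platform().empty_cache() is None  # instance call passes `self`
```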
Comment on lines 34 to 38

```diff
 ### Setup

-1. Install vllm from the official [v0.13.0](https://github.com/vllm-project/vllm/tree/v0.13.0) (optional if the correct version is installed) or from the fork [vllm-FL](https://github.com/flagos-ai/vllm-FL).
+1. Install vllm from the official [v0.18.1](https://github.com/vllm-project/vllm/tree/v0.18.1) (optional if the correct version is installed) or from the fork [vllm-FL](https://github.com/flagos-ai/vllm-FL).
```

Copilot AI commented Mar 31, 2026:

PR title/description says upgrade to vLLM 0.18.0, but this change updates docs/source references to v0.18.1 (and several file headers also reference v0.18.1). Please align the stated target version (either update PR metadata to 0.18.1 or change the references back to 0.18.0) to avoid confusion about the required dependency version.
@github-actions github-actions bot added the tests label Apr 2, 2026
@ceci3 ceci3 changed the title upgrade vllm to 0.18.0 upgrade vllm to 0.18.1 Apr 3, 2026
ceci3 and others added 4 commits April 3, 2026 17:05
### PR Category
Vendor

### PR Type
New Features

### Description
This pull request adds support for the MUSA hardware backend throughout
the codebase, enabling vLLM-FL to run on MUSA devices with appropriate
configuration, device handling, and operator dispatch. The main changes
include platform detection, device context management, configuration
updates, and backend selection logic for MUSA.

Platform and Device Support:

* Added detection and handling for the "musa" platform in platform utilities and device capability queries, including `is_cuda_alike`, `is_cuda`, and a new `is_musa` method in `platform.py`.
* Updated device context management in `flagcx.py` to use
`torch.musa.device` when running on MUSA hardware.

Operator Dispatch and Configuration:

* Introduced a new `musa.yaml` dispatch configuration file specifying
backend preferences, operator backend order, and blacklists for MUSA
hardware.

Graph and Execution Support:

* Added support for `torch.musa.MUSAGraph` in the graph compilation
logic to enable graph execution on MUSA devices.

These changes collectively ensure that vLLM-FL can detect, configure,
and efficiently utilize MUSA hardware in a manner similar to CUDA and
other supported platforms.
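The platform-detection pattern this description outlines can be sketched roughly as below. This is a hypothetical illustration, not the actual vLLM-FL code: the `torch_musa` module probe and the exact membership of the CUDA-alike set are assumptions:

```python
# Hypothetical sketch of MUSA platform detection in the style described above.
import importlib.util

def is_musa() -> bool:
    # torch_musa ships MUSA support as an extension package; probing for the
    # module spec avoids importing torch eagerly just to check availability.
    return importlib.util.find_spec("torch_musa") is not None

def is_cuda_alike(platform: str) -> bool:
    # CUDA-alike platforms share the CUDA programming model, so they can
    # reuse CUDA dispatch paths; the set here is illustrative.
    return platform in ("cuda", "rocm", "musa")

assert is_cuda_alike("musa")
assert not is_cuda_alike("cpu")
```

In the same spirit, the graph-compilation change swaps `torch.cuda.CUDAGraph` for `torch.musa.MUSAGraph` when `is_musa()` holds.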

### Related Issues

### Changes
-

### Testing
-

### Checklist
- [ ] I have run the existing tests and they pass
- [ ] I have added tests for my changes (if applicable)
- [ ] I have updated the documentation (if applicable)

---------

Co-authored-by: jiamingwang-mt <[email protected]>
Co-authored-by: Copilot Autofix powered by AI <[email protected]>
Co-authored-by: keennddyl <[email protected]>
Co-authored-by: hozier <[email protected]>
Co-authored-by: cyber-pioneer <[email protected]>
### PR Category
Others

### PR Type
Test Case

### Description
Add e2e precision test for BGE-M3 embedding model covering three pooling modes:
- **Dense**: cosine similarity between query and passage embeddings
- **Lexical (sparse BM25)**: weighted token overlap score via `/tokenize` + `/pooling` (task=token_classify)
- **ColBERT (multi-vector)**: MaxSim score via `/pooling` (task=token_embed)

Also removes the outdated vLLM 0.13 backport note from README and updates the BAAI/bge-m3 entry to point to the implementation.

### Related Issues
-

### Changes
- `tests/e2e_tests/serving/test_bge_m3.py`: new e2e test (179 lines) covering dense, lexical, and ColBERT precision validation
- `README.md`: remove vLLM 0.13 backport section; update BAAI/bge-m3 row to link to the implementation

### Testing
- `pytest tests/e2e_tests/serving/test_bge_m3.py -v`
- Requires a running server: `vllm serve BAAI/bge-m3 --hf-overrides '{"architectures":["BgeM3EmbeddingModel"]}'`

### Checklist
- [x] I have run the existing tests and they pass
- [x] I have added tests for my changes
- [ ] I have updated the documentation

---------

Co-authored-by: ceci3 <[email protected]>
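The ColBERT (multi-vector) mode scored in that test uses MaxSim: each query token embedding is matched against its best-scoring passage token embedding, and the maxima are summed. An illustrative plain-Python sketch of that scoring rule (not the test's actual code, which goes through the `/pooling` endpoint):

```python
# Illustrative MaxSim scoring for ColBERT-style multi-vector retrieval:
# sum over query token vectors of the best dot product against any passage
# token vector.
def maxsim(query_vecs, passage_vecs):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, p) for p in passage_vecs) for q in query_vecs)

# Toy 2-d example: first query token matches the first passage token exactly
# (score 1.0); the second matches the mixed token best (score 0.5).
query = [[1.0, 0.0], [0.0, 1.0]]
passage = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim(query, passage)  # 1.0 + 0.5 = 1.5
```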
### PR Category
CICD

### PR Type
CI images and CI workflow configs

### Description
Catch up the versions of vLLM

### Related Issues
No

### Changes
- dockerfiles
- ci workflow config

### Testing
The workflow should be broken till core changes made in vllm-fl

### Checklist
@github-actions github-actions bot added the build label Apr 6, 2026
@ceci3 ceci3 merged commit 4dfed0f into flagos-ai:main Apr 6, 2026
12 of 17 checks passed

8 participants