
upgrade vllm to 0.18.1 #112

Merged

ceci3 merged 9 commits into flagos-ai:main from ceci3:upgrade_plugin on Apr 6, 2026
Conversation

ceci3 (Collaborator) commented Mar 31, 2026

PR Category

Core

PR Type

Improvements

Description

upgrade vllm to 0.18.1

Related Issues

Changes

Testing

Checklist

  • I have run the existing tests and they pass
  • I have added tests for my changes (if applicable)
  • I have updated the documentation (if applicable)



```python
# Re-export CUDAGraphStat for compatibility
from vllm.compilation.cuda_graph import CUDAGraphStat  # noqa: F401, E402
```

Check notice (Code scanning / CodeQL): Unused import (Note). Import of 'CUDAGraphStat' is not used.
```python
try:
    from vllm.model_executor.offloader.base import get_offloader
    get_offloader().sync_prev_onload()
except (ImportError, RuntimeError):
    pass
```

Check notice (Code scanning / CodeQL): Empty except (Note). 'except' clause does nothing but pass and there is no explanatory comment.
```python
try:
    from vllm.model_executor.offloader.base import get_offloader
    get_offloader().join_after_forward()
except (ImportError, RuntimeError):
    pass
```

Check notice (Code scanning / CodeQL): Empty except (Note). 'except' clause does nothing but pass and there is no explanatory comment.
```python
try:
    from vllm.model_executor.offloader.base import get_offloader
    get_offloader().sync_prev_onload()
except (ImportError, RuntimeError):
    pass
```

Check notice (Code scanning / CodeQL): Empty except (Note). 'except' clause does nothing but pass and there is no explanatory comment.
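One way to satisfy the empty-except lint above is `contextlib.suppress`, which states the intent directly instead of relying on a bare `pass`. A minimal sketch, assuming the same vLLM offloader API shown in the flagged snippets (the import will simply be suppressed on builds where it does not exist):

```python
# Sketch: expressing the flagged try/except pattern with contextlib.suppress,
# so no empty `except` body is needed and the intent is self-documenting.
import contextlib

def sync_offloader_if_available():
    # vllm.model_executor.offloader.base may be absent on older vLLM builds,
    # and get_offloader() can raise RuntimeError when no offloader is set up;
    # both cases are deliberately treated as "nothing to sync".
    with contextlib.suppress(ImportError, RuntimeError):
        from vllm.model_executor.offloader.base import get_offloader
        get_offloader().sync_prev_onload()

sync_offloader_if_available()  # no-op when vLLM (or the offloader) is absent
```

Alternatively, keeping `pass` with an explanatory comment (as the CodeQL message hints) also clears the notice.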
CLAassistant commented Mar 31, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
4 out of 5 committers have signed the CLA.

✅ ceci3
✅ keennddyl
✅ xmhubj
✅ li199959
❌ cyberpioneer


cyberpioneer seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.


```python
def supports_fp8(cls) -> bool:
    if cls.vendor_name == "nvidia":
        return True
```

Check notice (Code scanning / CodeQL): First parameter of a method is not named 'self' (Note). Normal methods should have 'self', rather than 'cls', as their first parameter.
Copilot AI left a comment (Contributor)
Pull request overview

This PR updates the vllm-plugin-FL integration to be compatible with vLLM 0.18.x by migrating to the new vllm.v1.* attention APIs, adjusting worker/model loading flows, and removing several FL model/config implementations that are now upstream in vLLM.

Changes:

  • Update platform/worker/dispatch codepaths to match vLLM 0.18.x APIs (notably vllm.v1.attention.*, config scoping, profiling/compile return values).
  • Refresh fused MoE and vendor backend integrations to new module locations and kernel entrypoints.
  • Remove FL-local model/config files now provided upstream; update docs and dispatch configs accordingly.

Reviewed changes

Copilot reviewed 26 out of 40 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| vllm_fl/worker/worker.py | Align worker APIs with vLLM 0.18.x (config scoping, load_model signature, new helper methods). |
| vllm_fl/utils.py | Whitespace cleanup. |
| vllm_fl/platform.py | Migrate to vllm.v1.attention.* APIs; adjust device metadata and platform hooks. |
| vllm_fl/patches/glm_moe_dsa.py | Remove now-unneeded MLA recognition patch; keep remaining GLM-5 patches. |
| vllm_fl/ops/fused_moe/layer.py | Update fused-moe router imports and forward_native signature for vLLM 0.18.x. |
| vllm_fl/ops/fused_moe/fused_moe.py | Update fused-moe kernel dispatch API; normalize activation enum handling; keep chunking behavior. |
| vllm_fl/ops/custom_ops.py | Formatting/indent alignment. |
| vllm_fl/models/qwen3_next.py | Removed (now upstream in vLLM 0.18.1 per PR notes). |
| vllm_fl/models/qwen3_5.py | Removed (now upstream in vLLM 0.18.1 per PR notes). |
| vllm_fl/models/minicpmo.py | Removed (now upstream in vLLM 0.18.1 per PR notes). |
| vllm_fl/models/kimi_k25.py | Removed (now upstream in vLLM 0.18.1 per PR notes). |
| vllm_fl/models/glm_moe_dsa.py | Removed (wrapper no longer registered in plugin). |
| vllm_fl/dispatch/logger_manager.py | Change log format to include logger name instead of filename/lineno. |
| vllm_fl/dispatch/config/utils.py | Minor whitespace/format cleanup. |
| vllm_fl/dispatch/config/nvidia.yaml | Enable allow_vendors: [cuda] in NVIDIA platform config. |
| vllm_fl/dispatch/config/init.py | Import ordering tweak. |
| vllm_fl/dispatch/builtin_ops.py | Adjust vendor backend discovery/registration (plus new loop added). |
| vllm_fl/dispatch/backends/vendor/metax/metax.py | Switch registry imports to vllm.v1.attention.backends.registry. |
| vllm_fl/dispatch/backends/vendor/metax/impl/attention/utils/fa_utils.py | Replace upstream logger import with a local logger instantiation. |
| vllm_fl/dispatch/backends/vendor/metax/impl/attention/ops/merge_attn_states.py | Update import path to vllm.v1.attention.ops.triton_merge_attn_states. |
| vllm_fl/dispatch/backends/vendor/metax/impl/attention/mla/flashmla.py | Migrate to vLLM v1 attention backend APIs and batch-invariant flag usage. |
| vllm_fl/dispatch/backends/vendor/metax/impl/attention/mla/common.py | Migrate to vLLM v1 attention backend APIs and logging import changes. |
| vllm_fl/dispatch/backends/vendor/metax/impl/attention/flash_attn.py | Migrate to vLLM v1 attention backend APIs and batch-invariant flag usage. |
| vllm_fl/dispatch/backends/vendor/iluvatar/iluvatar.py | Update registry import to vllm.v1.attention.backends.registry. |
| vllm_fl/dispatch/backends/vendor/cuda/impl/activation.py | Call vLLM C++ ops via torch.ops._C.* instead of vllm._custom_ops. |
| vllm_fl/dispatch/backends/vendor/cuda/cuda.py | Update registry import to vllm.v1.attention.backends.registry. |
| vllm_fl/dispatch/backends/vendor/ascend/impl/mm_encoder_attention.py | Update MM encoder attention import path to vllm.model_executor.layers.attention.*. |
| vllm_fl/dispatch/backends/vendor/ascend/impl/causal_conv1d.py | Update PAD_SLOT_ID import path to vllm.v1.attention.backends.utils. |
| vllm_fl/dispatch/backends/vendor/ascend/impl/attention.py | Migrate ascend attention backend to vLLM v1 attention backend APIs. |
| vllm_fl/dispatch/backends/reference/reference.py | Update registry import to vllm.v1.attention.backends.registry. |
| vllm_fl/dispatch/backends/flaggems/impl/mla.py | Update MLA/common imports to vLLM v1 locations. |
| vllm_fl/dispatch/backends/flaggems/impl/custom_attention.py | Update registry import to vllm.v1.attention.backends.registry. |
| vllm_fl/dispatch/backends/flaggems/impl/attention.py | Migrate FlagGems attention backend to vLLM v1 attention backend APIs. |
| vllm_fl/dispatch/backends/flaggems/flaggems.py | Update registry import to vllm.v1.attention.backends.registry. |
| vllm_fl/configs/qwen3_5_moe.py | Removed (now upstream in vLLM 0.18.1 per PR notes). |
| vllm_fl/compilation/graph.py | Introduce platform-agnostic graph capture wrapper; add weak-set tracking and offloader sync hooks. |
| vllm_fl/attention/utils.py | Update MM encoder attention patching imports to new vLLM locations. |
| vllm_fl/init.py | Stop registering models/configs that are upstream; keep GLM-5 config registration + patches. |
| README.md | Update install docs to vLLM v0.18.1 and pin FlagGems checkout. |


Comment on lines +77 to +79

```python
for vendor_name in os.listdir(_VENDOR_BACKENDS_DIR):
    vendor_path = os.path.join(_VENDOR_BACKENDS_DIR, vendor_name)
```

Copilot AI commented Mar 31, 2026:

The loop over vendor backends (vendor_name/vendor_path) has no side effects and is immediately followed by a set comprehension that re-lists the same directory. This is dead code and adds unnecessary work; remove the loop or incorporate its intended logic (e.g., filtering/validation) into available_vendor_dirs construction.

Suggested change (delete the dead loop):

```diff
-for vendor_name in os.listdir(_VENDOR_BACKENDS_DIR):
-    vendor_path = os.path.join(_VENDOR_BACKENDS_DIR, vendor_name)
```
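The reviewer's alternative, folding any intended filtering into the construction of `available_vendor_dirs` itself, can be sketched as follows. The directory layout here is fabricated for illustration; the real `_VENDOR_BACKENDS_DIR` lives inside the plugin package:

```python
# Sketch: replace a side-effect-free listdir loop with one comprehension that
# lists the directory once and filters as it goes.
import os
import tempfile

# Assumed layout for illustration: one subdirectory per vendor plus a stray file.
_VENDOR_BACKENDS_DIR = tempfile.mkdtemp()
os.makedirs(os.path.join(_VENDOR_BACKENDS_DIR, "metax"))
open(os.path.join(_VENDOR_BACKENDS_DIR, "README.md"), "w").close()

# Keep only subdirectories; non-directory entries are skipped in the same pass,
# so no separate validation loop is needed.
available_vendor_dirs = {
    name
    for name in os.listdir(_VENDOR_BACKENDS_DIR)
    if os.path.isdir(os.path.join(_VENDOR_BACKENDS_DIR, name))
}
```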
Comment on lines 687 to +691

```python
# Reset the seed to ensure that the random state is not affected by
# the model initialization and profiling.
set_random_seed(self.model_config.seed)

return self.compilation_config.compilation_time
```

Copilot AI commented Mar 31, 2026:

compile_or_warm_up_model returns self.compilation_config.compilation_time, but WorkerFL never defines self.compilation_config (only local compilation_config variables are used earlier). This will raise AttributeError at runtime. Return vllm_config.compilation_config.compilation_time (or another existing attribute) instead.
Comment on lines +341 to +345

```python
def supports_fp8(cls) -> bool:
    if cls.vendor_name == "nvidia":
        return True
    return False
```

Copilot AI commented Mar 31, 2026:

supports_fp8 is defined without @classmethod, but it uses a cls parameter and matches other PlatformFL class APIs. Calling PlatformFL.supports_fp8() will raise a missing-argument TypeError. Add @classmethod (or change it to an instance method and update callers accordingly).
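The fix the reviewer describes can be sketched as below. The class body is reduced to what the PR snippet shows; `vendor_name` as a plain class attribute is an assumption for the sketch:

```python
# Sketch of the suggested fix: with @classmethod, Python passes the class as
# `cls` automatically, so PlatformFL.supports_fp8() works on the class itself.
class PlatformFL:
    vendor_name = "nvidia"  # assumed class attribute, per the PR snippet

    @classmethod
    def supports_fp8(cls) -> bool:
        return cls.vendor_name == "nvidia"

# Without the decorator, this call would raise TypeError because no argument
# is supplied for `cls`; with it, both class- and instance-level calls work.
assert PlatformFL.supports_fp8() is True
```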
Comment on lines 181 to 185

```python
with ExitStack() as stack:
    # during every model forward for piecewise graph
    # mode, we will capture many pieces of graphs
    # (roughly one per layer). running gc again and again
    # across layers will make the graph capture very slow.
    # therefore, we only run gc for the first graph,
    # and disable gc for the rest of the graphs.
    if self.cudagraph_options.gc_disable:
        stack.enter_context(patch("gc.collect", lambda: None))
    # FL-specific: patch our platform's empty_cache
    stack.enter_context(
```

Copilot AI commented Mar 31, 2026:

The ExitStack scope ends before graph capture starts, so the gc_disable patches (gc.collect / PlatformFL.empty_cache) are not active during capture. Move the capture setup inside the ExitStack so the patches apply for the whole capture.
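The scoping issue is easy to demonstrate in isolation: patches entered on an ExitStack revert the moment the `with` block exits, so anything that must see them has to run inside it. A minimal sketch, with `fake_capture` standing in for the real graph-capture step:

```python
# Sketch of the reviewer's point: ExitStack-managed patches are only active
# while the `with` block is open, so capture must happen inside it.
import gc
from contextlib import ExitStack
from unittest.mock import patch

calls = []

def fake_capture():
    # Record which function is currently bound to gc.collect.
    calls.append(gc.collect.__name__)

with ExitStack() as stack:
    stack.enter_context(patch("gc.collect", lambda: None))
    fake_capture()   # inside the stack: gc.collect is the no-op lambda

fake_capture()       # outside: the real gc.collect has been restored
```

Running this records `"<lambda>"` for the call inside the stack and `"collect"` for the one after it, which is exactly why capture setup placed after the `with` block never sees the patches.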
```python
    stack.enter_context(
        patch("vllm_fl.platform.PlatformFL.empty_cache",
              lambda: None))
```

Copilot AI commented Mar 31, 2026:

The patched replacement for PlatformFL.empty_cache is lambda: None. Since empty_cache is normally a @classmethod and may be invoked via the current_platform instance, the patched function can receive an implicit argument and raise TypeError. Use a replacement that accepts *args/**kwargs (or wrap with classmethod) to keep the call signature compatible.

Suggested change:

```diff
-              lambda: None)
+              lambda *args, **kwargs: None)
```
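The signature hazard is reproducible with a toy class: when the replacement is a plain function, an instance call still passes the instance implicitly, so a zero-argument lambda blows up while `*args/**kwargs` absorbs it. A minimal sketch (the `Platform` class is fabricated for illustration):

```python
# Sketch of the suggested fix: the stand-in for a method must tolerate the
# implicit first argument that instance calls pass.
from unittest.mock import patch

class Platform:
    @classmethod
    def empty_cache(cls):
        raise RuntimeError("real cache flush")

# With `lambda: None`, Platform().empty_cache() would raise TypeError because
# the instance is forwarded as a positional argument. Accepting *args/**kwargs
# keeps both call forms working.
with patch.object(Platform, "empty_cache", lambda *args, **kwargs: None):
    assert Platform.empty_cache() is None    # class-level call
    assert Platform().empty_cache() is None  # instance call passes `self`
```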
Comment on lines 34 to 38

```diff
 ### Setup

-1. Install vllm from the official [v0.13.0](https://github.com/vllm-project/vllm/tree/v0.13.0) (optional if the correct version is installed) or from the fork [vllm-FL](https://github.com/flagos-ai/vllm-FL).
+1. Install vllm from the official [v0.18.1](https://github.com/vllm-project/vllm/tree/v0.18.1) (optional if the correct version is installed) or from the fork [vllm-FL](https://github.com/flagos-ai/vllm-FL).
```

Copilot AI commented Mar 31, 2026:

PR title/description says upgrade to vLLM 0.18.0, but this change updates docs/source references to v0.18.1 (and several file headers also reference v0.18.1). Please align the stated target version (either update PR metadata to 0.18.1 or change the references back to 0.18.0) to avoid confusion about the required dependency version.
@github-actions github-actions bot added the tests label Apr 2, 2026
@ceci3 ceci3 changed the title upgrade vllm to 0.18.0 upgrade vllm to 0.18.1 Apr 3, 2026
ceci3 and others added 4 commits April 3, 2026 17:05
### PR Category
Vendor

### PR Type
New Features

### Description
This pull request adds support for the MUSA hardware backend throughout
the codebase, enabling vLLM-FL to run on MUSA devices with appropriate
configuration, device handling, and operator dispatch. The main changes
include platform detection, device context management, configuration
updates, and backend selection logic for MUSA.

Platform and Device Support:

* Added detection and handling for the "musa" platform in platform utilities and device capability queries, including `is_cuda_alike`, `is_cuda`, and a new `is_musa` method in `platform.py`.
* Updated device context management in `flagcx.py` to use
`torch.musa.device` when running on MUSA hardware.

Operator Dispatch and Configuration:

* Introduced a new `musa.yaml` dispatch configuration file specifying
backend preferences, operator backend order, and blacklists for MUSA
hardware.

Graph and Execution Support:

* Added support for `torch.musa.MUSAGraph` in the graph compilation
logic to enable graph execution on MUSA devices.

These changes collectively ensure that vLLM-FL can detect, configure,
and efficiently utilize MUSA hardware in a manner similar to CUDA and
other supported platforms.
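The platform-detection pattern this description outlines can be sketched roughly as below. This is a hypothetical illustration, not the actual vLLM-FL code: the `torch_musa` module probe and the exact membership of the CUDA-alike set are assumptions:

```python
# Hypothetical sketch of MUSA platform detection in the style described above.
import importlib.util

def is_musa() -> bool:
    # torch_musa ships MUSA support as an extension package; probing for the
    # module spec avoids importing torch eagerly just to check availability.
    return importlib.util.find_spec("torch_musa") is not None

def is_cuda_alike(platform: str) -> bool:
    # CUDA-alike platforms share the CUDA programming model, so they can
    # reuse CUDA dispatch paths; the set here is illustrative.
    return platform in ("cuda", "rocm", "musa")

assert is_cuda_alike("musa")
assert not is_cuda_alike("cpu")
```

In the same spirit, the graph-compilation change swaps `torch.cuda.CUDAGraph` for `torch.musa.MUSAGraph` when `is_musa()` holds.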

### Related Issues

### Changes
-

### Testing
-

### Checklist
- [ ] I have run the existing tests and they pass
- [ ] I have added tests for my changes (if applicable)
- [ ] I have updated the documentation (if applicable)

---------

Co-authored-by: jiamingwang-mt <[email protected]>
Co-authored-by: Copilot Autofix powered by AI <[email protected]>
Co-authored-by: keennddyl <[email protected]>
Co-authored-by: hozier <[email protected]>
Co-authored-by: cyber-pioneer <[email protected]>
### PR Category
Others

### PR Type
Test Case

### Description
Add e2e precision test for BGE-M3 embedding model covering three pooling modes:
- **Dense**: cosine similarity between query and passage embeddings
- **Lexical (sparse BM25)**: weighted token overlap score via `/tokenize` + `/pooling` (task=token_classify)
- **ColBERT (multi-vector)**: MaxSim score via `/pooling` (task=token_embed)

Also removes the outdated vLLM 0.13 backport note from README and updates the BAAI/bge-m3 entry to point to the implementation.

### Related Issues
-

### Changes
- `tests/e2e_tests/serving/test_bge_m3.py`: new e2e test (179 lines) covering dense, lexical, and ColBERT precision validation
- `README.md`: remove vLLM 0.13 backport section; update BAAI/bge-m3 row to link to the implementation

### Testing
- `pytest tests/e2e_tests/serving/test_bge_m3.py -v`
- Requires a running server: `vllm serve BAAI/bge-m3 --hf-overrides '{"architectures":["BgeM3EmbeddingModel"]}'`

### Checklist
- [x] I have run the existing tests and they pass
- [x] I have added tests for my changes
- [ ] I have updated the documentation

---------

Co-authored-by: ceci3 <[email protected]>
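The ColBERT (multi-vector) mode scored in that test uses MaxSim: each query token embedding is matched against its best-scoring passage token embedding, and the maxima are summed. An illustrative plain-Python sketch of that scoring rule (not the test's actual code, which goes through the `/pooling` endpoint):

```python
# Illustrative MaxSim scoring for ColBERT-style multi-vector retrieval:
# sum over query token vectors of the best dot product against any passage
# token vector.
def maxsim(query_vecs, passage_vecs):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, p) for p in passage_vecs) for q in query_vecs)

# Toy 2-d example: first query token matches the first passage token exactly
# (score 1.0); the second matches the mixed token best (score 0.5).
query = [[1.0, 0.0], [0.0, 1.0]]
passage = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim(query, passage)  # 1.0 + 0.5 = 1.5
```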
### PR Category
CICD

### PR Type
CI images and CI workflow configs

### Description
Catch up the versions of vLLM

### Related Issues
No

### Changes
- dockerfiles
- ci workflow config

### Testing
The workflow should be broken till core changes made in vllm-fl

### Checklist
@github-actions github-actions bot added the build label Apr 6, 2026
@ceci3 ceci3 merged commit 4dfed0f into flagos-ai:main Apr 6, 2026
12 of 17 checks passed

8 participants