Conversation
```python
# Re-export CUDAGraphStat for compatibility
from vllm.compilation.cuda_graph import CUDAGraphStat  # noqa: F401, E402
```
Check notice (Code scanning / CodeQL): Unused import
```python
try:
    from vllm.model_executor.offloader.base import get_offloader
    get_offloader().sync_prev_onload()
except (ImportError, RuntimeError):
```
Check notice (Code scanning / CodeQL): Empty except
```python
try:
    from vllm.model_executor.offloader.base import get_offloader
    get_offloader().join_after_forward()
except (ImportError, RuntimeError):
```
Check notice (Code scanning / CodeQL): Empty except
```python
try:
    from vllm.model_executor.offloader.base import get_offloader
    get_offloader().sync_prev_onload()
except (ImportError, RuntimeError):
```
Check notice (Code scanning / CodeQL): Empty except
cyberpioneer does not appear to be a GitHub user. You need a GitHub account to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. Already signed the CLA but the status is still pending? Let us recheck it.
Pull request overview
This PR updates the vllm-plugin-FL integration to be compatible with vLLM 0.18.x by migrating to the new vllm.v1.* attention APIs, adjusting worker/model loading flows, and removing several FL model/config implementations that are now upstream in vLLM.
Changes:
- Update platform/worker/dispatch codepaths to match vLLM 0.18.x APIs (notably vllm.v1.attention.*, config scoping, profiling/compile return values).
- Refresh fused MoE and vendor backend integrations to new module locations and kernel entrypoints.
- Remove FL-local model/config files now provided upstream; update docs and dispatch configs accordingly.
Reviewed changes
Copilot reviewed 26 out of 40 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| vllm_fl/worker/worker.py | Align worker APIs with vLLM 0.18.x (config scoping, load_model signature, new helper methods). |
| vllm_fl/utils.py | Whitespace cleanup. |
| vllm_fl/platform.py | Migrate to vllm.v1.attention.* APIs; adjust device metadata and platform hooks. |
| vllm_fl/patches/glm_moe_dsa.py | Remove now-unneeded MLA recognition patch; keep remaining GLM-5 patches. |
| vllm_fl/ops/fused_moe/layer.py | Update fused-moe router imports and forward_native signature for vLLM 0.18.x. |
| vllm_fl/ops/fused_moe/fused_moe.py | Update fused-moe kernel dispatch API; normalize activation enum handling; keep chunking behavior. |
| vllm_fl/ops/custom_ops.py | Formatting/indent alignment. |
| vllm_fl/models/qwen3_next.py | Removed (now upstream in vLLM 0.18.1 per PR notes). |
| vllm_fl/models/qwen3_5.py | Removed (now upstream in vLLM 0.18.1 per PR notes). |
| vllm_fl/models/minicpmo.py | Removed (now upstream in vLLM 0.18.1 per PR notes). |
| vllm_fl/models/kimi_k25.py | Removed (now upstream in vLLM 0.18.1 per PR notes). |
| vllm_fl/models/glm_moe_dsa.py | Removed (wrapper no longer registered in plugin). |
| vllm_fl/dispatch/logger_manager.py | Change log format to include logger name instead of filename/lineno. |
| vllm_fl/dispatch/config/utils.py | Minor whitespace/format cleanup. |
| vllm_fl/dispatch/config/nvidia.yaml | Enable allow_vendors: [cuda] in NVIDIA platform config. |
| vllm_fl/dispatch/config/__init__.py | Import ordering tweak. |
| vllm_fl/dispatch/builtin_ops.py | Adjust vendor backend discovery/registration (plus new loop added). |
| vllm_fl/dispatch/backends/vendor/metax/metax.py | Switch registry imports to vllm.v1.attention.backends.registry. |
| vllm_fl/dispatch/backends/vendor/metax/impl/attention/utils/fa_utils.py | Replace upstream logger import with a local logger instantiation. |
| vllm_fl/dispatch/backends/vendor/metax/impl/attention/ops/merge_attn_states.py | Update import path to vllm.v1.attention.ops.triton_merge_attn_states. |
| vllm_fl/dispatch/backends/vendor/metax/impl/attention/mla/flashmla.py | Migrate to vLLM v1 attention backend APIs and batch-invariant flag usage. |
| vllm_fl/dispatch/backends/vendor/metax/impl/attention/mla/common.py | Migrate to vLLM v1 attention backend APIs and logging import changes. |
| vllm_fl/dispatch/backends/vendor/metax/impl/attention/flash_attn.py | Migrate to vLLM v1 attention backend APIs and batch-invariant flag usage. |
| vllm_fl/dispatch/backends/vendor/iluvatar/iluvatar.py | Update registry import to vllm.v1.attention.backends.registry. |
| vllm_fl/dispatch/backends/vendor/cuda/impl/activation.py | Call vLLM C++ ops via torch.ops._C.* instead of vllm._custom_ops. |
| vllm_fl/dispatch/backends/vendor/cuda/cuda.py | Update registry import to vllm.v1.attention.backends.registry. |
| vllm_fl/dispatch/backends/vendor/ascend/impl/mm_encoder_attention.py | Update MM encoder attention import path to vllm.model_executor.layers.attention.*. |
| vllm_fl/dispatch/backends/vendor/ascend/impl/causal_conv1d.py | Update PAD_SLOT_ID import path to vllm.v1.attention.backends.utils. |
| vllm_fl/dispatch/backends/vendor/ascend/impl/attention.py | Migrate ascend attention backend to vLLM v1 attention backend APIs. |
| vllm_fl/dispatch/backends/reference/reference.py | Update registry import to vllm.v1.attention.backends.registry. |
| vllm_fl/dispatch/backends/flaggems/impl/mla.py | Update MLA/common imports to vLLM v1 locations. |
| vllm_fl/dispatch/backends/flaggems/impl/custom_attention.py | Update registry import to vllm.v1.attention.backends.registry. |
| vllm_fl/dispatch/backends/flaggems/impl/attention.py | Migrate FlagGems attention backend to vLLM v1 attention backend APIs. |
| vllm_fl/dispatch/backends/flaggems/flaggems.py | Update registry import to vllm.v1.attention.backends.registry. |
| vllm_fl/configs/qwen3_5_moe.py | Removed (now upstream in vLLM 0.18.1 per PR notes). |
| vllm_fl/compilation/graph.py | Introduce platform-agnostic graph capture wrapper; add weak-set tracking and offloader sync hooks. |
| vllm_fl/attention/utils.py | Update MM encoder attention patching imports to new vLLM locations. |
| vllm_fl/__init__.py | Stop registering models/configs that are upstream; keep GLM-5 config registration + patches. |
| README.md | Update install docs to vLLM v0.18.1 and pin FlagGems checkout. |
vllm_fl/dispatch/builtin_ops.py (outdated)
```python
for vendor_name in os.listdir(_VENDOR_BACKENDS_DIR):
    vendor_path = os.path.join(_VENDOR_BACKENDS_DIR, vendor_name)
```
The loop over vendor backends (vendor_name/vendor_path) has no side effects and is immediately followed by a set comprehension that re-lists the same directory. This is dead code and adds unnecessary work; remove the loop or incorporate its intended logic (e.g., filtering/validation) into available_vendor_dirs construction.
Suggested change (delete the dead loop):

```diff
-for vendor_name in os.listdir(_VENDOR_BACKENDS_DIR):
-    vendor_path = os.path.join(_VENDOR_BACKENDS_DIR, vendor_name)
```
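If the loop was meant to validate entries, that filtering can live in the comprehension itself so the directory is listed only once. A hedged sketch; `discover_vendor_dirs` and the specific filters are illustrative, not the plugin's actual helper:

```python
import os

def discover_vendor_dirs(vendor_backends_dir: str) -> set[str]:
    # One pass over the directory: keep only real subdirectories and
    # drop private/cache entries such as __pycache__.
    return {
        name
        for name in os.listdir(vendor_backends_dir)
        if os.path.isdir(os.path.join(vendor_backends_dir, name))
        and not name.startswith("_")
    }
```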
```python
# Reset the seed to ensure that the random state is not affected by
# the model initialization and profiling.
set_random_seed(self.model_config.seed)

return self.compilation_config.compilation_time
```
compile_or_warm_up_model returns self.compilation_config.compilation_time, but WorkerFL never defines self.compilation_config (only local compilation_config variables are used earlier). This will raise AttributeError at runtime. Return vllm_config.compilation_config.compilation_time (or another existing attribute) instead.
```python
def supports_fp8(cls) -> bool:
    if cls.vendor_name == "nvidia":
        return True
    return False
```
supports_fp8 is defined without @classmethod, but it uses a cls parameter and matches other PlatformFL class APIs. Calling PlatformFL.supports_fp8() will raise a missing-argument TypeError. Add @classmethod (or change it to an instance method and update callers accordingly).
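The fix is a one-line decorator. A toy reproduction (this `PlatformFL` and its `vendor_name` values are a simplified stand-in, not the real class):

```python
class PlatformFL:
    vendor_name = "nvidia"

    @classmethod
    def supports_fp8(cls) -> bool:
        # With @classmethod, cls is bound automatically, so
        # PlatformFL.supports_fp8() no longer raises a
        # missing-argument TypeError.
        return cls.vendor_name == "nvidia"

class MetaxPlatform(PlatformFL):
    vendor_name = "metax"
```

Subclasses inherit the check and answer for their own `vendor_name`.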
```diff
 with ExitStack() as stack:
-    if self.graph_options.gc_disable:
         # during every model forward for piecewise graph
         # mode, we will capture many pieces of graphs
         # (roughly one per layer). running gc again and again
         # across layers will make the graph capture very slow.
         # therefore, we only run gc for the first graph,
         # and disable gc for the rest of the graphs.
+    if self.cudagraph_options.gc_disable:
         stack.enter_context(patch("gc.collect", lambda: None))
         # FL-specific: patch our platform's empty_cache
         stack.enter_context(
```
The ExitStack scope ends before graph capture starts, so the gc_disable patches (gc.collect / PlatformFL.empty_cache) are not active during capture. Move the capture setup inside the ExitStack so the patches apply for the whole capture.
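The intended shape of the fix can be sketched with a toy `capture_graphs` helper (names are illustrative, not the plugin's actual API): the capture call sits inside the `with ExitStack()` block, so the patches stay active for the whole capture, not just during setup.

```python
import gc
from contextlib import ExitStack
from unittest.mock import patch

def capture_graphs(gc_disable: bool, do_capture):
    with ExitStack() as stack:
        if gc_disable:
            # no-op gc.collect while graphs are recorded
            stack.enter_context(patch("gc.collect", lambda *a, **k: None))
        # capture happens inside the `with`, so the patch applies here
        return do_capture()
```

Once the `with` block exits, `gc.collect` is restored automatically by the ExitStack unwinding the patch.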
```diff
 stack.enter_context(
-    patch("vllm_fl.platform.PlatformFL.empty_cache", lambda: None)
+    patch("vllm_fl.platform.PlatformFL.empty_cache",
+          lambda: None)
```
The patched replacement for PlatformFL.empty_cache is lambda: None. Since empty_cache is normally a @classmethod and may be invoked via the current_platform instance, the patched function can receive an implicit argument and raise TypeError. Use a replacement that accepts *args/**kwargs (or wrap with classmethod) to keep the call signature compatible.
Suggested change:

```diff
-    lambda: None)
+    lambda *args, **kwargs: None)
```
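The signature hazard is easy to reproduce with a toy class (this `PlatformFL` is a stand-in, not the real one): patching a classmethod attribute with a zero-argument function breaks instance-style calls, while a `*args/**kwargs` replacement keeps both call styles working.

```python
from unittest.mock import patch

class PlatformFL:
    @classmethod
    def empty_cache(cls):
        raise RuntimeError("would touch the device")

platform = PlatformFL()

# A plain function stored on the class becomes a bound method on
# instance access, so the implicit instance argument breaks the call:
broke = False
with patch.object(PlatformFL, "empty_cache", lambda: None):
    try:
        platform.empty_cache()
    except TypeError:
        broke = True  # lambda takes 0 args but received the instance

# Accepting any arguments keeps instance and class calls compatible:
with patch.object(PlatformFL, "empty_cache", lambda *args, **kwargs: None):
    works = (platform.empty_cache() is None
             and PlatformFL.empty_cache() is None)
```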
```diff
 ### Setup

-1. Install vllm from the official [v0.13.0](https://github.com/vllm-project/vllm/tree/v0.13.0) (optional if the correct version is installed) or from the fork [vllm-FL](https://github.com/flagos-ai/vllm-FL).
+1. Install vllm from the official [v0.18.1](https://github.com/vllm-project/vllm/tree/v0.18.1) (optional if the correct version is installed) or from the fork [vllm-FL](https://github.com/flagos-ai/vllm-FL).
```
PR title/description says upgrade to vLLM 0.18.0, but this change updates docs/source references to v0.18.1 (and several file headers also reference v0.18.1). Please align the stated target version (either update PR metadata to 0.18.1 or change the references back to 0.18.0) to avoid confusion about the required dependency version.
### PR Category
<!-- One of [Core | Vendor | OP | Tools | Others] -->
Vendor

### PR Type
<!-- One of [User Experience | New Features | Bug Fixes | Improvements | Performance | Breaking Change | Deprecations | Test Case | Docs | Others] -->
New Features

### Description
<!-- Describe what this PR does and why. -->
This pull request adds support for the MUSA hardware backend throughout the codebase, enabling vLLM-FL to run on MUSA devices with appropriate configuration, device handling, and operator dispatch. The main changes include platform detection, device context management, configuration updates, and backend selection logic for MUSA.

Platform and Device Support:
* Added detection and handling for the "musa" platform in platform utilities and device capability queries, including `is_cuda_alike`, `is_cuda`, and a new `is_musa` method in `platform.py` [[1]](diffhunk://#diff-e62f96d38d994f2068a59e290d710e4900afc1b54bd4f58334de77c01c233c57R64-R77) [[2]](diffhunk://#diff-e62f96d38d994f2068a59e290d710e4900afc1b54bd4f58334de77c01c233c57L107-R115) [[3]](diffhunk://#diff-e62f96d38d994f2068a59e290d710e4900afc1b54bd4f58334de77c01c233c57R153-R155) [[4]](diffhunk://#diff-e62f96d38d994f2068a59e290d710e4900afc1b54bd4f58334de77c01c233c57L337-R350) [[5]](diffhunk://#diff-55ef01748202be695636b2c430841f377ace4d833eddea714e620f6cac70fd9eL49-R49).
* Updated device context management in `flagcx.py` to use `torch.musa.device` when running on MUSA hardware.

Operator Dispatch and Configuration:
* Introduced a new `musa.yaml` dispatch configuration file specifying backend preferences, operator backend order, and blacklists for MUSA hardware.

Graph and Execution Support:
* Added support for `torch.musa.MUSAGraph` in the graph compilation logic to enable graph execution on MUSA devices.

These changes collectively ensure that vLLM-FL can detect, configure, and efficiently utilize MUSA hardware in a manner similar to CUDA and other supported platforms.
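The dispatch side of a config like this can be pictured as a backend-preference lookup. The dict below is a hypothetical Python mirror of the kind of entries a musa.yaml could carry; the key names and the `pick_backend` helper are illustrative, not the file's actual schema:

```python
# Hypothetical mirror of a dispatch config; key names are illustrative.
MUSA_DISPATCH = {
    "allow_vendors": ["musa"],
    "op_backend_order": ["vendor", "flaggems", "reference"],
    "op_blacklist": ["some_unsupported_op"],
}

def pick_backend(op_name, registered, config=MUSA_DISPATCH):
    # Walk the preference order and return the first backend that has
    # registered an implementation for this op; blacklisted ops get none.
    if op_name in config["op_blacklist"]:
        return None
    for backend in config["op_backend_order"]:
        if op_name in registered.get(backend, ()):
            return backend
    return None
```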
### Related Issues
<!-- Link any related issues: Fixes #issue, Closes #issue, or Related to #issue -->

### Changes
<!-- List the key changes made in this PR. -->
-

### Testing
<!-- How has this change been tested? Include test commands, hardware used, etc. -->
-

### Checklist
- [ ] I have run the existing tests and they pass
- [ ] I have added tests for my changes (if applicable)
- [ ] I have updated the documentation (if applicable)

---------

Co-authored-by: jiamingwang-mt <[email protected]>
Co-authored-by: Copilot Autofix powered by AI <[email protected]>
Co-authored-by: keennddyl <[email protected]>
Co-authored-by: hozier <[email protected]>
Co-authored-by: cyber-pioneer <[email protected]>
### PR Category
Others

### PR Type
Test Case

### Description
<!-- Describe what this PR does and why. -->
Add e2e precision test for BGE-M3 embedding model covering three pooling
modes:
- **Dense**: cosine similarity between query and passage embeddings
- **Lexical (sparse BM25)**: weighted token overlap score via
`/tokenize` + `/pooling` (task=token_classify)
- **ColBERT (multi-vector)**: MaxSim score via `/pooling`
(task=token_embed)
Also removes the outdated vLLM 0.13 backport note from README and
updates the BAAI/bge-m3 entry to point to the implementation.
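The three pooling modes reduce to three small scoring rules. The functions below are illustrative formulations written from the descriptions above; the actual test scores responses from the server's `/tokenize` and `/pooling` endpoints, and the lexical rule in particular is one common product-of-weights formulation, not necessarily the exact one the test uses:

```python
import math

def dense_score(q, p):
    # cosine similarity between two single-vector embeddings
    dot = sum(a * b for a, b in zip(q, p))
    return dot / (math.sqrt(sum(a * a for a in q))
                  * math.sqrt(sum(b * b for b in p)))

def lexical_score(q_weights, p_weights):
    # sparse match: sum of weight products over shared token ids
    return sum(w * p_weights[t] for t, w in q_weights.items()
               if t in p_weights)

def colbert_score(q_vecs, p_vecs):
    # MaxSim: for each query token vector, take the best dot product
    # against all passage token vectors, then sum over query tokens
    return sum(
        max(sum(a * b for a, b in zip(qv, pv)) for pv in p_vecs)
        for qv in q_vecs
    )
```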
### Related Issues
<!-- Link any related issues: Fixes #issue, Closes #issue, or Related to #issue -->
-

### Changes
<!-- List the key changes made in this PR. -->
- `tests/e2e_tests/serving/test_bge_m3.py`: New e2e test (179 lines) —
dense, lexical, ColBERT precision validation
- `README.md`: Remove vLLM 0.13 backport section; update BAAI/bge-m3 row
to link to implementation
### Testing
<!-- How has this change been tested? Include test commands, hardware used, etc. -->
- `pytest tests/e2e_tests/serving/test_bge_m3.py -v`
- Requires server running: `vllm serve BAAI/bge-m3 --hf-overrides
'{"architectures":["BgeM3EmbeddingModel"]}'`
### Checklist
- [x] I have run the existing tests and they pass
- [x] I have added tests for my changes
- [ ] I have updated the documentation
---------
Co-authored-by: ceci3 <[email protected]>
### PR Category
CICD

### PR Type
CI images and CI workflow configs

### Description
Catch up the versions of vLLM.

### Related Issues
No

### Changes
- dockerfiles
- ci workflow config

### Testing
The workflow is expected to be broken until the core changes land in vllm-fl.

### Checklist
### PR Category
Core

### PR Type
Improvements

### Description
Upgrade vLLM to 0.18.1.

### Related Issues

### Changes

### Testing

### Checklist