50 commits
- f475640 update 0.13.0 (ceci3, Dec 31, 2025)
- 25e1cd4 update readme (ceci3, Dec 31, 2025)
- 8e62ce2 polish code (ceci3, Jan 4, 2026)
- cddd6f9 comment gems attention (ceci3, Jan 7, 2026)
- 2864b24 add qwen3 next (ceci3, Jan 9, 2026)
- 80a7487 Add a dispatch mechanism. (xin2an, Jan 9, 2026)
- 4a749f0 Adjusting the Vendor multi-backend structure (xin2an, Jan 12, 2026)
- 129c125 Merge remote-tracking branch 'remotes/pr_author/update_v0130' into di… (xin2an, Jan 12, 2026)
- 754927d Add ascend support (xin2an, Jan 19, 2026)
- 4dd4ec0 Merge branch 'main' into dispatch_add_v0130 (xin2an, Jan 19, 2026)
- 575621d Delete unnecessary files. (xin2an, Jan 19, 2026)
- 9e85536 Modify copyright information (xin2an, Jan 19, 2026)
- 676cdd2 Modify the directory name (xin2an, Jan 19, 2026)
- ab5f772 Make modifications and adjustments based on the PR (pull request) fee… (xin2an, Jan 20, 2026)
- b5cc8d7 Adjust the code. (xin2an, Jan 20, 2026)
- 635c768 Place the attention mechanism in the `dispatch` directory. (xin2an, Jan 20, 2026)
- 3a086d8 Fixed bugs, added functionality to read configuration files in dispatch. (xin2an, Jan 23, 2026)
- ec58589 Merge branch 'main' into dispatch_add_v0130 (xin2an, Jan 23, 2026)
- a02a3a0 Modify the code based on PR feedback. (xin2an, Jan 24, 2026)
- 7322bfe Cancel the use of attention_backend in flagems (xin2an, Jan 24, 2026)
- 896a968 remove chinese (xin2an, Jan 25, 2026)
- d9cbf15 Merge branch 'flagos-ai:main' into dispatch_add_v0130 (xin2an, Jan 27, 2026)
- a8a736b Merge branch 'flagos-ai:main' into dispatch_add_v0130 (xin2an, Jan 29, 2026)
- de3a6b7 Merge branch 'flagos-ai:main' into dispatch_add_v0130 (xin2an, Jan 30, 2026)
- 1d090d2 Enable FlagGems attention backend with CUDA availability check (xin2an, Feb 3, 2026)
- 27d5405 Merge branch 'main' into dispatch_add_v0130 (xin2an, Feb 3, 2026)
- 3dd3242 [New Feature] Add platform-specific operator config for Ascend/CUDA (xin2an, Feb 4, 2026)
- 97f6e39 Update copyright year from 2025 to 2026 (xin2an, Feb 4, 2026)
- ff51f4c Merge branch 'flagos-ai:main' into dispatch_add_v0130 (xin2an, Feb 5, 2026)
- 0fb006a flaggems_blacklist > flagos_blacklist, add utils.py file (xin2an, Feb 5, 2026)
- 57f596d Delete TRITON_ATTN (xin2an, Feb 5, 2026)
- 965702d Merge branch 'flagos-ai:main' into dispatch_add_v0130 (xin2an, Feb 5, 2026)
- bbb4673 Merge branch 'flagos-ai:main' into dispatch_add_v0130 (xin2an, Feb 10, 2026)
- 511c8c5 Fix CUDA backend vendor detection to exclude CUDA-alike devices (MACA… (xin2an, Feb 10, 2026)
- 2b1bfe9 Fix attention_backend bug (xin2an, Feb 10, 2026)
- cdf2443 Modify CUDA backend vendor detection and add PTG configuration. (xin2an, Feb 10, 2026)
- c759f50 Add metax backend (xin2an, Feb 10, 2026)
- 467df01 delete auto_register.py (xin2an, Feb 10, 2026)
- a15dd97 Revised according to feedback (xin2an, Feb 11, 2026)
- 415eb15 instance > obj (xin2an, Feb 11, 2026)
- 45da20c Revised according to feedback (xin2an, Feb 11, 2026)
- 0037d27 Merge branch 'flagos-ai:main' into dispatch_add_v0130 (xin2an, Feb 11, 2026)
- 23137a3 The VLLM_FL_PLATFORM environment variable was deleted. The default co… (xin2an, Feb 11, 2026)
- a7fb4c4 Decouple op implementations from Backend classes and add dispatch_met… (xin2an, Feb 11, 2026)
- 06d3adb Merge branch 'flagos-ai:main' into dispatch_add_v0130 (xin2an, Feb 12, 2026)
- bae4991 Merge branch 'flagos-ai:main' into dispatch_add_v0130 (xin2an, Feb 24, 2026)
- 90fb24e Prevent CUDA detection for 'iluvatar' vendor (xin2an, Feb 28, 2026)
- 1167515 The Ascend platform adds a new Flagos blacklist operator. (xin2an, Feb 28, 2026)
- 6548741 Merge branch 'dispatch_add_v0130' of https://github.com/xin2an/vllm-p… (xin2an, Feb 28, 2026)
- fe0ed5b Merge branch 'flagos-ai:main' into dispatch_add_v0130 (xin2an, Mar 3, 2026)
15 changes: 4 additions & 11 deletions vllm_fl/dispatch/README.md
@@ -227,14 +227,11 @@ The system automatically detects hardware and loads the corresponding configuration

| Platform | Config File | Auto-Detection |
|----------|-------------|----------------|
| Ascend NPU | `config/ascend.yaml` | `torch.npu.is_available()` |
| NVIDIA GPU | `config/cuda.yaml` | `torch.cuda.is_available()` |
| Ascend NPU | `config/ascend.yaml` | `platform.vendor_name == 'ascend'` |
| NVIDIA GPU | `config/nvidia.yaml` | `platform.vendor_name == 'nvidia'` |
| METAX GPU | `config/metax.yaml` | `platform.vendor_name == 'metax'` |

You can force a specific platform using `VLLM_FL_PLATFORM` environment variable:
```bash
export VLLM_FL_PLATFORM=ascend # Force Ascend config
export VLLM_FL_PLATFORM=cuda # Force CUDA config
```
Platform detection is automatic based on `current_platform.vendor_name`.

### User-Specified Configuration File (YAML)

@@ -314,7 +311,6 @@ Environment variables can override specific items from platform config. If not s
|----------|---------|-------------|
| `VLLM_FL_PREFER_ENABLED` | `true` | Global switch. Set `false` to disable all dispatch features |
| `VLLM_FL_CONFIG` | (none) | Path to YAML config file (complete override) |
| `VLLM_FL_PLATFORM` | (auto) | Force platform: `ascend`, `cuda` |

#### Backend Selection

@@ -388,9 +384,6 @@ export VLLM_FL_PER_OP="rms_norm=vendor|flagos|reference"
# Use completely custom config file
export VLLM_FL_CONFIG=/path/to/my_config.yaml

# Force specific platform
export VLLM_FL_PLATFORM=ascend

# Enable debug logging
export VLLM_FL_LOG_LEVEL=DEBUG
```
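The platform table above maps `current_platform.vendor_name` to a config file. A minimal sketch of that selection step (a hypothetical helper, not the actual vllm_fl code; only the vendor names and file paths come from the table):

```python
# Hypothetical sketch: map a detected vendor name to its platform
# config file, mirroring the table in the README diff above.
_VENDOR_CONFIGS = {
    "ascend": "config/ascend.yaml",
    "nvidia": "config/nvidia.yaml",
    "metax": "config/metax.yaml",
}

def select_config(vendor_name: str) -> str:
    """Return the platform config path for a detected vendor."""
    try:
        return _VENDOR_CONFIGS[vendor_name]
    except KeyError:
        raise ValueError(f"unsupported vendor: {vendor_name!r}")

print(select_config("metax"))  # config/metax.yaml
```

With `VLLM_FL_PLATFORM` removed, a lookup like this (plus the `VLLM_FL_CONFIG` full-override path) is the whole selection story.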
14 changes: 14 additions & 0 deletions vllm_fl/dispatch/__init__.py
@@ -96,6 +96,7 @@
)
from .manager import OpManager, get_default_manager, reset_default_manager
from .ops import VLLMFLBackendBase
from .method_dispatch import dispatch_method
from .discovery import (
discover_plugins,
get_discovered_plugins,
@@ -106,6 +107,16 @@
from .logger_manager import get_logger, set_log_level


def call_method_op(op_name: str, instance, *args, **kwargs):
"""
Call an operator as a bound method on *instance*.

The resolved backend function receives *instance* as ``self``,
allowing it to freely access instance attributes.
"""
return get_default_manager().call_as_method(op_name, instance, *args, **kwargs)


def call_op(op_name: str, *args, **kwargs):
"""
Convenience function to call an operator through the default manager.
@@ -163,6 +174,9 @@ def resolve_op(op_name: str):
"reset_default_manager",
# Backend base
"VLLMFLBackendBase",
# Method dispatch
"dispatch_method",
"call_method_op",
# Plugin discovery
"discover_plugins",
"get_discovered_plugins",
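The new `call_method_op` resolves an operator and passes the calling instance through as `self`. A self-contained sketch of that pattern (toy registry; only the `call_method_op` signature mirrors the diff, everything else is illustrative):

```python
# Toy registry illustrating method-style dispatch; only the
# call_method_op signature mirrors the real vllm_fl helper.
_OPS: dict = {}

def register_op(name: str):
    """Decorator that records a function as the implementation of *name*."""
    def deco(fn):
        _OPS[name] = fn
        return fn
    return deco

def call_method_op(op_name: str, instance, *args, **kwargs):
    # The resolved function receives *instance* as its first argument,
    # so it can read instance attributes exactly like a bound method.
    return _OPS[op_name](instance, *args, **kwargs)

@register_op("scale")
def scale_impl(self, xs):
    # A free function, yet it reads self.factor like a method would.
    return [v * self.factor for v in xs]

class Layer:
    def __init__(self, factor):
        self.factor = factor

print(call_method_op("scale", Layer(2), [1, 2, 3]))  # [2, 4, 6]
```

This is why the impl functions in this PR take a leading `self`/`obj` parameter even though they are plain module-level functions.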
78 changes: 1 addition & 77 deletions vllm_fl/dispatch/backends/flaggems/flaggems.py
@@ -8,7 +8,7 @@

from __future__ import annotations

from typing import Optional, Union
from typing import Optional

import torch

@@ -42,82 +42,6 @@ def is_available(self) -> bool:

# ==================== Operator Implementations ====================

def silu_and_mul(self, obj, x: torch.Tensor) -> torch.Tensor:
"""
SiLU activation followed by element-wise multiplication.

Args:
obj: The calling obj (for interface consistency)
x: Input tensor of shape [..., 2*d]

Returns:
Output tensor of shape [..., d]
"""
from .impl.activation import silu_and_mul_flaggems

return silu_and_mul_flaggems(obj, x)

def rms_norm(
self,
obj,
x: torch.Tensor,
residual: Optional[torch.Tensor] = None,
) -> Union[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]:
"""
RMS normalization.

Args:
obj: The calling obj (e.g., RMSNorm layer)
x: Input tensor
residual: Optional residual tensor

Returns:
Normalized tensor, or tuple of (normalized, residual) if residual is provided
"""
from .impl.normalization import rms_norm_flaggems

return rms_norm_flaggems(obj, x, residual)

def rotary_embedding(
self,
obj,
query: torch.Tensor,
key: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
position_ids: torch.Tensor,
rotary_interleaved: bool = False,
inplace: bool = True,
) -> tuple[torch.Tensor, torch.Tensor]:
"""
Apply rotary position embedding.

Args:
obj: The calling obj (for interface consistency)
query: Query tensor
key: Key tensor
cos: Cosine cache
sin: Sine cache
position_ids: Position indices
rotary_interleaved: Whether to use interleaved rotary
inplace: Whether to modify tensors in-place

Returns:
Tuple of (embedded_query, embedded_key)
"""
from .impl.rotary import rotary_embedding_flaggems

return rotary_embedding_flaggems(
obj,
query,
key,
cos,
sin,
position_ids,
rotary_interleaved=rotary_interleaved,
inplace=inplace,
)

def attention_backend(self, use_mla: bool = False) -> str:
"""
Get the attention backend class path for FlagGems.
4 changes: 2 additions & 2 deletions vllm_fl/dispatch/backends/flaggems/impl/activation.py
@@ -9,12 +9,12 @@
import torch


def silu_and_mul_flaggems(obj, x: torch.Tensor) -> torch.Tensor:
def silu_and_mul_flaggems(self, x: torch.Tensor) -> torch.Tensor:
"""
SiLU activation followed by element-wise multiplication using FlagGems.

Args:
obj: The calling obj (for interface consistency)
self: The calling instance (for interface consistency)
x: Input tensor of shape [..., 2*d]

Returns:
10 changes: 5 additions & 5 deletions vllm_fl/dispatch/backends/flaggems/impl/normalization.py
@@ -12,15 +12,15 @@


def rms_norm_flaggems(
obj,
self,
x: torch.Tensor,
residual: Optional[torch.Tensor] = None,
) -> Union[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]:
"""
RMS normalization using FlagGems.

Args:
obj: The calling obj (e.g., RMSNorm layer)
self: The calling instance (e.g., RMSNorm layer)
x: Input tensor
residual: Optional residual tensor

@@ -29,8 +29,8 @@ def rms_norm_flaggems(
"""
from flag_gems.modules.normalization import gems_rms_forward

# Get weight and epsilon from obj
weight = obj.weight
epsilon = obj.variance_epsilon
# Get weight and epsilon from self
weight = self.weight
epsilon = self.variance_epsilon

return gems_rms_forward(x, residual, weight, epsilon)
4 changes: 2 additions & 2 deletions vllm_fl/dispatch/backends/flaggems/impl/rotary.py
@@ -10,7 +10,7 @@


def rotary_embedding_flaggems(
obj,
self,
query: torch.Tensor,
key: torch.Tensor,
cos: torch.Tensor,
@@ -23,7 +23,7 @@
Apply rotary position embedding using FlagGems.

Args:
obj: The calling obj (for interface consistency)
self: The calling instance (for interface consistency)
query: Query tensor
key: Key tensor
cos: Cosine cache
11 changes: 7 additions & 4 deletions vllm_fl/dispatch/backends/flaggems/register_ops.py
@@ -34,6 +34,9 @@ def register_builtins(registry) -> None:
registry: Registry to register into
"""
from .flaggems import FlagGemsBackend
from .impl.activation import silu_and_mul_flaggems
from .impl.normalization import rms_norm_flaggems
from .impl.rotary import rotary_embedding_flaggems

backend = FlagGemsBackend()
is_avail = backend.is_available
@@ -44,7 +47,7 @@
op_name="silu_and_mul",
impl_id="default.flagos",
kind=BackendImplKind.DEFAULT,
fn=_bind_is_available(backend.silu_and_mul, is_avail),
fn=_bind_is_available(silu_and_mul_flaggems, is_avail),
vendor=None,
priority=BackendPriority.DEFAULT,
),
@@ -53,7 +56,7 @@
op_name="rms_norm",
impl_id="default.flagos",
kind=BackendImplKind.DEFAULT,
fn=_bind_is_available(backend.rms_norm, is_avail),
fn=_bind_is_available(rms_norm_flaggems, is_avail),
vendor=None,
priority=BackendPriority.DEFAULT,
),
@@ -62,11 +65,11 @@
op_name="rotary_embedding",
impl_id="default.flagos",
kind=BackendImplKind.DEFAULT,
fn=_bind_is_available(backend.rotary_embedding, is_avail),
fn=_bind_is_available(rotary_embedding_flaggems, is_avail),
vendor=None,
priority=BackendPriority.DEFAULT,
),
# Attention Backend
# Attention Backend (no instance binding needed)
OpImpl(
op_name="attention_backend",
impl_id="default.flagos",
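The registration above binds each implementation to an availability probe via `_bind_is_available`, so dispatch can fall past backends that are not usable on the current host. A simplified, self-contained sketch of that gating (the `OpImpl`, `register`, and `resolve` here are toy stand-ins, not the real vllm_fl classes):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class OpImpl:
    """Toy stand-in for the real OpImpl record."""
    op_name: str
    fn: Callable
    priority: int = 0

def _bind_is_available(fn, is_avail):
    # Attach the backend's availability probe to the implementation so
    # the resolver can skip backends that cannot run on this host.
    fn.is_available = is_avail
    return fn

_registry: dict = {}

def register(impl: OpImpl) -> None:
    _registry.setdefault(impl.op_name, []).append(impl)

def resolve(op_name: str) -> Callable:
    # Pick the highest-priority implementation whose backend is available.
    for impl in sorted(_registry[op_name], key=lambda i: -i.priority):
        if impl.fn.is_available():
            return impl.fn
    raise RuntimeError(f"no available backend for {op_name!r}")

register(OpImpl("rms_norm", _bind_is_available(lambda: "flagos", lambda: False), priority=10))
register(OpImpl("rms_norm", _bind_is_available(lambda: "reference", lambda: True), priority=0))
print(resolve("rms_norm")())  # reference
```

Registering the free functions (`rms_norm_flaggems`, …) instead of bound methods, as this hunk does, keeps the registry decoupled from any backend instance.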
4 changes: 2 additions & 2 deletions vllm_fl/dispatch/backends/reference/impl/activation.py
@@ -10,12 +10,12 @@
import torch.nn.functional as F


def silu_and_mul_torch(obj, x: torch.Tensor) -> torch.Tensor:
def silu_and_mul_torch(self, x: torch.Tensor) -> torch.Tensor:
"""
SiLU activation followed by element-wise multiplication using PyTorch.

Args:
obj: The calling obj (for interface consistency)
self: The calling instance (for interface consistency)
x: Input tensor of shape [..., 2*d]

Returns:
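Per its docstring, `silu_and_mul` splits the last dimension in half and multiplies SiLU of the first half by the second. The same math on a plain Python list (an illustrative stand-in; the real code operates on tensors):

```python
import math

def silu_and_mul(x):
    """SiLU(x[:d]) * x[d:] for a flat list of length 2*d -- a pure-Python
    stand-in for the tensor version in the diff above."""
    d = len(x) // 2
    gate, up = x[:d], x[d:]
    # SiLU(v) = v * sigmoid(v) = v / (1 + exp(-v))
    silu = [v / (1.0 + math.exp(-v)) for v in gate]
    return [s * u for s, u in zip(silu, up)]

out = silu_and_mul([0.0, 1.0, 2.0, 3.0])  # gate = [0, 1], up = [2, 3]
print(out[0])  # 0.0  (SiLU(0) = 0)
```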
10 changes: 5 additions & 5 deletions vllm_fl/dispatch/backends/reference/impl/normalization.py
@@ -12,24 +12,24 @@


def rms_norm_torch(
obj,
self,
x: torch.Tensor,
residual: Optional[torch.Tensor] = None,
) -> Union[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]:
"""
RMS normalization using PyTorch.

Args:
obj: The calling obj (e.g., RMSNorm layer)
self: The calling instance (e.g., RMSNorm layer)
x: Input tensor
residual: Optional residual tensor

Returns:
Normalized tensor, or tuple of (normalized, residual) if residual is provided
"""
# Get weight and epsilon from obj
weight = obj.weight
epsilon = obj.variance_epsilon
# Get weight and epsilon from self
weight = self.weight
epsilon = self.variance_epsilon

if residual is not None:
x = x + residual
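The hunk above cuts off right after the residual is added. The full reference semantics can be sketched in plain Python (illustrative; the real code uses tensors and reads `weight`/`variance_epsilon` from the layer instance):

```python
import math

def rms_norm(x, weight, eps, residual=None):
    """Plain-Python RMS norm with the same residual semantics as the
    reference impl: add the residual first, then normalize."""
    if residual is not None:
        x = [a + b for a, b in zip(x, residual)]
        residual = x  # the updated residual is returned alongside the output
    # Normalize by the root mean square (epsilon inside the sqrt), then
    # scale by the learned weight.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    out = [v / rms * w for v, w in zip(x, weight)]
    return out if residual is None else (out, residual)

# Without residual: returns just the normalized values.
print(rms_norm([3.0, 4.0], [1.0, 1.0], 0.0))
# With residual: returns (normalized, updated_residual).
out, res = rms_norm([1.0], [1.0], 0.0, residual=[1.0])
print(res)  # [2.0]
```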
4 changes: 2 additions & 2 deletions vllm_fl/dispatch/backends/reference/impl/rotary.py
@@ -10,7 +10,7 @@


def rotary_embedding_torch(
obj,
self,
query: torch.Tensor,
key: torch.Tensor,
cos: torch.Tensor,
@@ -23,7 +23,7 @@
Apply rotary position embedding using PyTorch.

Args:
obj: The calling obj (for interface consistency)
self: The calling instance (for interface consistency)
query: Query tensor [batch, num_heads, seq_len, head_dim] or [seq_len, num_heads, head_dim]
key: Key tensor [batch, num_heads, seq_len, head_dim] or [seq_len, num_heads, head_dim]
cos: Cosine cache [max_seq_len, rotary_dim] where rotary_dim = head_dim or head_dim // 2
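The rotary reference rotates each `(x1, x2)` pair by the cached `cos`/`sin` angles. A plain-Python sketch of the non-interleaved (half-split) layout that the `rotary_interleaved=False` default implies (illustrative stand-in for the tensor version):

```python
def apply_rotary(vec, cos, sin):
    """Rotate one head vector with precomputed cos/sin, using the
    non-interleaved half-split layout: pair element i of the first half
    with element i of the second half. Illustrative, not the vllm_fl code."""
    d = len(vec) // 2
    x1, x2 = vec[:d], vec[d:]
    # Standard 2-D rotation applied to each (x1[i], x2[i]) pair.
    out_first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    out_second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return out_first + out_second

# Rotating by 90 degrees (cos=0, sin=1) maps the pair (1, 0) to (0, 1).
print(apply_rotary([1.0, 0.0], [0.0], [1.0]))  # [0.0, 1.0]
```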