4 changes: 4 additions & 0 deletions README.md
@@ -22,6 +22,7 @@ The platforms and engines in this repository are **reference implementations**
| Intel XPU | Data Center GPU Max / Arc | xccl (oneCCL) | ✅ Example (requires vendor support) | TBD |
| Cambricon MLU | MLU370 / MLU590 | CNCL | ✅ Supported | [User Guide](docs/user_guide_mlu/README.md) |
| MetaX | MetaX GPUs (CUDA-compatible) | NCCL | ✅ Example (requires vendor support) | TBD |
| Enflame GCU | GCU | ECCL / FlagCX | ✅ Example (requires vendor support) | [User Guide](docs/user_guide_enflame/README.md) |
| Huawei NPU | Ascend 910B | HCCL | Built-in (verl core) | [Ascend Tutorial](https://github.com/verl-project/verl/tree/main/docs/ascend_tutorial) |


@@ -47,11 +48,13 @@ verl-FL (main framework)
├── PlatformRegistry.register("intel") → PlatformXPU
├── PlatformRegistry.register("cambricon")→ PlatformMLU
├── PlatformRegistry.register("metax") → PlatformMetaX
├── PlatformRegistry.register("enflame") → PlatformENFLAME
├── PlatformRegistry.register("flagos") → PlatformFlagOS
├── EngineRegistry.register(device="xpu", vendor="intel")
├── EngineRegistry.register(device="mlu", vendor="cambricon")
├── EngineRegistry.register(device="cuda", vendor="metax")
├── EngineRegistry.register(device="gcu", vendor="enflame")
└── EngineRegistry.register(device="cuda", vendor="flagos")
```
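
The registration pattern in the tree above can be illustrated with a minimal sketch. `ToyEngineRegistry`, `ToyEnflameEngine`, and their method names are hypothetical and are not verl's actual `EngineRegistry` API; the sketch only shows how a decorator can key engine classes by a `(device, vendor)` pair. The `"gcu"` device string follows the engine decorators in `fsdp_enflame.py` from this change.

```python
from typing import Callable, Dict, Tuple, Type


class ToyEngineRegistry:
    """Hypothetical stand-in for a registry keyed by (device, vendor)."""

    _engines: Dict[Tuple[str, str], Type] = {}

    @classmethod
    def register(cls, device: str, vendor: str) -> Callable[[Type], Type]:
        """Return a class decorator that stores the class under (device, vendor)."""

        def decorator(engine_cls: Type) -> Type:
            cls._engines[(device, vendor)] = engine_cls
            return engine_cls

        return decorator

    @classmethod
    def get(cls, device: str, vendor: str) -> Type:
        """Look up the engine class registered for (device, vendor)."""
        return cls._engines[(device, vendor)]


# Importing the module that contains this class is what triggers registration.
@ToyEngineRegistry.register(device="gcu", vendor="enflame")
class ToyEnflameEngine:
    """Placeholder engine class used only for this illustration."""
```

Because registration happens at import time, a registry that is reloaded after the plugin was imported loses its entries, which is presumably why the plugin re-imports its engine modules when the lookup key is missing.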

@@ -83,6 +86,7 @@ Each hardware platform provides a standalone user guide (following the structure
- **[Cambricon MLU](docs/user_guide_mlu/README.md)** — Cambricon MLU370 / MLU590 user guide
- **[MetaX GPU](docs/user_guide_metax/README.md)** — MetaX GPU user guide
- **[FlagOS](docs/user_guide_flagos/README.md)** — FlagOS unified heterogeneous engine user guide ([NVIDIA](docs/user_guide_flagos/nvidia/README.md))
- **[Enflame GCU](docs/user_guide_enflame/README.md)** — Enflame GCU user guide

### Developer Guides

2 changes: 2 additions & 0 deletions docs/development.md
@@ -689,6 +689,7 @@ Existing reference implementations:
- `docs/user_guide_mlu/` — Cambricon MLU
- `docs/user_guide_metax/` — MetaX
- `docs/user_guide_flagos/` — FlagOS
- `docs/user_guide_enflame/` — Enflame GCU

> **Tip**: Refer to `verl/docs/ascend_tutorial` (Huawei NPU) for documentation quality and coverage expectations. That tutorial covers installation, quick start, advanced features, performance tuning, precision analysis, and FAQ.

@@ -898,6 +899,7 @@ The following files in this repository serve as examples:
| Intel XPU | `platforms/platform_xpu.py` | `engines/fsdp_xpu.py`, `engines/megatron_xpu.py` |
| Cambricon MLU | `platforms/platform_mlu.py` | `engines/fsdp_mlu.py`, `engines/megatron_mlu.py` |
| MetaX | `platforms/platform_cuda_metax.py` | `engines/fsdp_metax.py`, `engines/megatron_metax.py` |
| Enflame GCU | `platforms/platform_enflame.py` | `engines/fsdp_enflame.py`, `engines/megatron_enflame.py` |

---

55 changes: 55 additions & 0 deletions docs/user_guide_enflame/README.md
@@ -0,0 +1,55 @@
# Enflame GCU User Guide

Last updated: 06/22/2026.

Review comment (@heavyrain-lzy, Collaborator, Jun 24, 2026):

The user guide is too brief. See #3 for reference, and make sure users can start training by following the instructions here.

## Introduction

This document describes how to use verl for reinforcement learning training on Enflame GCU accelerators via `torch_gcu` and ECCL/FlagCX communication.

## Platform Summary

| Item | Description |
|------|-------------|
| Device type | `gcu` |
| Vendor identifier | `enflame` |
| PyTorch API | `torch.gcu` (via `torch_gcu`) |
| Communication backend | `eccl` (default) or `flagcx` (when `USE_FLAGCX=1`) |
| Device visibility env var | `TOPS_VISIBLE_DEVICES` |
| Ray resource name | `GPU` (built-in) |
| IPC support | No (use device tensor path for weight transfer; Python SHM unsupported) |

Review comment (Collaborator):

I have added end-to-end (E2E) validation coverage in #5. Please run the script at https://github.com/verl-project/verl-hardware-plugin/blob/main/scripts/baseline_grpo_gsm8k.sh and compare your results with the reference run at https://swanlab.cn/@heavyrain/verl_grpo_gsm8k_math/runs/8h196r8o/chart

## Environment Variables

```bash
export VERL_PLATFORM=enflame
export TOPS_VISIBLE_DEVICES=0,1,2,3
export RAY_EXPERIMENTAL_NOSET_TOPS_VISIBLE_DEVICES=1
export USE_FLAGCX=0 # use ECCL on a homogeneous Enflame cluster
```
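
As a rough illustration of how `USE_FLAGCX` maps to a backend name, here is a minimal sketch. `select_backend` is a hypothetical helper, not the plugin's `PlatformENFLAME.communication_backend_name()`; it only mirrors the behaviour the registration tests in this change assert (`eccl` by default, `flagcx` when `USE_FLAGCX=1`). Whether the real implementation accepts other truthy values is not shown in the source, so the sketch treats only the exact string `"1"` as enabling FlagCX.

```python
import os
from typing import Mapping, Optional


def select_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a communication backend name from environment variables.

    Hypothetical helper: returns "flagcx" only when USE_FLAGCX is exactly "1",
    otherwise falls back to "eccl".
    """
    env = os.environ if env is None else env
    return "flagcx" if env.get("USE_FLAGCX") == "1" else "eccl"
```

Passing a mapping explicitly, as in `select_backend({"USE_FLAGCX": "1"})`, keeps the check easy to test without touching the process environment.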

When using **upstream verl with `verl_hardware_plugin`** (rather than the platform built into verl-FL),
Ray workers do not inherit shell exports. Pass the variables as Hydra overrides under `ray_init.runtime_env` instead:

```bash
+ray_kwargs.ray_init.runtime_env.env_vars.VERL_PLATFORM='enflame'
+ray_kwargs.ray_init.runtime_env.env_vars.VERL_USE_EXTERNAL_MODULES='verl_hardware_plugin'
+ray_kwargs.ray_init.runtime_env.env_vars.RAY_EXPERIMENTAL_NOSET_TOPS_VISIBLE_DEVICES='1'
```
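
If the overrides are assembled by a launcher script, they can be generated from a single dictionary so the variable list stays in one place. The following is a small sketch under that assumption; `ENFLAME_ENV_VARS` and `hydra_env_overrides` are hypothetical names, and only the override prefix and the three variables come from the block above.

```python
from typing import Dict, List

# Variables that Ray workers need, copied from the overrides listed above.
ENFLAME_ENV_VARS: Dict[str, str] = {
    "VERL_PLATFORM": "enflame",
    "VERL_USE_EXTERNAL_MODULES": "verl_hardware_plugin",
    "RAY_EXPERIMENTAL_NOSET_TOPS_VISIBLE_DEVICES": "1",
}


def hydra_env_overrides(env_vars: Dict[str, str]) -> List[str]:
    """Render each variable as a `+ray_kwargs...env_vars.NAME='value'` override."""
    prefix = "+ray_kwargs.ray_init.runtime_env.env_vars."
    return [f"{prefix}{name}='{value}'" for name, value in env_vars.items()]


overrides = hydra_env_overrides(ENFLAME_ENV_VARS)
```

The resulting strings would be appended to the training command line; the trainer entry point itself is outside the scope of this sketch.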

Verify before training:

```bash
python -c "import verl_hardware_plugin; import verl; from verl.plugin.platform import get_platform; print(get_platform().device_name)"
# Expected: gcu (not cpu)
```

## Notes

- `torch_gcu` may patch `torch.cuda.is_available()`; platform auto-detection probes `torch.gcu` before CUDA.
- FlagCX Stream compatibility is handled in `PlatformENFLAME.ensure_initialized()`.
- For Migration-based runtime patches, install the Migration package before importing verl.
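
The probe order in the first note matters because a patched `torch.cuda.is_available()` could otherwise cause a GCU machine to be detected as CUDA. A minimal sketch of that ordering follows; `detect_device_name` is a hypothetical function, not verl's detection code, and the assumption that `torch.gcu` exposes `is_available()` is not confirmed by the source. A fake module stands in for `torch` so the sketch runs without `torch_gcu` installed.

```python
from types import SimpleNamespace
from typing import Any


def detect_device_name(torch_module: Any) -> str:
    """Hypothetical probe order: GCU first, then CUDA, then CPU."""
    gcu = getattr(torch_module, "gcu", None)
    # Assumption: the GCU namespace exposes is_available(), like torch.cuda.
    if gcu is not None and gcu.is_available():
        return "gcu"
    cuda = getattr(torch_module, "cuda", None)
    if cuda is not None and cuda.is_available():
        return "cuda"
    return "cpu"


# Fake torch module where both probes report True, mimicking a patched
# torch.cuda.is_available() on a GCU machine.
fake_torch = SimpleNamespace(
    gcu=SimpleNamespace(is_available=lambda: True),
    cuda=SimpleNamespace(is_available=lambda: True),
)
detected = detect_device_name(fake_torch)
```

With the order reversed, the same fake module would be reported as `cuda`, which is the misdetection the note describes.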

## Related Documentation

- [FlagOS User Guide](../user_guide_flagos/README.md)
- [Development Guide](../development.md)
70 changes: 70 additions & 0 deletions tests/test_plugin_registration.py
@@ -66,6 +66,56 @@ def test_mlu_detection_with_env(self):
with mock.patch.dict(os.environ, {"VERL_PLATFORM": "cambricon"}):
assert _detect_platform_name() == "cambricon"

    def test_enflame_registered(self):
        from verl.plugin.platform.platform_manager import PlatformRegistry
        from verl_hardware_plugin.platforms.platform_enflame import PlatformENFLAME  # noqa: F401

        assert "enflame" in PlatformRegistry.registered_names()
        cls = PlatformRegistry.get("enflame")
        assert cls is PlatformENFLAME

    def test_enflame_detection_with_env(self):
        from verl.plugin.platform.platform_manager import _detect_platform_name
        from verl_hardware_plugin.platforms.platform_enflame import PlatformENFLAME  # noqa: F401

        with _fresh_registries():
            with mock.patch.dict(os.environ, {"VERL_PLATFORM": "enflame"}):
                assert _detect_platform_name() == "enflame"

    def test_enflame_device_and_vendor_names(self):
        from verl_hardware_plugin.platforms.platform_enflame import PlatformENFLAME

        platform = PlatformENFLAME()
        assert platform.device_name == "gcu"
        assert platform.vendor_name == "enflame"

    def test_enflame_gcu_ipc_collect_shim(self):
        from types import ModuleType
        from unittest import mock

        import verl_hardware_plugin.platforms.platform_enflame as platform_enflame

        fake_gcu = ModuleType("gcu")
        old_patched = platform_enflame._gcu_runtime_patched
        try:
            platform_enflame._gcu_runtime_patched = False
            with mock.patch.object(platform_enflame, "_ensure_torch_gcu", return_value=True):
                with mock.patch.object(platform_enflame.torch, "gcu", fake_gcu, create=True):
                    module = platform_enflame._get_gcu_module()
                    assert module is fake_gcu
                    assert callable(module.ipc_collect)
                    module.ipc_collect()
        finally:
            platform_enflame._gcu_runtime_patched = old_patched

    def test_enflame_communication_backend(self):
        from verl_hardware_plugin.platforms.platform_enflame import PlatformENFLAME

        with mock.patch.dict(os.environ, {}, clear=True):
            assert PlatformENFLAME().communication_backend_name() == "eccl"
        with mock.patch.dict(os.environ, {"USE_FLAGCX": "1"}, clear=False):
            assert PlatformENFLAME().communication_backend_name() == "flagcx"

def test_metax_detection_with_env(self):
from verl.plugin.platform.platform_manager import _detect_platform_name
from verl_hardware_plugin.platforms.platform_cuda_metax import PlatformMetaX # noqa: F401
@@ -148,6 +198,26 @@ def test_megatron_metax_engine_registered(self):
assert EngineRegistry._engines["language_model"]["megatron"][("cuda", "metax")] is MegatronMetaXEngineWithLMHead


    def test_fsdp_enflame_engines_registered(self):
        from verl.workers.engine.base import EngineRegistry
        from verl_hardware_plugin.engines.fsdp_enflame import (
            FSDPEnflameEngineWithLMHead,
            FSDPEnflameEngineWithValueHead,
        )

        assert EngineRegistry._engines["language_model"]["fsdp"][("gcu", "enflame")] is FSDPEnflameEngineWithLMHead
        assert EngineRegistry._engines["value_model"]["fsdp"][("gcu", "enflame")] is FSDPEnflameEngineWithValueHead

    def test_megatron_enflame_engine_registered(self):
        from verl.workers.engine.base import EngineRegistry
        from verl_hardware_plugin.engines.megatron_enflame import MegatronEnflameEngineWithLMHead

        assert (
            EngineRegistry._engines["language_model"]["megatron"][("gcu", "enflame")]
            is MegatronEnflameEngineWithLMHead
        )


class TestFLEnvManager:
"""Test FLEnvManager utility."""

2 changes: 1 addition & 1 deletion verl_hardware_plugin/__init__.py
@@ -3,7 +3,7 @@

"""verl hardware plugin - Multi-chip platform and engine support.

-This package registers hardware platforms (MetaX, XPU, MLU) and their
+This package registers hardware platforms (MetaX, XPU, MLU, Enflame GCU) and their
corresponding training engines with verl's plugin system.

Discovered automatically via setuptools entry_points (verl.plugins group).
56 changes: 56 additions & 0 deletions verl_hardware_plugin/engines/__init__.py
@@ -32,6 +32,58 @@
logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN"))


def enflame_fsdp_engine_registered() -> bool:
    """Return True when Enflame FSDP engines are present on the active EngineRegistry."""
    try:
        from verl.workers.engine.base import EngineRegistry

        registry = EngineRegistry._engines.get("language_model", {}).get("fsdp", {})
        return ("gcu", "enflame") in registry or ("enflame", "enflame") in registry
    except Exception:
        return False


def ensure_enflame_engines_registered() -> None:
    """Register Enflame engines on the current EngineRegistry if missing.

    Migration may reload ``verl.workers.engine.base`` after the plugin first
    imported, which clears ``EngineRegistry._engines``. Re-import engine modules
    when lookup keys are absent.
    """
    enflame_required = os.getenv("VERL_PLATFORM", "").strip().lower() == "enflame"

    if not enflame_fsdp_engine_registered():
        try:
            from verl_hardware_plugin.engines import fsdp_enflame  # noqa: F401

            logger.info("Registered engines: fsdp_enflame")
        except Exception as e:
            if enflame_required:
                logger.error("Failed to register Enflame FSDP engines (required): %s", e)
                raise
            logger.debug("ENFLAME FSDP engines not registered: %s", e)

    try:
        from verl.workers.engine.base import EngineRegistry

        megatron_registry = EngineRegistry._engines.get("language_model", {}).get("megatron", {})
        if ("gcu", "enflame") not in megatron_registry and ("enflame", "enflame") not in megatron_registry:
            from verl_hardware_plugin.engines import megatron_enflame  # noqa: F401

            logger.info("Registered engines: megatron_enflame")
    except Exception as e:
        if enflame_required:
            logger.error("Failed to register Enflame Megatron engines (required): %s", e)
            raise
        logger.debug("ENFLAME Megatron engines not registered: %s", e)

    if enflame_required and not enflame_fsdp_engine_registered():
        raise RuntimeError(
            "Enflame FSDP engine is not registered after ensure_enflame_engines_registered(). "
            "Set VERL_LOGGING_LEVEL=DEBUG and check fsdp_enflame import errors."
        )


def register_all_engines():
"""Import all engine modules to trigger their @register decorators.

@@ -105,3 +157,7 @@ def register_all_engines():
        logger.info("Registered engines: megatron_metax")
    except Exception as e:
        logger.debug("MetaX Megatron engines not registered: %s", e)

    # Enflame GCU engines (ECCL/FlagCX communication)
    ensure_enflame_engines_registered()

Review comment (Collaborator):

It's not required.


53 changes: 53 additions & 0 deletions verl_hardware_plugin/engines/fsdp_enflame.py
@@ -0,0 +1,53 @@
# Copyright (c) 2026 BAAI. All rights reserved.
# Licensed under the Apache License, Version 2.0.

"""FSDP engine for Enflame GCU devices."""

import logging
import os

from verl.trainer.config import CheckpointConfig
from verl.workers.config import FSDPEngineConfig, FSDPOptimizerConfig, HFModelConfig
from verl.workers.engine.base import EngineRegistry
from verl.workers.engine.fsdp import FSDPEngineWithLMHead
from verl.workers.engine.fsdp.transformer_impl import FSDPEngineWithValueHead

logger = logging.getLogger(__name__)
logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN"))


@EngineRegistry.register(model_type="language_model", backend=["fsdp", "fsdp2"], device="gcu", vendor="enflame")
class FSDPEnflameEngineWithLMHead(FSDPEngineWithLMHead):
    """FSDP Engine for Enflame GCU with ECCL/FlagCX communication backend."""

    def __init__(
        self,
        model_config: HFModelConfig,
        engine_config: FSDPEngineConfig,
        optimizer_config: FSDPOptimizerConfig,
        checkpoint_config: CheckpointConfig,
    ):
        super().__init__(model_config, engine_config, optimizer_config, checkpoint_config)
        logger.info("FSDPEnflameEngineWithLMHead initialized")

    def initialize(self):
        super().initialize()
        logger.info("FSDPEnflameEngineWithLMHead initialized for ENFLAME")


@EngineRegistry.register(model_type="value_model", backend=["fsdp", "fsdp2"], device="gcu", vendor="enflame")
class FSDPEnflameEngineWithValueHead(FSDPEngineWithValueHead):
    """FSDP Engine for Enflame GCU value model training."""

    def __init__(
        self,
        model_config: HFModelConfig,
        engine_config: FSDPEngineConfig,
        optimizer_config: FSDPOptimizerConfig,
        checkpoint_config: CheckpointConfig,
    ):
        super().__init__(model_config, engine_config, optimizer_config, checkpoint_config)
        logger.info("FSDPEnflameEngineWithValueHead initialized")

    def initialize(self):
        super().initialize()
22 changes: 22 additions & 0 deletions verl_hardware_plugin/engines/megatron_enflame.py
@@ -0,0 +1,22 @@
# Copyright (c) 2026 BAAI. All rights reserved.
# Licensed under the Apache License, Version 2.0.

"""Megatron engine for Enflame GCU devices."""

import logging
import os

from verl.workers.engine.base import EngineRegistry
from verl.workers.engine.megatron.transformer_impl import MegatronEngineWithLMHead

logger = logging.getLogger(__name__)
logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN"))


@EngineRegistry.register(model_type="language_model", backend="megatron", device="gcu", vendor="enflame")
class MegatronEnflameEngineWithLMHead(MegatronEngineWithLMHead):
    """Megatron Engine for Enflame GCU with ECCL/FlagCX communication backend."""

    def initialize(self):
        super().initialize()
        logger.info("MegatronEnflameEngineWithLMHead initialized for ENFLAME")
9 changes: 9 additions & 0 deletions verl_hardware_plugin/platforms/__init__.py
@@ -62,3 +62,12 @@ def register_all_platforms():
        logger.info("Registered platform: metax (cuda)")
    except Exception as e:
        logger.debug("MetaX platform not registered: %s", e)

    # Enflame GCU — requires torch_gcu
    try:
        from verl_hardware_plugin.platforms import platform_enflame  # noqa: F401

        logger.info("Registered platform: enflame (gcu)")
    except Exception as e:
        logger.debug("ENFLAME platform not registered: %s", e)
