feat(enflame): add GCU platform, engines, and runtime shims for verl 0.9 #6
base: main
Changes from all commits
a7a2da1
3183344
3d575ec
c9132aa
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| # Enflame GCU User Guide | ||
|
|
||
| Last updated: 06/22/2026. | ||
|
|
||
| ## Introduction | ||
|
|
||
| This document describes how to use verl for reinforcement learning training on Enflame GCU accelerators via `torch_gcu` and ECCL/FlagCX communication. | ||
|
|
||
| ## Platform Summary | ||
|
|
||
| | Item | Description | | ||
| |------|-------------| | ||
| | Device type | `enflame` | | ||
| | Vendor identifier | `enflame` | | ||
| | PyTorch API | `torch.gcu` (via `torch_gcu`) | | ||
| | Communication backend | `eccl` (default) or `flagcx` (when `USE_FLAGCX=1`) | | ||
| | Device visibility env var | `TOPS_VISIBLE_DEVICES` | | ||
| | Ray resource name | `GPU` (built-in) | | ||
|
Collaborator
I have added end-to-end (E2E) validation coverage in #5. Please follow the script at https://github.com/verl-project/verl-hardware-plugin/blob/main/scripts/baseline_grpo_gsm8k.sh and compare your result with the run at https://swanlab.cn/@heavyrain/verl_grpo_gsm8k_math/runs/8h196r8o/chart |
||
| | IPC support | No (use device tensor path for weight transfer; Python SHM unsupported) | | ||
|
|
||
| ## Environment Variables | ||
|
|
||
| ```bash | ||
| export VERL_PLATFORM=enflame | ||
| export TOPS_VISIBLE_DEVICES=0,1,2,3 | ||
| export RAY_EXPERIMENTAL_NOSET_TOPS_VISIBLE_DEVICES=1 | ||
| export USE_FLAGCX=0 # use ECCL on a homogeneous Enflame cluster | ||
| ``` | ||
|
|
||
| When using **verl (upstream) + verl_hardware_plugin** (rather than the verl-FL built-in platform), | ||
| Ray workers do not inherit shell exports. Pass these variables through Hydra overrides on `ray_init.runtime_env`: | ||
|
|
||
| ```bash | ||
| +ray_kwargs.ray_init.runtime_env.env_vars.VERL_PLATFORM='enflame' | ||
| +ray_kwargs.ray_init.runtime_env.env_vars.VERL_USE_EXTERNAL_MODULES='verl_hardware_plugin' | ||
| +ray_kwargs.ray_init.runtime_env.env_vars.RAY_EXPERIMENTAL_NOSET_TOPS_VISIBLE_DEVICES='1' | ||
| ``` | ||
|
|
||
| Verify before training: | ||
|
|
||
| ```bash | ||
| python -c "import verl_hardware_plugin; import verl; from verl.plugin.platform import get_platform; print(get_platform().device_name)" | ||
| # Expected: enflame (not cpu) | ||
| ``` | ||
|
|
||
| ## Notes | ||
|
|
||
| - `torch_gcu` may patch `torch.cuda.is_available()`; platform auto-detection probes `torch.gcu` before CUDA. | ||
| - FlagCX Stream compatibility is handled in `PlatformENFLAME.ensure_initialized()`. | ||
| - For Migration-based runtime patches, install the Migration package before importing verl. | ||
|
|
||
| ## Related Documentation | ||
|
|
||
| - [FlagOS User Guide](../user_guide_flagos/README.md) | ||
| - [Development Guide](../development.md) | ||
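The Hydra overrides listed under "Environment Variables" in the guide all follow one pattern, so they can be generated rather than typed by hand. The following is a minimal sketch, assuming only the `+ray_kwargs.ray_init.runtime_env.env_vars.<NAME>='<value>'` form shown in the guide; the helper name `build_runtime_env_overrides` is hypothetical and not part of verl or the plugin.

```python
from typing import Dict, List

# Prefix copied from the overrides shown in the user guide.
OVERRIDE_PREFIX = "+ray_kwargs.ray_init.runtime_env.env_vars"


def build_runtime_env_overrides(env_vars: Dict[str, str]) -> List[str]:
    """Turn an env-var mapping into Hydra '+key=value' override strings.

    Hypothetical helper for illustration; it only formats strings.
    """
    return [f"{OVERRIDE_PREFIX}.{name}='{value}'" for name, value in env_vars.items()]


overrides = build_runtime_env_overrides(
    {
        "VERL_PLATFORM": "enflame",
        "VERL_USE_EXTERNAL_MODULES": "verl_hardware_plugin",
        "RAY_EXPERIMENTAL_NOSET_TOPS_VISIBLE_DEVICES": "1",
    }
)

# Print one override per line, ready to append to a launch command.
for override in overrides:
    print(override)
```

Generating the overrides from one mapping keeps the shell exports and the Ray worker environment from drifting apart when a variable is added later.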
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -32,6 +32,58 @@ | |
| logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN")) | ||
|
|
||
|
|
||
| def enflame_fsdp_engine_registered() -> bool: | ||
|     """Return True when Enflame FSDP engines are present on the active EngineRegistry.""" | ||
|     try: | ||
|         from verl.workers.engine.base import EngineRegistry | ||
|
|
||
|         registry = EngineRegistry._engines.get("language_model", {}).get("fsdp", {}) | ||
|         return ("gcu", "enflame") in registry or ("enflame", "enflame") in registry | ||
|     except Exception: | ||
|         return False | ||
|
|
||
|
|
||
| def ensure_enflame_engines_registered() -> None: | ||
|     """Register Enflame engines on the current EngineRegistry if missing. | ||
|
|
||
|     Migration may reload ``verl.workers.engine.base`` after the plugin was first | ||
|     imported, which clears ``EngineRegistry._engines``. Re-import engine modules | ||
|     when lookup keys are absent. | ||
|     """ | ||
|     enflame_required = os.getenv("VERL_PLATFORM", "").strip().lower() == "enflame" | ||
|
|
||
|     if not enflame_fsdp_engine_registered(): | ||
|         try: | ||
|             from verl_hardware_plugin.engines import fsdp_enflame # noqa: F401 | ||
|
|
||
|             logger.info("Registered engines: fsdp_enflame") | ||
|         except Exception as e: | ||
|             if enflame_required: | ||
|                 logger.error("Failed to register Enflame FSDP engines (required): %s", e) | ||
|                 raise | ||
|             logger.debug("ENFLAME FSDP engines not registered: %s", e) | ||
|
|
||
|     try: | ||
|         from verl.workers.engine.base import EngineRegistry | ||
|
|
||
|         megatron_registry = EngineRegistry._engines.get("language_model", {}).get("megatron", {}) | ||
|         if ("gcu", "enflame") not in megatron_registry and ("enflame", "enflame") not in megatron_registry: | ||
|             from verl_hardware_plugin.engines import megatron_enflame # noqa: F401 | ||
|
|
||
|             logger.info("Registered engines: megatron_enflame") | ||
|     except Exception as e: | ||
|         if enflame_required: | ||
|             logger.error("Failed to register Enflame Megatron engines (required): %s", e) | ||
|             raise | ||
|         logger.debug("ENFLAME Megatron engines not registered: %s", e) | ||
|
|
||
|     if enflame_required and not enflame_fsdp_engine_registered(): | ||
|         raise RuntimeError( | ||
|             "Enflame FSDP engine is not registered after ensure_enflame_engines_registered(). " | ||
|             "Set VERL_LOGGING_LEVEL=DEBUG and check fsdp_enflame import errors." | ||
|         ) | ||
|
|
||
|
|
||
| def register_all_engines(): | ||
|     """Import all engine modules to trigger their @register decorators. | ||
|
|
||
|
|
@@ -105,3 +157,7 @@ def register_all_engines(): | |
|         logger.info("Registered engines: megatron_metax") | ||
|     except Exception as e: | ||
|         logger.debug("MetaX Megatron engines not registered: %s", e) | ||
|
|
||
|     # Enflame GCU engines (ECCL/FlagCX communication) | ||
|     ensure_enflame_engines_registered() | ||
|
Collaborator
It's not required. |
||
|
|
||
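The registration code in this file repeats one pattern: attempt to import an engine module, re-raise the failure when `VERL_PLATFORM` marks the platform as required, and otherwise log at debug level and continue. The following is a minimal standalone sketch of that pattern using only the standard library; the helper name `import_engine_module` and the logger name are hypothetical and do not appear in the PR.

```python
import importlib
import logging

logger = logging.getLogger("engine_import_sketch")


def import_engine_module(module_name: str, required: bool) -> bool:
    """Import a module for its side effects; return True on success.

    When `required` is True the original exception propagates, so a
    misconfigured platform fails loudly instead of silently falling back.
    """
    try:
        importlib.import_module(module_name)
    except Exception as exc:
        if required:
            logger.error("Failed to import %s (required): %s", module_name, exc)
            raise
        logger.debug("%s not imported: %s", module_name, exc)
        return False
    return True


# An optional module that is missing is tolerated and reported as False.
optional_ok = import_engine_module("no_such_engine_module_xyz", required=False)

# A module that exists imports normally.
stdlib_ok = import_engine_module("json", required=True)
```

Factoring the pattern into one helper would keep the FSDP and Megatron branches symmetric, though it is only one possible way to structure the code.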
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,53 @@ | ||
| # Copyright (c) 2026 BAAI. All rights reserved. | ||
| # Licensed under the Apache License, Version 2.0. | ||
|
|
||
| """FSDP engine for Enflame GCU devices.""" | ||
|
|
||
| import logging | ||
| import os | ||
|
|
||
| from verl.trainer.config import CheckpointConfig | ||
| from verl.workers.config import FSDPEngineConfig, FSDPOptimizerConfig, HFModelConfig | ||
| from verl.workers.engine.base import EngineRegistry | ||
| from verl.workers.engine.fsdp import FSDPEngineWithLMHead | ||
| from verl.workers.engine.fsdp.transformer_impl import FSDPEngineWithValueHead | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
| logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN")) | ||
|
|
||
|
|
||
| @EngineRegistry.register(model_type="language_model", backend=["fsdp", "fsdp2"], device="gcu", vendor="enflame") | ||
| class FSDPEnflameEngineWithLMHead(FSDPEngineWithLMHead): | ||
|     """FSDP Engine for Enflame GCU with ECCL/FlagCX communication backend.""" | ||
|
|
||
|     def __init__( | ||
|         self, | ||
|         model_config: HFModelConfig, | ||
|         engine_config: FSDPEngineConfig, | ||
|         optimizer_config: FSDPOptimizerConfig, | ||
|         checkpoint_config: CheckpointConfig, | ||
|     ): | ||
|         super().__init__(model_config, engine_config, optimizer_config, checkpoint_config) | ||
|         logger.info("FSDPEnflameEngineWithLMHead initialized") | ||
|
|
||
|     def initialize(self): | ||
|         super().initialize() | ||
|         logger.info("FSDPEnflameEngineWithLMHead initialized for ENFLAME") | ||
|
|
||
|
|
||
| @EngineRegistry.register(model_type="value_model", backend=["fsdp", "fsdp2"], device="gcu", vendor="enflame") | ||
| class FSDPEnflameEngineWithValueHead(FSDPEngineWithValueHead): | ||
|     """FSDP Engine for Enflame GCU value model training.""" | ||
|
|
||
|     def __init__( | ||
|         self, | ||
|         model_config: HFModelConfig, | ||
|         engine_config: FSDPEngineConfig, | ||
|         optimizer_config: FSDPOptimizerConfig, | ||
|         checkpoint_config: CheckpointConfig, | ||
|     ): | ||
|         super().__init__(model_config, engine_config, optimizer_config, checkpoint_config) | ||
|         logger.info("FSDPEnflameEngineWithValueHead initialized") | ||
|
|
||
|     def initialize(self): | ||
|         super().initialize() |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| # Copyright (c) 2026 BAAI. All rights reserved. | ||
| # Licensed under the Apache License, Version 2.0. | ||
|
|
||
| """Megatron engine for Enflame GCU devices.""" | ||
|
|
||
| import logging | ||
| import os | ||
|
|
||
| from verl.workers.engine.base import EngineRegistry | ||
| from verl.workers.engine.megatron.transformer_impl import MegatronEngineWithLMHead | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
| logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN")) | ||
|
|
||
|
|
||
| @EngineRegistry.register(model_type="language_model", backend="megatron", device="gcu", vendor="enflame") | ||
| class MegatronEnflameEngineWithLMHead(MegatronEngineWithLMHead): | ||
|     """Megatron Engine for Enflame GCU with ECCL/FlagCX communication backend.""" | ||
|
|
||
|     def initialize(self): | ||
|         super().initialize() |
|         logger.info("MegatronEnflameEngineWithLMHead initialized for ENFLAME") |
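The engine files above rely on decorator-based registration, and the lookup code in the registry module implies a nested layout of `_engines[model_type][backend][(device, vendor)]`. The following toy registry is a sketch of that idea only: `ToyEngineRegistry` and `ToyEnflameEngine` are hypothetical stand-ins, and verl's real `EngineRegistry` may be implemented differently. It also shows why clearing the mapping (as a module reload would) makes a re-registration check necessary.

```python
from typing import Callable, Dict, List, Tuple, Union


class ToyEngineRegistry:
    """Illustrative registry keyed by model type, backend, and (device, vendor)."""

    _engines: Dict[str, Dict[str, Dict[Tuple[str, str], type]]] = {}

    @classmethod
    def register(
        cls, model_type: str, backend: Union[str, List[str]], device: str, vendor: str
    ) -> Callable[[type], type]:
        # Accept a single backend name or a list of names, as in the PR.
        backends = [backend] if isinstance(backend, str) else list(backend)

        def decorator(engine_cls: type) -> type:
            for name in backends:
                by_backend = cls._engines.setdefault(model_type, {})
                by_backend.setdefault(name, {})[(device, vendor)] = engine_cls
            return engine_cls

        return decorator

    @classmethod
    def is_registered(cls, model_type: str, backend: str, device: str, vendor: str) -> bool:
        return (device, vendor) in cls._engines.get(model_type, {}).get(backend, {})


@ToyEngineRegistry.register(
    model_type="language_model", backend=["fsdp", "fsdp2"], device="gcu", vendor="enflame"
)
class ToyEnflameEngine:
    """Placeholder engine class used only to exercise the registry."""


registered_before = ToyEngineRegistry.is_registered("language_model", "fsdp2", "gcu", "enflame")

# Simulate a module reload wiping the registry contents.
ToyEngineRegistry._engines.clear()
registered_after = ToyEngineRegistry.is_registered("language_model", "fsdp2", "gcu", "enflame")
```

Because the decorator runs only at import time, an emptied registry stays empty until the engine module is imported again, which is the situation `ensure_enflame_engines_registered()` is written to detect.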
The user guide is too simple. You can refer to #3. Ensure that users can start training by following the instructions.