verl-project · heavyrain-lzy · Jun 29, 2026 · Jun 24, 2026 · Jun 26, 2026 · Jun 27, 2026
@@ -21,7 +21,7 @@ The platforms and engines in this repository are **reference implementations**
 | FlagOS | NVIDIA GPU (verified) | FlagCX / NCCL | ✅ Supported | [User Guide](docs/user_guide_flagos/nvidia/README.md) |
 | Intel XPU | Data Center GPU Max / Arc | xccl (oneCCL) | ✅ Example (requires vendor support) | TBD |
 | Cambricon MLU | MLU370 / MLU590 | CNCL | ✅ Supported | [User Guide](docs/user_guide_mlu/README.md) |
-| MetaX | MetaX GPUs (CUDA-compatible) | NCCL | ✅ Example (requires vendor support) | TBD |
+| MetaX | MetaX GPUs (CUDA-compatible) | MCCL | ✅ Supported | [User Guide](docs/user_guide_metax/README.md) |
-| MetaX | MetaX GPUs (CUDA-compatible) | MCCL | ✅ Supported | [User Guide](docs/user_guide_metax/README.md) |
+| MetaX | MetaX GPUs (CUDA-compatible) | NCCL | ✅ Supported | [User Guide](docs/user_guide_metax/install_guidance.md) |
 | Huawei NPU | Ascend 910B | HCCL | Built-in (verl core) | [Ascend Tutorial](https://github.com/verl-project/verl/tree/main/docs/ascend_tutorial) |
 
 

@@ -0,0 +1,96 @@
+# MetaX FAQ and Troubleshooting
+
+## Common Questions
+
+### Q: How do I select the MetaX platform?
+
+Set the environment variable:
+
+```bash
+export VERL_PLATFORM=metax
+```
+
+Or let auto-detection handle it — ensure `mx-smi` is available, and verl will detect MetaX hardware automatically.
+
+### Q: What is the difference between `metax` and `nvidia` in CUDA-compatible mode?
+
+Both MetaX and NVIDIA use `torch.cuda` underneath (device name: `cuda`). The `vendor_name` distinguishes them:
+- `metax` → MetaX GPU hardware (detected via `mx-smi`)
+- `nvidia` → NVIDIA GPU hardware (detected via `nvidia-smi`)
+
+The vendor distinction ensures the correct platform-specific engine is selected during training.
+
+### Q: How does verl distinguish MetaX from NVIDIA?
+
+During auto-detection, verl runs the SMI command (`mx-smi` for MetaX, `nvidia-smi` for NVIDIA). Since both are CUDA-compatible and `torch.cuda.is_available()` returns `True` on both, the SMI check is the only reliable way to distinguish them.
+
+You can bypass auto-detection by setting `VERL_PLATFORM=metax` explicitly.
+
+### Q: What communication backend does MetaX use?
+
+MetaX uses standard **NCCL** for distributed communication. No special communication library is required.
+
+### Q: Does MetaX support FSDP and Megatron training?
+
+Yes. Both FSDP and Megatron engines are registered for MetaX:
+
+- **FSDP**: `FSDPMetaXEngineWithLMHead` / `FSDPMetaXEngineWithValueHead` (backend: `fsdp`, `fsdp2`)
+- **Megatron**: `MegatronMetaXEngineWithLMHead` (backend: `megatron`)
+
+---
+
+## Troubleshooting
+
+### CUDA out of memory during training
+
+- Enable parameter offload and optimizer offload in FSDP config:
+  ```
+  actor_rollout_ref.actor.fsdp_config.param_offload=True
+  actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
+  ```
+- Reduce `ppo_max_token_len_per_gpu` or `train_batch_size`.
+- Lower `rollout_gpu_mem_util` to reserve more memory for training.
+
+
+### Ray device detection issues in Docker
+
+If Ray cannot detect devices inside a `--privileged` Docker container:
+
+```bash
+export RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
+```
+
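+Setting this to `0` stops Ray from overriding the device-visibility environment variable (for example `CUDA_VISIBLE_DEVICES`) for workers that are assigned zero accelerators, which can otherwise hide the devices from those processes.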
+
+### Platform not detected during import
+
+If `verl_hardware_plugin` imports but MetaX is not registered, check the logs:
+
+```bash
+export VERL_LOGGING_LEVEL=DEBUG
+python3 -c "import verl_hardware_plugin"
+```
+
+Look for messages like:
+- `Registered platform: metax (cuda)` — Platform registered successfully
+- `MetaX platform not registered: <error>` — Import failed, check the error message
+
+### Verifying MetaX GPU accessibility
+
+```bash
+# Check torch CUDA access
+python3 -c "import torch; print(f'CUDA devices: {torch.cuda.device_count()}')"
+
+# Check mx-smi
+mx-smi -L
+
+# Check from within verl
+python3 -c "
+from verl.plugin.platform import get_platform
+p = get_platform()
+print(f'Platform: {p.vendor_name}')
+print(f'Device: {p.device_name}')
+print(f'Devices: {p.device_count()}')
+print(f'Available: {p.is_available()}')
+"
+```
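The FAQ above describes two ways the platform is chosen: an explicit `VERL_PLATFORM` override, or auto-detection through the vendor SMI tool. The following is a minimal sketch of that decision order, not verl's actual implementation. The function name `guess_vendor`, the `SMI_COMMANDS` table, and the choice to probe `mx-smi` before `nvidia-smi` are assumptions made for illustration; it also only checks that the tool is on `PATH` rather than running it.

```python
import os
import shutil

# Hypothetical vendor -> SMI tool table, taken from the FAQ text above.
# Probing mx-smi first is an assumption, not documented verl behavior.
SMI_COMMANDS = {"metax": "mx-smi", "nvidia": "nvidia-smi"}


def guess_vendor(env=None, which=shutil.which):
    """Return a vendor name such as 'metax' or 'nvidia', or None if nothing matches.

    Illustrative only: mirrors the order described in the FAQ
    (explicit VERL_PLATFORM first, then SMI tool lookup).
    """
    env = os.environ if env is None else env

    # 1. An explicit override wins and skips auto-detection entirely.
    override = env.get("VERL_PLATFORM", "").strip().lower()
    if override:
        return override

    # 2. Otherwise fall back to whichever vendor SMI tool is found on PATH.
    for vendor, tool in SMI_COMMANDS.items():
        if which(tool) is not None:
            return vendor
    return None


print(f"Guessed vendor: {guess_vendor()}")
```

Passing `env` and `which` as parameters keeps the sketch easy to exercise on a machine without MetaX hardware, for example `guess_vendor(env={}, which=lambda tool: None)`.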
@@ -0,0 +1,107 @@
+## Prerequisites
+
+- MetaX GPU hardware
+- Docker environment
+- Network access to pull images and download models
+
+## 1. Pull the Base Image
+
+Download the Docker image from the [MetaX developer site](https://developer.metax-tech.com/), or contact MetaX engineers to obtain it.
+
+Start a container (MetaX example):
+
+```bash
+docker_image=<metaX-base-image>
+docker_name=verl_test
+sudo docker run -itd \
+    --name ${docker_name} \
+    --net=host \
+    --uts=host \
+    --ipc=host \
+    --privileged=true \
+    --group-add video \
+    --shm-size 100gb \
+    --ulimit memlock=-1 \
+    --security-opt seccomp=unconfined \
+    --security-opt apparmor=unconfined \
+    --device=/dev/dri \
+    --device=/dev/mxcd \
+    --device=/dev/infiniband \
+    ${docker_image} \
+    /bin/bash
+
+sudo docker exec -it ${docker_name} bash
+```
+
+> **Note:** `/dev/mxcd` is the MetaX compute device and `/dev/dri` provides GPU rendering access — both are required for MetaX GPU workloads. Ensure `mx-smi` is available inside the container for hardware auto-detection. Add `-v` mounts for your data and model directories as needed (e.g., `-v /data/share/:/data/share/`).
+
+## 2. Prepare Data and Models
+
+```bash
+cd /workspace
+
+# Download model (example: Qwen3-8B)
+modelscope download --model Qwen/Qwen3-8B --local_dir ./Qwen3-8B
+
+# Download dataset (example: GSM8K)
+mkdir gsm8k && cd gsm8k
+wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/rl/datasets/gsm8k/train.parquet"
+wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/rl/datasets/gsm8k/test.parquet"
+```
+
+## 3. Install verl and verl-hardware-plugin
+
+verl is the RL training framework. For detailed installation options, see: [verl Installation Guide](https://verl.readthedocs.io/en/latest/start/install.html).
+
+verl-hardware-plugin provides the MetaX hardware platform integration for verl. For detailed information, see: [verl-hardware-plugin](https://github.com/verl-project/verl-hardware-plugin).
+
+```bash
+cd /workspace
+git clone https://github.com/verl-project/verl
+cd verl
+pip install --no-build-isolation -v -e .
+
+# Install verl-hardware-plugin (cloned next to verl, not inside it)
+cd /workspace
+git clone https://github.com/verl-project/verl-hardware-plugin.git
+cd verl-hardware-plugin
+pip install --no-build-isolation -v -e .
+```
+
+## 4. Platform Configuration
+
+Set the MetaX platform before launching training:
+
+```bash
+export VERL_PLATFORM=metax
+```
+
+Or let auto-detection handle it (requires `mx-smi` in PATH inside the container). See [Environment Variables Reference](./env_reference.md) for full configuration options.
-Or let auto-detection handle it (requires `mx-smi` in PATH inside the container). See [Environment Variables Reference](./env_reference.md) for full configuration options.
+Or let auto-detection handle it (requires `mx-smi` in PATH inside the container).
+
+## Verification
+
+After installation, verify the components are properly installed:
+
+```bash
+python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
+python3 -c "import vllm; print('vLLM OK')"
+python3 -c "import ray; ray.init(address='auto'); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"
-python3 -c "import ray; ray.init(address='auto'); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"
+python3 -c "import ray; ray.init(); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"
+python3 -c "import transformer_engine; print('TransformerEngine OK')"
+python3 -c "import megatron.core; print('Megatron-LM OK')"
+python3 -c "import verl; print('verl OK')"
+python3 -c "from verl.plugin.platform import get_platform;p = get_platform();print(f'device: {p.device_name}');print(f'vendor: {p.vendor_name}');print(f'available: {p.is_available()}')"
+```
+
+Verify MetaX hardware detection:
+
+```bash
+mx-smi -L                                         # List MetaX GPUs
+python3 -c "
+import subprocess, sys
+try:
+    ret = subprocess.run(['mx-smi', '-L'], capture_output=True, text=True)
+    detected = ret.returncode == 0
+except FileNotFoundError:  # mx-smi is not on PATH
+    detected = False
+print('MetaX platform detected' if detected else 'MetaX platform not detected')
+sys.exit(0 if detected else 1)
+"
+```
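The per-package one-liners in the Verification section can also be gathered into a single report. The sketch below is one possible way to do that, assuming the module names used in the commands above; the helper name `check_modules` is hypothetical. It uses `importlib.util.find_spec`, which only confirms that a package can be located, so it is a lighter check than actually importing each one.

```python
import importlib.util

# Module names taken from the verification commands above.
MODULES = ["torch", "vllm", "ray", "transformer_engine", "megatron.core", "verl"]


def check_modules(names):
    """Map each module name to True if it can be located, else False.

    Hypothetical helper for illustration; it does not validate that the
    package works, only that Python can find it.
    """
    results = {}
    for name in names:
        try:
            results[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. 'megatron') is missing
            # or a loaded module has no usable spec.
            results[name] = False
    return results


for name, found in check_modules(MODULES).items():
    print(f"{name:20s} {'found' if found else 'MISSING'}")
```

A `MISSING` entry points at which install step to revisit; a `found` entry should still be followed by the real import checks above, since locating a package does not prove it loads on MetaX hardware.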
@@ -0,0 +1,23 @@
+# MetaX Quick Start
+
+## Introduction
+This guide walks you through running a GRPO training job with verl on the MetaX platform. Make sure you have completed the [Installation Guide](./install_guidance.md) first.
+
+## Running a Training Script
+
+MetaX provides verl example training scripts under `/workspace/verl/examples`.
+
+Launch the training (example for Qwen3-8B on 8 GPUs):
+
+```bash
+cd /workspace/verl/examples/grpo_trainer/
+bash train_qwen3-8b-8gpus.sh
+```
+
+Training is running successfully if you see step-level progress output in the logs.
+
+> Adjust environment variables in the script (model path, data path, batch size, etc.) to match your setup before running.
+
+## Next Steps
+
+- See the [FAQ](./faq.md) for troubleshooting common issues.