verl-project · heavyrain-lzy · Jun 29, 2026 · Jun 24, 2026 · Jun 26, 2026 · Jun 27, 2026
@@ -21,7 +21,7 @@ The platforms and engines in this repository are **reference implementations**
 | FlagOS | NVIDIA GPU (verified) | FlagCX / NCCL | ✅ Supported | [User Guide](docs/user_guide_flagos/nvidia/README.md) |
 | Intel XPU | Data Center GPU Max / Arc | xccl (oneCCL) | ✅ Example (requires vendor support) | TBD |
 | Cambricon MLU | MLU370 / MLU590 | CNCL | ✅ Supported | [User Guide](docs/user_guide_mlu/README.md) |
-| MetaX | MetaX GPUs (CUDA-compatible) | NCCL | ✅ Example (requires vendor support) | TBD |
+| MetaX | MetaX GPUs (CUDA-compatible) | MCCL | ✅ Supported | [User Guide](docs/user_guide_metax/README.md) |
-| MetaX | MetaX GPUs (CUDA-compatible) | MCCL | ✅ Supported | [User Guide](docs/user_guide_metax/README.md) |
+| MetaX | MetaX GPUs (CUDA-compatible) | NCCL | ✅ Supported | [User Guide](docs/user_guide_metax/install_guidance.md) |
 | Huawei NPU | Ascend 910B | HCCL | Built-in (verl core) | [Ascend Tutorial](https://github.com/verl-project/verl/tree/main/docs/ascend_tutorial) |
 
 

@@ -28,7 +28,7 @@ user_guide_metax/
 |------|-------------|
 | Device type | `cuda` (CUDA-compatible) |
 | Vendor identifier | `metax` |
-| Communication backend | `nccl` |
+| Communication backend | `mccl` |
 | Device visibility env var | `CUDA_VISIBLE_DEVICES` |
 | Ray resource name | `GPU` |
 | IPC support | Yes |

@@ -0,0 +1,96 @@
+# MetaX FAQ and Troubleshooting
+
+## Common Questions
+
+### Q: How do I select the MetaX platform?
+
+Set the environment variable:
+
+```bash
+export VERL_PLATFORM=metax
+```
+
+Or let auto-detection handle it — ensure `mx-smi` is available, and verl will detect MetaX hardware automatically.
+
+### Q: What is the difference between `metax` and `nvidia` in CUDA-compatible mode?
+
+Both MetaX and NVIDIA use `torch.cuda` underneath (device name: `cuda`). The `vendor_name` distinguishes them:
+- `metax` → MetaX GPU hardware (detected via `mx-smi`)
+- `nvidia` → NVIDIA GPU hardware (detected via `nvidia-smi`)
+
+The vendor distinction ensures the correct platform-specific engine is selected during training.
+
+### Q: How does verl distinguish MetaX from NVIDIA?
+
+During auto-detection, verl runs the SMI command (`mx-smi` for MetaX, `nvidia-smi` for NVIDIA). Since both are CUDA-compatible and `torch.cuda.is_available()` returns `True` on both, the SMI check is the only reliable way to distinguish them.
+
+You can bypass auto-detection by setting `VERL_PLATFORM=metax` explicitly.
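The detection order described in this answer can be sketched as follows. This is a minimal illustration, not verl's actual implementation: `detect_vendor`, `smi_works`, and the probing order (MetaX before NVIDIA) are assumptions made for the example. Only `VERL_PLATFORM`, `mx-smi`, and `nvidia-smi` come from the text above.

```python
import os
import shutil
import subprocess
from typing import Callable, Mapping, Optional

# Vendor name -> SMI command, as described in the FAQ answer.
# The probing order (metax first) is an assumption of this sketch.
SMI_COMMANDS = {"metax": "mx-smi", "nvidia": "nvidia-smi"}


def smi_works(command: str) -> bool:
    """Return True if the SMI binary is on PATH and `<command> -L` exits with 0."""
    if shutil.which(command) is None:
        return False
    try:
        result = subprocess.run([command, "-L"], capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def detect_vendor(
    env: Mapping[str, str] = os.environ,
    probe: Callable[[str], bool] = smi_works,
) -> Optional[str]:
    """Hypothetical helper: an explicit VERL_PLATFORM wins, then SMI probing."""
    explicit = env.get("VERL_PLATFORM", "").strip().lower()
    if explicit:
        return explicit
    for vendor, command in SMI_COMMANDS.items():
        if probe(command):
            return vendor
    return None
```

Passing `env` and `probe` as parameters keeps the logic testable on machines that have neither SMI tool installed.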
+
+### Q: What communication backend does MetaX use?
+
+MetaX uses **MCCL** for distributed communication.
+
+### Q: Does MetaX support FSDP and Megatron training?
+
+Yes. Both FSDP and Megatron engines are registered for MetaX:
+
+- **FSDP**: `FSDPMetaXEngineWithLMHead` / `FSDPMetaXEngineWithValueHead` (backend: `fsdp`, `fsdp2`)
+- **Megatron**: `MegatronMetaXEngineWithLMHead` (backend: `megatron`)
+
+---
+
+## Troubleshooting
+
+### CUDA out of memory during training
+
+- Enable parameter offload and optimizer offload in FSDP config:
+  ```
+  actor_rollout_ref.actor.fsdp_config.param_offload=True
+  actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
+  ```
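+- Reduce `actor_rollout_ref.actor.ppo_max_token_len_per_gpu` or `data.train_batch_size`.
+- Lower `actor_rollout_ref.rollout.gpu_memory_utilization` (`rollout_gpu_mem_util` in the quick-start script) to reserve more memory for training.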
+
+
+### Ray device detection issues in Docker
+
+If Ray cannot detect devices inside a `--privileged` Docker container:
+
+```bash
+export RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
+```
+
+This disables the override that prevents device detection when the visibility environment variable resolves to zero devices.
+
+### Platform not detected during import
+
+If `verl_hardware_plugin` imports but MetaX is not registered, check the logs:
+
+```bash
+export VERL_LOGGING_LEVEL=DEBUG
+python3 -c "import verl_hardware_plugin"
+```
+
+Look for messages like:
+- `Registered platform: metax (cuda)` — Platform registered successfully
+- `MetaX platform not registered: <error>` — Import failed, check the error message
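When the debug output is long, the two messages listed above can be picked out with a small filter. The sketch below assumes the log lines contain exactly the wording quoted in this section; `summarize_registration` is a hypothetical helper, not part of verl or verl-hardware-plugin.

```python
import re

# Patterns built from the two log messages quoted in this section.
REGISTERED = re.compile(r"Registered platform: (?P<vendor>\w+) \((?P<device>\w+)\)")
FAILED = re.compile(r"MetaX platform not registered: (?P<error>.+)")


def summarize_registration(log_text: str) -> str:
    """Return a one-line verdict for the first MetaX-related message found."""
    for line in log_text.splitlines():
        match = REGISTERED.search(line)
        if match and match.group("vendor") == "metax":
            return f"ok: metax registered on device '{match.group('device')}'"
        match = FAILED.search(line)
        if match:
            return f"failed: {match.group('error').strip()}"
    return "unknown: no MetaX registration message found"
```

Feed it the captured stdout and stderr of the `python3 -c "import verl_hardware_plugin"` command shown above.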
+
+### Verifying MetaX GPU accessibility
+
+```bash
+# Check torch CUDA access
+python3 -c "import torch; print(f'CUDA devices: {torch.cuda.device_count()}')"
+
+# Check mx-smi
+mx-smi -L
+
+# Check from within verl
+python3 -c "
+from verl.plugin.platform import get_platform
+p = get_platform()
+print(f'Platform: {p.vendor_name}')
+print(f'Device: {p.device_name}')
+print(f'Devices: {p.device_count()}')
+print(f'Available: {p.is_available()}')
+"
+```
@@ -0,0 +1,108 @@
+## Prerequisites
+
+- MetaX GPU hardware
+- Docker environment
+- Network access to pull images and download models
+
+## 1. Pull the Base Image
+
+Visit the MetaX Docker Hub page (e.g., https://developer.metax-tech.com/softnova/docker?chip_name=%E6%9B%A6%E4%BA%91C500%E7%B3%BB%E5%88%97&package_name=verl:0.7.1-maca.ai3.5.3.3-torch2.8-py310-ubuntu22.04-amd64).
+Copy the `docker pull` command shown there and run it in your terminal.
+
+Start a container (e.g., verl:0.7.1-maca.ai3.5.3.3-torch2.8-py310-ubuntu22.04-amd64):
+
+```bash
+docker_image=verl:0.7.1-maca.ai3.5.3.3-torch2.8-py310-ubuntu22.04-amd64
+docker_name=verl_test
+sudo docker run -itd \
+    --name ${docker_name} \
+    --net=host \
+    --uts=host \
+    --ipc=host \
+    --privileged=true \
+    --group-add video \
+    --shm-size 100gb \
+    --ulimit memlock=-1 \
+    --security-opt seccomp=unconfined \
+    --security-opt apparmor=unconfined \
+    --device=/dev/dri \
+    --device=/dev/mxcd \
+    --device=/dev/infiniband \
+    ${docker_image} \
+    /bin/bash
+
+sudo docker exec -it ${docker_name} bash
+```
+
+> **Note:** `/dev/mxcd` is the MetaX compute device and `/dev/dri` provides GPU rendering access — both are required for MetaX GPU workloads. Ensure `mx-smi` is available inside the container for hardware auto-detection. Add `-v` mounts for your data and model directories as needed (e.g., `-v /data/share/:/data/share/`).
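A quick way to confirm that the device nodes from the note were actually passed into the container is to check for them from Python. This is a minimal sketch: `missing_nodes` is a hypothetical helper, and treating `/dev/infiniband` as optional is an assumption (the note above only names `/dev/mxcd` and `/dev/dri` as required).

```python
import os
from typing import Callable, Iterable, List

# Device nodes named as required in the note above.
REQUIRED_DEVICE_NODES = ("/dev/mxcd", "/dev/dri")
# Also passed to `docker run`; treated as optional here (assumption).
OPTIONAL_DEVICE_NODES = ("/dev/infiniband",)


def missing_nodes(
    paths: Iterable[str],
    exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return the subset of `paths` that is not visible in this environment."""
    return [path for path in paths if not exists(path)]
```

Inside the container, `missing_nodes(REQUIRED_DEVICE_NODES)` should return an empty list. A non-empty result points at a missing `--device` flag.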
+
+## 2. Prepare Data and Models
+
+```bash
+cd /workspace
+
+# Download model (example: Qwen3-8B)
+modelscope download --model Qwen/Qwen3-8B --local_dir ./Qwen3-8B
+
+# Download dataset (example: GSM8K)
+mkdir -p gsm8k && cd gsm8k
+wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/rl/datasets/gsm8k/train.parquet"
+wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/rl/datasets/gsm8k/test.parquet"
+```
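A failed `wget` can leave an HTML error page or a truncated file behind. The sketch below is a cheap sanity check that relies only on the Parquet file format, in which an unencrypted file starts and ends with the 4-byte magic `PAR1`. `looks_like_parquet` is a hypothetical helper name, and the check does not validate the file contents.

```python
import os

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path: str) -> bool:
    """True if the file starts and ends with the 4-byte Parquet magic."""
    try:
        if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
            return False
        with open(path, "rb") as handle:
            head = handle.read(len(PARQUET_MAGIC))
            handle.seek(-len(PARQUET_MAGIC), os.SEEK_END)
            tail = handle.read(len(PARQUET_MAGIC))
    except OSError:
        # Missing or unreadable file.
        return False
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

For the files downloaded above, both `looks_like_parquet("train.parquet")` and `looks_like_parquet("test.parquet")` should return `True`.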
+
+## 3. Install verl and verl-hardware-plugin
+
+verl is the RL training framework. For detailed installation options, see: [verl Installation Guide](https://verl.readthedocs.io/en/latest/start/install.html).
+
+verl-hardware-plugin provides the MetaX hardware platform integration for verl. For detailed information, see: [verl-hardware-plugin](https://github.com/verl-project/verl-hardware-plugin).
+
+```bash
+cd /workspace
+git clone https://github.com/verl-project/verl verl_main
+cd verl_main
+pip install --no-build-isolation --no-dependencies -v -e .
+
+# Install verl-hardware-plugin (cloned next to verl_main, not inside it)
+cd /workspace
+git clone https://github.com/verl-project/verl-hardware-plugin.git
+cd verl-hardware-plugin
+pip install --no-build-isolation -v -e .
+```
+
+## 4. Platform Configuration
+
+Set the MetaX platform before launching training:
+
+```bash
+export VERL_PLATFORM=metax
+```
+
+Or let auto-detection handle it (requires `mx-smi` in PATH inside the container).
+
+## 5. Verification
+
+After installation, verify the components are properly installed:
+
+```bash
+python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
+python3 -c "import vllm; print('vLLM OK')"
+python3 -c "import ray; ray.init(address='auto'); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"
-python3 -c "import ray; ray.init(address='auto'); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"
+python3 -c "import ray; ray.init(); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"
+python3 -c "import transformer_engine; print('TransformerEngine OK')"
+python3 -c "import megatron.core; print('Megatron-LM OK')"
+python3 -c "import verl; print('verl OK')"
+python3 -c "from verl.plugin.platform import get_platform;p = get_platform();print(f'device: {p.device_name}');print(f'vendor: {p.vendor_name}');print(f'available: {p.is_available()}')"
+```
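The per-package checks above can also be collected into one pass. The sketch below only tests whether each module can be located, using `importlib.util.find_spec` from the standard library. It does not import the packages, so it will not catch a broken shared library. `find_missing` is a hypothetical helper, and the module list is copied from the commands above.

```python
import importlib.util
from typing import Iterable, List

# Module names taken from the verification commands above.
MODULES = ("torch", "vllm", "ray", "transformer_engine", "megatron.core", "verl")


def find_missing(modules: Iterable[str] = MODULES) -> List[str]:
    """Return the modules that cannot be located in the current environment."""
    missing = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package is absent or has no usable spec.
            spec = None
        if spec is None:
            missing.append(name)
    return missing
```

An empty list from `find_missing()` means every package is installed. The individual `python3 -c "import ..."` commands remain the stronger test.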
+
+Verify MetaX hardware detection:
+
+```bash
+mx-smi -L                                         # List MetaX GPUs
+python3 -c "
+import subprocess, sys
+ret = subprocess.run(['mx-smi', '-L'], capture_output=True, text=True)
+if ret.returncode == 0:
+    print('MetaX platform detected')
+else:
+    print('MetaX platform not detected')
+    sys.exit(1)
+"
+```
@@ -0,0 +1,151 @@
+# MetaX Quick Start
+
+## Introduction
+This guide walks you through running a GRPO training job with verl on the MetaX platform. Make sure you have completed the [Installation Guide](./install_guidance.md) first.
+
+## Running a Training Script
+
+Save the following as `run_qwen3_0.6b_grpo_gsm8k_metax.sh`. You must adjust `DATA_DIR` and `MODEL_DIR` to your local paths.
+
+```bash
+#!/usr/bin/env bash
+# GRPO | Qwen3-0.6B | FSDP training | MetaX GPUs
+
+set -xeuo pipefail
+
+export PYTORCH_ENABLE_SAME_RANK_A100=1
+export SET_DEVICE_NUMA_PREFERRED=1
+export HYDRA_FULL_ERROR=1
+export CUDA_DEVICE_MAX_CONNECTIONS=1
+export MCPYTORCH_DISABLE_PRINT=1
+export MAX_JOBS=20
+unset PAGEABLE_MEMCPY_ASYNC
+unset PYTORCH_CUDA_ALLOC_CONF
+
+export MCCL_MAX_NCHANNELS=16
+export PYTHONUNBUFFERED=1
+export VERL_PLATFORM=metax
+export MACA_MPS_MODE=1
+
+########################### user-adjustable ###########################
+DEVICE=${DEVICE:-gpu}
+INFER_BACKEND=${INFER_BACKEND:-vllm}
+
+NNODES=${NNODES:-1}
+NGPUS_PER_NODE=${NGPUS_PER_NODE:-8}
+
+train_batch_size=${TRAIN_BATCH_SIZE:-64}
+ppo_mini_batch_size=${PPO_MINI_BATCH_SIZE:-16}
+max_prompt_length=${MAX_PROMPT_LENGTH:-1024}
+max_response_length=${MAX_RESPONSE_LENGTH:-1024}
+ppo_max_token_len_per_gpu=${PPO_MAX_TOKEN_LEN_PER_GPU:-24576}
+
+actor_lr=${ACTOR_LR:-1e-6}
+kl_loss_coef=${KL_LOSS_COEF:-0.001}
+entropy_coeff=${ENTROPY_COEFF:-0}
+
+rollout_tp=${ROLLOUT_TP:-2}
+rollout_gpu_mem_util=${ROLLOUT_GPU_MEM_UTIL:-0.3}
+rollout_n=${ROLLOUT_N:-5}
+
+total_epochs=${TOTAL_EPOCHS:-15}
+save_freq=${SAVE_FREQ:-20}
+test_freq=${TEST_FREQ:-5}
+
+PROJECT_NAME=${PROJECT_NAME:-verl_grpo_gsm8k_math}
+EXPERIMENT_NAME=${EXPERIMENT_NAME:-qwen3_0.6b_grpo_${INFER_BACKEND}_fsdp_$(date +%Y%m%d_%H%M)}
+########################### end user-adjustable ###########################
+
+########################### parameter arrays ###########################
+# Modify these paths to your actual data/model locations
+DATA_DIR=/data1/dh/gsm8k
+MODEL_DIR=/data1/dh/Qwen3-0.6B
+
+n_trainer_devices=$NGPUS_PER_NODE
+
+DATA=(
+    algorithm.adv_estimator=grpo
+    algorithm.use_kl_in_reward=False
+    data.train_files="['$DATA_DIR/train.parquet']"
+    data.val_files="['$DATA_DIR/test.parquet']"
+    data.train_batch_size=${train_batch_size}
+    data.max_prompt_length=${max_prompt_length}
+    data.max_response_length=${max_response_length}
+    data.filter_overlong_prompts=True
+    data.truncation='error'
+)
+
+MODEL=(
+    actor_rollout_ref.model.path="$MODEL_DIR"
+    actor_rollout_ref.model.use_remove_padding=True
+    actor_rollout_ref.model.enable_gradient_checkpointing=True
+)
+
+ACTOR=(
+    actor_rollout_ref.actor.optim.lr=${actor_lr}
+    actor_rollout_ref.actor.ppo_mini_batch_size=${ppo_mini_batch_size}
+    actor_rollout_ref.actor.use_dynamic_bsz=True
+    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${ppo_max_token_len_per_gpu}
+    actor_rollout_ref.actor.use_kl_loss=True
+    actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef}
+    actor_rollout_ref.actor.kl_loss_type=low_var_kl
+    actor_rollout_ref.actor.entropy_coeff=${entropy_coeff}
+    actor_rollout_ref.actor.fsdp_config.param_offload=True
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
+)
+
+ROLLOUT=(
+    actor_rollout_ref.rollout.name=${INFER_BACKEND}
+    actor_rollout_ref.rollout.tensor_model_parallel_size=${rollout_tp}
+    actor_rollout_ref.rollout.gpu_memory_utilization=${rollout_gpu_mem_util}
+    actor_rollout_ref.rollout.n=${rollout_n}
+    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=True
+    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${ppo_max_token_len_per_gpu}
+    +actor_rollout_ref.rollout.enable_sleep_mode=False
+    actor_rollout_ref.rollout.free_cache_engine=False
+)
+
+REF=(
+    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=True
+    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${ppo_max_token_len_per_gpu}
+    actor_rollout_ref.ref.fsdp_config.param_offload=True
+)
+
+TRAINER=(
+    trainer.balance_batch=True
+    trainer.logger="['console','swanlab']"
+    trainer.project_name=${PROJECT_NAME}
+    trainer.experiment_name=${EXPERIMENT_NAME}
+    trainer.n_gpus_per_node=${n_trainer_devices}
+    trainer.nnodes=${NNODES}
+    trainer.save_freq=${save_freq}
+    trainer.test_freq=${test_freq}
+    trainer.total_epochs=${total_epochs}
+)
+
+HYDRA_FULL_ERROR=1
+########################### launch ###########################
+python3 -m verl.trainer.main_ppo \
+    "${DATA[@]}" \
+    "${MODEL[@]}" \
+    "${ACTOR[@]}" \
+    "${ROLLOUT[@]}" \
+    "${REF[@]}" \
+    "${TRAINER[@]}" \
+    "$@" \
+    2>&1 | tee -a "verl_demo.log"
+
+```
+
+Launch the training:
+
+```bash
+bash run_qwen3_0.6b_grpo_gsm8k_metax.sh
+```
+
+Training is running successfully if you see step-level progress output in the logs.
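To make the script's default values easier to reason about when tuning for memory, the arithmetic they imply can be written out. This is a sketch under stated assumptions: it takes `ppo_mini_batch_size` to be counted in prompts across all GPUs and each rollout replica to occupy `rollout_tp` GPUs, which matches common verl usage but is not confirmed by this guide. `step_budget` is a hypothetical helper.

```python
def step_budget(
    train_batch_size: int = 64,
    rollout_n: int = 5,
    ppo_mini_batch_size: int = 16,
    n_gpus: int = 8,
    rollout_tp: int = 2,
    max_prompt_length: int = 1024,
    max_response_length: int = 1024,
    ppo_max_token_len_per_gpu: int = 24576,
) -> dict:
    """Derive per-step quantities from the quick-start script defaults."""
    max_sequence_tokens = max_prompt_length + max_response_length
    return {
        # Each prompt is sampled rollout_n times.
        "responses_per_step": train_batch_size * rollout_n,
        # Assumes mini-batch size is counted in prompts.
        "mini_batches_per_step": train_batch_size // ppo_mini_batch_size,
        # Assumes one rollout replica per tensor-parallel group.
        "rollout_replicas": n_gpus // rollout_tp,
        "max_sequence_tokens": max_sequence_tokens,
        # Full-length sequences that fit in one dynamic batch on one GPU.
        "full_sequences_per_gpu_batch": ppo_max_token_len_per_gpu // max_sequence_tokens,
    }
```

With the defaults, this gives 320 responses per step, 4 mini-batches, 4 rollout replicas, and room for 12 full-length 2048-token sequences per GPU batch.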
+
+**Training Log**: [verl_grpo_gsm8k_metax](https://swanlab.cn/@harward/verl_grpo_gsm8k_metax/runs/9xnvjmby/chart)
+
+## Next Steps
+
+- See the [FAQ](./faq.md) for troubleshooting common issues.