Merged
2 changes: 1 addition & 1 deletion README.md
@@ -21,7 +21,7 @@ The platforms and engines in this repository are **reference implementations**
| FlagOS | NVIDIA GPU (verified) | FlagCX / NCCL | ✅ Supported | [User Guide](docs/user_guide_flagos/nvidia/README.md) |
| Intel XPU | Data Center GPU Max / Arc | xccl (oneCCL) | ✅ Example (requires vendor support) | TBD |
| Cambricon MLU | MLU370 / MLU590 | CNCL | ✅ Supported | [User Guide](docs/user_guide_mlu/README.md) |
-| MetaX | MetaX GPUs (CUDA-compatible) | NCCL | ✅ Example (requires vendor support) | TBD |
+| MetaX | MetaX GPUs (CUDA-compatible) | MCCL | ✅ Supported | [User Guide](docs/user_guide_metax/README.md) |

Review comment (Contributor, medium priority)

There are two issues on this line:

  1. Inconsistent Communication Backend: The table lists MCCL as the communication backend for MetaX, but both the FAQ (docs/user_guide_metax/faq.md line 31) and the platform implementation (platform_cuda_metax.py line 143) state that MetaX uses standard NCCL.
  2. Broken Link: The link [User Guide](docs/user_guide_metax/README.md) points to a non-existent README.md file in the docs/user_guide_metax/ directory. It should point to docs/user_guide_metax/install_guidance.md instead.
Suggested change:
-| MetaX | MetaX GPUs (CUDA-compatible) | MCCL | ✅ Supported | [User Guide](docs/user_guide_metax/README.md) |
+| MetaX | MetaX GPUs (CUDA-compatible) | NCCL | ✅ Supported | [User Guide](docs/user_guide_metax/install_guidance.md) |

| Huawei NPU | Ascend 910B | HCCL | Built-in (verl core) | [Ascend Tutorial](https://github.com/verl-project/verl/tree/main/docs/ascend_tutorial) |
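
The broken-link issue raised in the review comment above is the kind of problem a small script can catch before merge. The following is a minimal sketch, not part of this repository: the function name `broken_relative_links` and the simple inline-link regex are assumptions made for illustration, and reference-style links are not handled.

```python
import re
from pathlib import Path

# Matches inline links of the form [text](target). Hypothetical helper for
# illustration only; reference-style links are ignored.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
URL_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(markdown_text, base_dir):
    """Return relative link targets that do not exist under base_dir."""
    broken = []
    for target in LINK_PATTERN.findall(markdown_text):
        # Skip absolute URLs (https:, mailto:, ...) and in-page anchors.
        if URL_SCHEME.match(target) or target.startswith("#"):
            continue
        # Drop any "#fragment" before checking the file on disk.
        path = target.split("#", 1)[0]
        if not (Path(base_dir) / path).exists():
            broken.append(target)
    return broken
```

Calling `broken_relative_links(Path("README.md").read_text(), ".")` from the repository root would list any relative targets that are missing on disk.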


2 changes: 1 addition & 1 deletion docs/user_guide_metax/README.md
@@ -28,7 +28,7 @@ user_guide_metax/
|------|-------------|
| Device type | `cuda` (CUDA-compatible) |
| Vendor identifier | `metax` |
-| Communication backend | `nccl` |
+| Communication backend | `mccl` |
| Device visibility env var | `CUDA_VISIBLE_DEVICES` |
| Ray resource name | `GPU` |
| IPC support | Yes |
96 changes: 96 additions & 0 deletions docs/user_guide_metax/faq.md
@@ -0,0 +1,96 @@
# MetaX FAQ and Troubleshooting

## Common Questions

### Q: How do I select the MetaX platform?

Set the environment variable:

```bash
export VERL_PLATFORM=metax
```

Or let auto-detection handle it — ensure `mx-smi` is available, and verl will detect MetaX hardware automatically.

### Q: What is the difference between `metax` and `nvidia` in CUDA-compatible mode?

Both MetaX and NVIDIA use `torch.cuda` underneath (device name: `cuda`). The `vendor_name` distinguishes them:
- `metax` → MetaX GPU hardware (detected via `mx-smi`)
- `nvidia` → NVIDIA GPU hardware (detected via `nvidia-smi`)

The vendor distinction ensures the correct platform-specific engine is selected during training.

### Q: How does verl distinguish MetaX from NVIDIA?

During auto-detection, verl runs the SMI command (`mx-smi` for MetaX, `nvidia-smi` for NVIDIA). Since both are CUDA-compatible and `torch.cuda.is_available()` returns `True` on both, the SMI check is the only reliable way to distinguish them.

You can bypass auto-detection by setting `VERL_PLATFORM=metax` explicitly.
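
The detection order described above (explicit `VERL_PLATFORM` first, then an SMI probe) can be sketched as follows. This is an illustration under stated assumptions, not verl's actual implementation: the names `SMI_COMMANDS`, `smi_works`, and `detect_vendor` are hypothetical, and probing MetaX before NVIDIA is an arbitrary choice made for the example.

```python
import os
import shutil
import subprocess

# Hypothetical mapping for illustration; not taken from verl's source.
SMI_COMMANDS = {"metax": "mx-smi", "nvidia": "nvidia-smi"}


def smi_works(command):
    """Return True if the SMI tool is on PATH and `<tool> -L` exits with 0."""
    if shutil.which(command) is None:
        return False
    try:
        result = subprocess.run([command, "-L"], capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def detect_vendor(env=None, probe=smi_works):
    """Pick a vendor name: explicit VERL_PLATFORM first, then SMI probing."""
    env = os.environ if env is None else env
    explicit = env.get("VERL_PLATFORM", "").strip().lower()
    if explicit:
        return explicit
    for vendor, command in SMI_COMMANDS.items():
        if probe(command):
            return vendor
    return None  # no known SMI tool responded
```

The `probe` parameter exists only so the logic can be exercised without real hardware; calling `detect_vendor()` with no arguments uses the real environment and SMI tools.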

### Q: What communication backend does MetaX use?

MetaX uses **MCCL** for distributed communication.

### Q: Does MetaX support FSDP and Megatron training?

Yes. Both FSDP and Megatron engines are registered for MetaX:

- **FSDP**: `FSDPMetaXEngineWithLMHead` / `FSDPMetaXEngineWithValueHead` (backend: `fsdp`, `fsdp2`)
- **Megatron**: `MegatronMetaXEngineWithLMHead` (backend: `megatron`)

---

## Troubleshooting

### CUDA out of memory during training

- Enable parameter offload and optimizer offload in FSDP config:
```
actor_rollout_ref.actor.fsdp_config.param_offload=True
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
```
- Reduce `ppo_max_token_len_per_gpu` or `train_batch_size`.
- Lower `rollout_gpu_mem_util` to reserve more memory for training.


### Ray device detection issues in Docker

If Ray cannot detect devices inside a `--privileged` Docker container:

```bash
export RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
```

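Setting this variable to `0` stops Ray from overriding the device visibility variable (here `CUDA_VISIBLE_DEVICES`) with an empty value for tasks and actors that request zero GPUs, which would otherwise hide the devices from those processes.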

### Platform not detected during import

If `verl_hardware_plugin` imports but MetaX is not registered, check the logs:

```bash
export VERL_LOGGING_LEVEL=DEBUG
python3 -c "import verl_hardware_plugin"
```

Look for messages like:
- `Registered platform: metax (cuda)` — Platform registered successfully
- `MetaX platform not registered: <error>` — Import failed, check the error message

### Verifying MetaX GPU accessibility

```bash
# Check torch CUDA access
python3 -c "import torch; print(f'CUDA devices: {torch.cuda.device_count()}')"

# Check mx-smi
mx-smi -L

# Check from within verl
python3 -c "
from verl.plugin.platform import get_platform
p = get_platform()
print(f'Platform: {p.vendor_name}')
print(f'Device: {p.device_name}')
print(f'Devices: {p.device_count()}')
print(f'Available: {p.is_available()}')
"
```
108 changes: 108 additions & 0 deletions docs/user_guide_metax/install_guidance.md
@@ -0,0 +1,108 @@
## Prerequisites

- MetaX GPU hardware
- Docker environment
- Network access to pull images and download models

## 1. Pull the Base Image

Visit the MetaX Docker Hub page (e.g., https://developer.metax-tech.com/softnova/docker?chip_name=%E6%9B%A6%E4%BA%91C500%E7%B3%BB%E5%88%97&package_name=verl:0.7.1-maca.ai3.5.3.3-torch2.8-py310-ubuntu22.04-amd64), copy the `docker pull` command shown there, and run it in your terminal.

Start a container (e.g., verl:0.7.1-maca.ai3.5.3.3-torch2.8-py310-ubuntu22.04-amd64):

```bash
docker_image=verl:0.7.1-maca.ai3.5.3.3-torch2.8-py310-ubuntu22.04-amd64
docker_name=verl_test
sudo docker run -itd \
--name ${docker_name} \
--net=host \
--uts=host \
--ipc=host \
--privileged=true \
--group-add video \
--shm-size 100gb \
--ulimit memlock=-1 \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--device=/dev/dri \
--device=/dev/mxcd \
--device=/dev/infiniband \
${docker_image} \
/bin/bash

docker exec -it ${docker_name} bash
```

> **Note:** `/dev/mxcd` is the MetaX compute device and `/dev/dri` provides GPU rendering access — both are required for MetaX GPU workloads. Ensure `mx-smi` is available inside the container for hardware auto-detection. Add `-v` mounts for your data and model directories as needed (e.g., `-v /data/share/:/data/share/`).
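
A quick pre-flight check inside the container can confirm the requirements from the note above. The sketch below is an assumption-laden illustration: the helper names `missing_devices` and `preflight` are hypothetical, and the device paths are simply the ones named in the note.

```python
import os
import shutil

# Device nodes named in the note above; adjust if your setup differs.
REQUIRED_DEVICES = ("/dev/mxcd", "/dev/dri")


def missing_devices(paths=REQUIRED_DEVICES, exists=os.path.exists):
    """Return the device paths that are not visible in this container."""
    return [path for path in paths if not exists(path)]


def preflight(which=shutil.which, exists=os.path.exists):
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = [f"missing device node: {p}" for p in missing_devices(exists=exists)]
    if which("mx-smi") is None:
        problems.append("mx-smi not found on PATH")
    return problems
```

Running `print(preflight() or "all checks passed")` inside the container reports anything that is missing. The `which` and `exists` parameters are there only to make the checks testable without MetaX hardware.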

## 2. Prepare Data and Models

```bash
cd /workspace

# Download model (example: Qwen3-8B)
modelscope download --model Qwen/Qwen3-8B --local_dir ./Qwen3-8B

# Download dataset (example: GSM8K)
mkdir gsm8k && cd gsm8k
wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/rl/datasets/gsm8k/train.parquet"
wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/rl/datasets/gsm8k/test.parquet"
```

## 3. Install verl and verl-hardware-plugin

verl is the RL training framework. For detailed installation options, see: [verl Installation Guide](https://verl.readthedocs.io/en/latest/start/install.html).

verl-hardware-plugin provides the MetaX hardware platform integration for verl. For detailed information, see: [verl-hardware-plugin](https://github.com/verl-project/verl-hardware-plugin).

```bash
cd /workspace
git clone https://github.com/verl-project/verl verl_main
cd verl_main
pip install --no-build-isolation --no-dependencies -v -e .

# Install verl-hardware-plugin (clone next to verl_main, not inside it)
cd /workspace
git clone https://github.com/verl-project/verl-hardware-plugin.git
cd verl-hardware-plugin
pip install --no-build-isolation -v -e .
```

## 4. Platform Configuration

Set the MetaX platform before launching training:

```bash
export VERL_PLATFORM=metax
```

Or let auto-detection handle it (requires `mx-smi` in PATH inside the container).

## Verification

After installation, verify the components are properly installed:

```bash
python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python3 -c "import vllm; print('vLLM OK')"
python3 -c "import ray; ray.init(address='auto'); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"

Review comment (Contributor, medium priority)

Execution Error: Running ray.init(address='auto') will attempt to connect to an existing Ray cluster. Since these instructions are for setting up a new container, no Ray cluster is running yet, and this command will fail. Using ray.init() instead will start a local Ray instance and correctly verify GPU detection.

Suggested change:
-python3 -c "import ray; ray.init(address='auto'); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"
+python3 -c "import ray; ray.init(); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"

python3 -c "import transformer_engine; print('TransformerEngine OK')"
python3 -c "import megatron.core; print('Megatron-LM OK')"
python3 -c "import verl; print('verl OK')"
python3 -c "from verl.plugin.platform import get_platform;p = get_platform();print(f'device: {p.device_name}');print(f'vendor: {p.vendor_name}');print(f'available: {p.is_available()}')"
```
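
The per-package import checks above can also be gathered into one pass. This is a minimal sketch: `check_modules` is a hypothetical helper, the module list is copied from the commands above, and locating a module only shows that it is installed, not that it imports cleanly on MetaX hardware.

```python
import importlib.util

# Module names taken from the verification commands above.
MODULES = ("torch", "vllm", "ray", "transformer_engine", "megatron.core", "verl")


def check_modules(names=MODULES):
    """Map each module name to True if Python can locate it."""
    status = {}
    for name in names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises if a parent package (e.g. "megatron") is missing.
            status[name] = False
    return status
```

For example, `for name, ok in check_modules().items(): print(name, "OK" if ok else "MISSING")` prints one line per package.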

Verify MetaX hardware detection:

```bash
mx-smi -L # List MetaX GPUs
python3 -c "
import subprocess, sys
ret = subprocess.run(['mx-smi', '-L'], capture_output=True, text=True)
if ret.returncode == 0:
    print('MetaX platform detected')
else:
    print('MetaX platform not detected')
    sys.exit(1)
"
```
151 changes: 151 additions & 0 deletions docs/user_guide_metax/quick_start.md
@@ -0,0 +1,151 @@
# MetaX Quick Start

## Introduction
This guide walks you through running a GRPO training job with verl on the MetaX platform. Make sure you have completed the [Installation Guide](./install_guidance.md) first.

## Running a Training Script

Save the following as `run_qwen3_0.6b_grpo_gsm8k_metax.sh`. You must adjust `DATA_DIR` and `MODEL_DIR` to your local paths.

```bash
#!/usr/bin/env bash
# GRPO | Qwen3-0.6B | FSDP training | MetaX GPUs

set -xeuo pipefail

export PYTORCH_ENABLE_SAME_RANK_A100=1
export SET_DEVICE_NUMA_PREFERRED=1
export HYDRA_FULL_ERROR=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export MCPYTORCH_DISABLE_PRINT=1
export MAX_JOBS=20
unset PAGEABLE_MEMCPY_ASYNC
unset PYTORCH_CUDA_ALLOC_CONF

export MCCL_MAX_NCHANNELS=16
export PYTHONUNBUFFERED=1
export VERL_PLATFORM=metax
export MACA_MPS_MODE=1

########################### user-adjustable ###########################
DEVICE=${DEVICE:-gpu}
INFER_BACKEND=${INFER_BACKEND:-vllm}

NNODES=${NNODES:-1}
NGPUS_PER_NODE=${NGPUS_PER_NODE:-8}

train_batch_size=${TRAIN_BATCH_SIZE:-64}
ppo_mini_batch_size=${PPO_MINI_BATCH_SIZE:-16}
max_prompt_length=${MAX_PROMPT_LENGTH:-1024}
max_response_length=${MAX_RESPONSE_LENGTH:-1024}
ppo_max_token_len_per_gpu=${PPO_MAX_TOKEN_LEN_PER_GPU:-24576}

actor_lr=${ACTOR_LR:-1e-6}
kl_loss_coef=${KL_LOSS_COEF:-0.001}
entropy_coeff=${ENTROPY_COEFF:-0}

rollout_tp=${ROLLOUT_TP:-2}
rollout_gpu_mem_util=${ROLLOUT_GPU_MEM_UTIL:-0.3}
rollout_n=${ROLLOUT_N:-5}

total_epochs=${TOTAL_EPOCHS:-15}
save_freq=${SAVE_FREQ:-20}
test_freq=${TEST_FREQ:-5}

PROJECT_NAME=${PROJECT_NAME:-verl_grpo_gsm8k_math}
EXPERIMENT_NAME=${EXPERIMENT_NAME:-qwen3_0.6b_grpo_${INFER_BACKEND}_fsdp_$(date +%Y%m%d_%H%M)}
########################### end user-adjustable ###########################

########################### parameter arrays ###########################
# Modify these paths to your actual data/model locations
DATA_DIR=/data1/dh/gsm8k
MODEL_DIR=/data1/dh/Qwen3-0.6B

n_trainer_devices=$NGPUS_PER_NODE

DATA=(
algorithm.adv_estimator=grpo
algorithm.use_kl_in_reward=False
data.train_files="['$DATA_DIR/train.parquet']"
data.val_files="['$DATA_DIR/test.parquet']"
data.train_batch_size=${train_batch_size}
data.max_prompt_length=${max_prompt_length}
data.max_response_length=${max_response_length}
data.filter_overlong_prompts=True
data.truncation='error'
)

MODEL=(
actor_rollout_ref.model.path="$MODEL_DIR"
actor_rollout_ref.model.use_remove_padding=True
actor_rollout_ref.model.enable_gradient_checkpointing=True
)

ACTOR=(
actor_rollout_ref.actor.optim.lr=${actor_lr}
actor_rollout_ref.actor.ppo_mini_batch_size=${ppo_mini_batch_size}
actor_rollout_ref.actor.use_dynamic_bsz=True
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${ppo_max_token_len_per_gpu}
actor_rollout_ref.actor.use_kl_loss=True
actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef}
actor_rollout_ref.actor.kl_loss_type=low_var_kl
actor_rollout_ref.actor.entropy_coeff=${entropy_coeff}
actor_rollout_ref.actor.fsdp_config.param_offload=True
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
)

ROLLOUT=(
actor_rollout_ref.rollout.name=${INFER_BACKEND}
actor_rollout_ref.rollout.tensor_model_parallel_size=${rollout_tp}
actor_rollout_ref.rollout.gpu_memory_utilization=${rollout_gpu_mem_util}
actor_rollout_ref.rollout.n=${rollout_n}
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=True
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${ppo_max_token_len_per_gpu}
+actor_rollout_ref.rollout.enable_sleep_mode=False
actor_rollout_ref.rollout.free_cache_engine=False
)

REF=(
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=True
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${ppo_max_token_len_per_gpu}
actor_rollout_ref.ref.fsdp_config.param_offload=True
)

TRAINER=(
trainer.balance_batch=True
trainer.logger='["console","swanlab"]'
trainer.project_name=${PROJECT_NAME}
trainer.experiment_name=${EXPERIMENT_NAME}
trainer.n_gpus_per_node=${n_trainer_devices}
trainer.nnodes=${NNODES}
trainer.save_freq=${save_freq}
trainer.test_freq=${test_freq}
trainer.total_epochs=${total_epochs}
)

HYDRA_FULL_ERROR=1
########################### launch ###########################
python3 -m verl.trainer.main_ppo \
"${DATA[@]}" \
"${MODEL[@]}" \
"${ACTOR[@]}" \
"${ROLLOUT[@]}" \
"${REF[@]}" \
"${TRAINER[@]}" \
"$@" \
2>&1 | tee -a "verl_demo.log"

```
Launch the training:

```bash
bash run_qwen3_0.6b_grpo_gsm8k_metax.sh
```

Training is running successfully if you see step-level progress output in the logs.
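
For a rough sense of scale, the script's default values work out as shown below. The numbers are copied from the script; the interpretation is an assumption about how verl uses these settings (one rollout batch of `train_batch_size` prompts, split into mini-batches of `ppo_mini_batch_size` prompts), so treat the derived quantities as estimates and check the verl configuration documentation before relying on them.

```python
# Defaults copied from the script above.
train_batch_size = 64
ppo_mini_batch_size = 16
rollout_n = 5
max_prompt_length = 1024
max_response_length = 1024
ppo_max_token_len_per_gpu = 24576

# Each prompt is sampled rollout_n times during rollout.
responses_per_step = train_batch_size * rollout_n

# Assumed: the rollout batch is split into mini-batches counted in prompts.
mini_batches_per_step = train_batch_size // ppo_mini_batch_size

# Longest possible sequence is a full prompt plus a full response.
max_sequence_tokens = max_prompt_length + max_response_length

# How many maximum-length sequences fit in one per-GPU token budget.
full_length_sequences_per_gpu = ppo_max_token_len_per_gpu // max_sequence_tokens

print(responses_per_step, mini_batches_per_step,
      max_sequence_tokens, full_length_sequences_per_gpu)
# prints: 320 4 2048 12
```

If out-of-memory errors appear, these figures suggest which knob to turn first: lowering `ppo_max_token_len_per_gpu` shrinks the per-GPU token budget directly, while lowering `train_batch_size` reduces the number of responses generated per step.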

**Training Log**: [verl_grpo_gsm8k_metax](https://swanlab.cn/@harward/verl_grpo_gsm8k_metax/runs/9xnvjmby/chart)

## Next Steps

- See the [FAQ](./faq.md) for troubleshooting common issues.