Merged
2 changes: 1 addition & 1 deletion README.md
@@ -21,7 +21,7 @@ The platforms and engines in this repository are **reference implementations**
| FlagOS | NVIDIA GPU (verified) | FlagCX / NCCL | ✅ Supported | [User Guide](docs/user_guide_flagos/nvidia/README.md) |
| Intel XPU | Data Center GPU Max / Arc | xccl (oneCCL) | ✅ Example (requires vendor support) | TBD |
| Cambricon MLU | MLU370 / MLU590 | CNCL | ✅ Supported | [User Guide](docs/user_guide_mlu/README.md) |
- | MetaX | MetaX GPUs (CUDA-compatible) | NCCL | ✅ Example (requires vendor support) | TBD |
+ | MetaX | MetaX GPUs (CUDA-compatible) | MCCL | ✅ Supported | [User Guide](docs/user_guide_metax/README.md) |

Contributor

medium

There are two issues on this line:

  1. Inconsistent Communication Backend: The table lists MCCL as the communication backend for MetaX, but both the FAQ (docs/user_guide_metax/faq.md line 31) and the platform implementation (platform_cuda_metax.py line 143) state that MetaX uses standard NCCL.
  2. Broken Link: The link [User Guide](docs/user_guide_metax/README.md) points to a non-existent README.md file in the docs/user_guide_metax/ directory. It should point to docs/user_guide_metax/install_guidance.md instead.
Suggested change:
- | MetaX | MetaX GPUs (CUDA-compatible) | MCCL | ✅ Supported | [User Guide](docs/user_guide_metax/README.md) |
+ | MetaX | MetaX GPUs (CUDA-compatible) | NCCL | ✅ Supported | [User Guide](docs/user_guide_metax/install_guidance.md) |

| Huawei NPU | Ascend 910B | HCCL | Built-in (verl core) | [Ascend Tutorial](https://github.com/verl-project/verl/tree/main/docs/ascend_tutorial) |


96 changes: 96 additions & 0 deletions docs/user_guide_metax/faq.md
@@ -0,0 +1,96 @@
# MetaX FAQ and Troubleshooting

## Common Questions

### Q: How do I select the MetaX platform?

Set the environment variable:

```bash
export VERL_PLATFORM=metax
```

Or let auto-detection handle it — ensure `mx-smi` is available, and verl will detect MetaX hardware automatically.

### Q: What is the difference between `metax` and `nvidia` in CUDA-compatible mode?

Both MetaX and NVIDIA use `torch.cuda` underneath (device name: `cuda`). The `vendor_name` distinguishes them:
- `metax` → MetaX GPU hardware (detected via `mx-smi`)
- `nvidia` → NVIDIA GPU hardware (detected via `nvidia-smi`)

The vendor distinction ensures the correct platform-specific engine is selected during training.

### Q: How does verl distinguish MetaX from NVIDIA?

During auto-detection, verl runs the SMI command (`mx-smi` for MetaX, `nvidia-smi` for NVIDIA). Since both are CUDA-compatible and `torch.cuda.is_available()` returns `True` on both, the SMI check is the only reliable way to distinguish them.

You can bypass auto-detection by setting `VERL_PLATFORM=metax` explicitly.
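
As a rough illustration of the order described above (explicit override first, SMI probe second), the sketch below resolves a vendor name. It is not verl's actual detection code: `resolve_vendor` and `SMI_COMMANDS` are hypothetical names, and the sketch only checks whether the SMI tool is on `PATH` instead of running it.

```python
import os
import shutil

# Hypothetical mapping for illustration: vendor name -> SMI command named in this FAQ.
SMI_COMMANDS = {"metax": "mx-smi", "nvidia": "nvidia-smi"}


def resolve_vendor(env=None, which=shutil.which):
    """Return the vendor name, preferring an explicit VERL_PLATFORM override."""
    env = os.environ if env is None else env
    override = env.get("VERL_PLATFORM", "").strip().lower()
    if override:
        return override
    # No override: fall back to probing for each vendor's SMI tool on PATH.
    for vendor, command in SMI_COMMANDS.items():
        if which(command) is not None:
            return vendor
    return None


if __name__ == "__main__":
    print(f"Resolved vendor: {resolve_vendor()}")
```

The `env` and `which` parameters exist only so the logic can be exercised without MetaX hardware; on a real host, call `resolve_vendor()` with no arguments.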

### Q: What communication backend does MetaX use?

MetaX uses standard **NCCL** for distributed communication. No special communication library is required.

### Q: Does MetaX support FSDP and Megatron training?

Yes. Both FSDP and Megatron engines are registered for MetaX:

- **FSDP**: `FSDPMetaXEngineWithLMHead` / `FSDPMetaXEngineWithValueHead` (backend: `fsdp`, `fsdp2`)
- **Megatron**: `MegatronMetaXEngineWithLMHead` (backend: `megatron`)
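
The mapping above can be summarized as a small lookup table. This is an illustrative sketch only: the class names and backend strings come from the list above, but `METAX_ENGINES`, `engine_for`, and the `lm_head` / `value_head` labels are hypothetical and are not verl's registry API.

```python
# Illustrative lookup only: a hypothetical stand-in, not verl's engine registry.
METAX_ENGINES = {
    ("fsdp", "lm_head"): "FSDPMetaXEngineWithLMHead",
    ("fsdp", "value_head"): "FSDPMetaXEngineWithValueHead",
    ("fsdp2", "lm_head"): "FSDPMetaXEngineWithLMHead",
    ("fsdp2", "value_head"): "FSDPMetaXEngineWithValueHead",
    ("megatron", "lm_head"): "MegatronMetaXEngineWithLMHead",
}


def engine_for(backend, head="lm_head"):
    """Return the engine class name listed in the FAQ for a backend/head pair."""
    try:
        return METAX_ENGINES[(backend, head)]
    except KeyError:
        raise ValueError(f"no MetaX engine listed for {backend!r} with {head!r}") from None


if __name__ == "__main__":
    print(engine_for("fsdp2", "value_head"))
```

The FAQ lists no Megatron value-head engine, so the sketch raises an error for that pair.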

---

## Troubleshooting

### CUDA out of memory during training

- Enable parameter offload and optimizer offload in FSDP config:
```
actor_rollout_ref.actor.fsdp_config.param_offload=True
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
```
- Reduce `ppo_max_token_len_per_gpu` or `train_batch_size`.
- Lower `rollout_gpu_mem_util` to reserve more memory for training.
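
If you assemble the launch command programmatically, the offload settings above can be rendered as `key=value` overrides. This is a minimal sketch: `to_overrides` and `OFFLOAD` are hypothetical helper names, and only the two keys shown above are used.

```python
def to_overrides(settings):
    """Render a flat dict as key=value override strings."""
    return [f"{key}={value}" for key, value in settings.items()]


# The two offload switches from the FSDP config snippet above.
OFFLOAD = {
    "actor_rollout_ref.actor.fsdp_config.param_offload": True,
    "actor_rollout_ref.actor.fsdp_config.optimizer_offload": True,
}

if __name__ == "__main__":
    print(" ".join(to_overrides(OFFLOAD)))
```

Append the resulting strings to the training command in your own launch script.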


### Ray device detection issues in Docker

If Ray cannot detect devices inside a `--privileged` Docker container:

```bash
export RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
```

This disables the override that prevents device detection when the visibility environment variable resolves to zero devices.

### Platform not detected during import

If `verl_hardware_plugin` imports but MetaX is not registered, check the logs:

```bash
export VERL_LOGGING_LEVEL=DEBUG
python3 -c "import verl_hardware_plugin"
```

Look for messages like:
- `Registered platform: metax (cuda)` — Platform registered successfully
- `MetaX platform not registered: <error>` — Import failed, check the error message
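
The same check can be scripted. The sketch below assumes the log lines match the two examples above exactly; the real wording may differ, so treat the patterns as a starting point. `classify_registration` and `check_plugin_import` are hypothetical helper names.

```python
import os
import re
import subprocess
import sys

# Patterns modeled on the two example messages in this FAQ (assumed format).
REGISTERED = re.compile(r"Registered platform: metax \((?P<device>\w+)\)")
FAILED = re.compile(r"MetaX platform not registered: (?P<error>.+)")


def classify_registration(log_text):
    """Return ("registered", device), ("failed", error) or ("unknown", None)."""
    match = REGISTERED.search(log_text)
    if match:
        return "registered", match.group("device")
    match = FAILED.search(log_text)
    if match:
        return "failed", match.group("error").strip()
    return "unknown", None


def check_plugin_import():
    """Import the plugin in a child process with debug logging and classify its output."""
    env = dict(os.environ, VERL_LOGGING_LEVEL="DEBUG")
    result = subprocess.run(
        [sys.executable, "-c", "import verl_hardware_plugin"],
        env=env,
        capture_output=True,
        text=True,
    )
    return classify_registration(result.stdout + result.stderr)
```

Calling `check_plugin_import()` runs the import in a separate process, so a failed import does not affect the calling interpreter.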

### Verifying MetaX GPU accessibility

```bash
# Check torch CUDA access
python3 -c "import torch; print(f'CUDA devices: {torch.cuda.device_count()}')"

# Check mx-smi
mx-smi -L

# Check from within verl
python3 -c "
from verl.plugin.platform import get_platform
p = get_platform()
print(f'Platform: {p.vendor_name}')
print(f'Device: {p.device_name}')
print(f'Devices: {p.device_count()}')
print(f'Available: {p.is_available()}')
"
```
107 changes: 107 additions & 0 deletions docs/user_guide_metax/install_guidance.md
@@ -0,0 +1,107 @@
## Prerequisites

- MetaX GPU hardware
- Docker environment
- Network access to pull images and download models

## 1. Pull the Base Image

Download the Docker image from the [MetaX developer portal](https://developer.metax-tech.com/), or contact MetaX engineers to obtain it.

Collaborator

Provide at least one usable image name.


Start a container (MetaX example):

```bash
docker_image=<metaX-base-image>
docker_name=verl_test
sudo docker run -itd \
--name ${docker_name} \
--net=host \
--uts=host \
--ipc=host \
--privileged=true \
--group-add video \
--shm-size 100gb \
--ulimit memlock=-1 \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--device=/dev/dri \
--device=/dev/mxcd \
--device=/dev/infiniband \
${docker_image} \
/bin/bash

sudo docker exec -it ${docker_name} bash
```

> **Note:** `/dev/mxcd` is the MetaX compute device and `/dev/dri` provides GPU rendering access — both are required for MetaX GPU workloads. Ensure `mx-smi` is available inside the container for hardware auto-detection. Add `-v` mounts for your data and model directories as needed (e.g., `-v /data/share/:/data/share/`).
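
A quick preflight inside the container can confirm the requirements in the note above. This is a minimal sketch: `preflight` and `REQUIRED_DEVICES` are hypothetical names, and it checks only that the device nodes exist and that `mx-smi` is on `PATH`, not that the GPUs are usable.

```python
import os
import shutil

# Device nodes named in the note above.
REQUIRED_DEVICES = ("/dev/mxcd", "/dev/dri")


def preflight(devices=REQUIRED_DEVICES, tool="mx-smi",
              exists=os.path.exists, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = [f"missing device node: {path}" for path in devices if not exists(path)]
    if which(tool) is None:
        problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("Preflight OK" if not issues else "\n".join(issues))
```

The `exists` and `which` parameters are there only to make the function testable off-device.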

## 2. Prepare Data and Models

```bash
cd /workspace

# Download model (example: Qwen3-8B)
modelscope download --model Qwen/Qwen3-8B --local_dir ./Qwen3-8B

# Download dataset (example: GSM8K)
mkdir gsm8k && cd gsm8k
wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/rl/datasets/gsm8k/train.parquet"
wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/rl/datasets/gsm8k/test.parquet"
```
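
Interrupted or redirected downloads can leave a truncated file or an HTML error page behind. As a light sanity check, the sketch below relies on the fact that Parquet files start and end with the 4-byte magic `PAR1`. `looks_like_parquet` is a hypothetical helper, and the check does not validate the file contents.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    path = Path(path)
    if not path.is_file() or path.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with path.open("rb") as handle:
        head = handle.read(len(PARQUET_MAGIC))
        handle.seek(-len(PARQUET_MAGIC), 2)  # 2 = seek relative to end of file
        tail = handle.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


if __name__ == "__main__":
    # Assumes the files were downloaded into the current directory.
    for name in ("train.parquet", "test.parquet"):
        print(name, "OK" if looks_like_parquet(name) else "missing or not Parquet")
```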

## 3. Install verl and verl-hardware-plugin

verl is the RL training framework. For detailed installation options, see: [verl Installation Guide](https://verl.readthedocs.io/en/latest/start/install.html).

verl-hardware-plugin provides the MetaX hardware platform integration for verl. For detailed information, see: [verl-hardware-plugin](https://github.com/verl-project/verl-hardware-plugin).

```bash
cd /workspace
git clone https://github.com/verl-project/verl
cd verl
pip install --no-build-isolation -v -e .

# Install verl-hardware-plugin
git clone https://github.com/verl-project/verl-hardware-plugin.git
cd verl-hardware-plugin
pip install --no-build-isolation -v -e .
```

## 4. Platform Configuration

Set the MetaX platform before launching training:

```bash
export VERL_PLATFORM=metax
```

Or let auto-detection handle it (requires `mx-smi` in PATH inside the container). See [Environment Variables Reference](./env_reference.md) for full configuration options.

Contributor

medium

Broken Link: The link [Environment Variables Reference](./env_reference.md) points to a non-existent file ./env_reference.md in the docs/user_guide_metax/ directory. Since there is no such file, we should remove this reference to avoid a broken link.

Suggested change:
- Or let auto-detection handle it (requires `mx-smi` in PATH inside the container). See [Environment Variables Reference](./env_reference.md) for full configuration options.
+ Or let auto-detection handle it (requires `mx-smi` in PATH inside the container).


## Verification

After installation, verify the components are properly installed:

```bash
python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python3 -c "import vllm; print('vLLM OK')"
python3 -c "import ray; ray.init(address='auto'); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"

Contributor

medium

Execution Error: Running ray.init(address='auto') will attempt to connect to an existing Ray cluster. Since these instructions are for setting up a new container, no Ray cluster is running yet, and this command will fail. Using ray.init() instead will start a local Ray instance and correctly verify GPU detection.

Suggested change:
- python3 -c "import ray; ray.init(address='auto'); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"
+ python3 -c "import ray; ray.init(); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"

python3 -c "import transformer_engine; print('TransformerEngine OK')"
python3 -c "import megatron.core; print('Megatron-LM OK')"
python3 -c "import verl; print('verl OK')"
python3 -c "from verl.plugin.platform import get_platform;p = get_platform();print(f'device: {p.device_name}');print(f'vendor: {p.vendor_name}');print(f'available: {p.is_available()}')"
```
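
To get a single summary instead of reading each line, the import checks above can be collapsed into one pass. This is a sketch under one stated simplification: `importlib.util.find_spec` only locates a module (importing parent packages where needed), so it will not catch errors that occur while a module is being imported. `missing_modules` and `MODULES` are hypothetical names.

```python
import importlib.util

# Module names taken from the verification commands above.
MODULES = ("torch", "vllm", "ray", "transformer_engine", "megatron.core", "verl")


def missing_modules(names=MODULES):
    """Return the subset of names that cannot be located in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when a parent package (for example "megatron") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules()
    print("All modules found" if not absent else f"Missing: {', '.join(absent)}")
```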

Verify MetaX hardware detection:

```bash
mx-smi -L # List MetaX GPUs
python3 -c "
import subprocess, sys
ret = subprocess.run(['mx-smi', '-L'], capture_output=True, text=True)
if ret.returncode == 0:
    print('MetaX platform detected')
else:
    print('MetaX platform not detected')
    sys.exit(1)
```
23 changes: 23 additions & 0 deletions docs/user_guide_metax/quick_start.md
@@ -0,0 +1,23 @@
# MetaX Quick Start

## Introduction
This guide walks you through running a GRPO training job with verl on the MetaX platform. Make sure you have completed the [Installation Guide](./install_guidance.md) first.

## Running a Training Script

MetaX provides verl example training scripts under `/workspace/verl/examples`.

Launch the training (example for Qwen3-8B on 8 GPUs):

```bash
cd /workspace/verl/examples/grpo_trainer/
bash train_qwen3-8b-8gpus.sh

```

Training is running successfully if you see step-level progress output in the logs.

> Adjust environment variables in the script (model path, data path, batch size, etc.) to match your setup before running.

## Next Steps

- See the [FAQ](./faq.md) for troubleshooting common issues.