diff --git a/README.md b/README.md
index ec739e0..4d954bf 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ The platforms and engines in this repository are **reference implementations**
 | FlagOS | NVIDIA GPU (verified) | FlagCX / NCCL | ✅ Supported | [User Guide](docs/user_guide_flagos/nvidia/README.md) |
 | Intel XPU | Data Center GPU Max / Arc | xccl (oneCCL) | ✅ Example (requires vendor support) | TBD |
 | Cambricon MLU | MLU370 / MLU590 | CNCL | ✅ Supported | [User Guide](docs/user_guide_mlu/README.md) |
-| MetaX | MetaX GPUs (CUDA-compatible) | NCCL | ✅ Example (requires vendor support) | TBD |
+| MetaX | MetaX GPUs (CUDA-compatible) | MCCL | ✅ Supported | [User Guide](docs/user_guide_metax/README.md) |
 | Huawei NPU | Ascend 910B | HCCL | Built-in (verl core) | [Ascend Tutorial](https://github.com/verl-project/verl/tree/main/docs/ascend_tutorial) |
diff --git a/docs/user_guide_metax/README.md b/docs/user_guide_metax/README.md
index 6ee1f30..b15ea71 100644
--- a/docs/user_guide_metax/README.md
+++ b/docs/user_guide_metax/README.md
@@ -28,7 +28,7 @@ user_guide_metax/
 |------|-------------|
 | Device type | `cuda` (CUDA-compatible) |
 | Vendor identifier | `metax` |
-| Communication backend | `nccl` |
+| Communication backend | `mccl` |
 | Device visibility env var | `CUDA_VISIBLE_DEVICES` |
 | Ray resource name | `GPU` |
 | IPC support | Yes |
diff --git a/docs/user_guide_metax/faq.md b/docs/user_guide_metax/faq.md
new file mode 100644
index 0000000..b4f3363
--- /dev/null
+++ b/docs/user_guide_metax/faq.md
@@ -0,0 +1,96 @@
+# MetaX FAQ and Troubleshooting
+
+## Common Questions
+
+### Q: How do I select the MetaX platform?
+
+Set the environment variable:
+
+```bash
+export VERL_PLATFORM=metax
+```
+
+Or let auto-detection handle it: ensure `mx-smi` is available, and verl will detect MetaX hardware automatically.
+
+### Q: What is the difference between `metax` and `nvidia` in CUDA-compatible mode?
+
+Both MetaX and NVIDIA use `torch.cuda` underneath (device name: `cuda`). The `vendor_name` distinguishes them:
+- `metax` → MetaX GPU hardware (detected via `mx-smi`)
+- `nvidia` → NVIDIA GPU hardware (detected via `nvidia-smi`)
+
+The vendor distinction ensures the correct platform-specific engine is selected during training.
+
+### Q: How does verl distinguish MetaX from NVIDIA?
+
+During auto-detection, verl runs the SMI command (`mx-smi` for MetaX, `nvidia-smi` for NVIDIA). Since both are CUDA-compatible and `torch.cuda.is_available()` returns `True` on both, the SMI check is the only reliable way to distinguish them.
+
+You can bypass auto-detection by setting `VERL_PLATFORM=metax` explicitly.
+
+### Q: What communication backend does MetaX use?
+
+MetaX uses **MCCL** for distributed communication.
+
+### Q: Does MetaX support FSDP and Megatron training?
+
+Yes. Both FSDP and Megatron engines are registered for MetaX:
+
+- **FSDP**: `FSDPMetaXEngineWithLMHead` / `FSDPMetaXEngineWithValueHead` (backend: `fsdp`, `fsdp2`)
+- **Megatron**: `MegatronMetaXEngineWithLMHead` (backend: `megatron`)
+
+---
+
+## Troubleshooting
+
+### CUDA out of memory during training
+
+- Enable parameter offload and optimizer offload in FSDP config:
+  ```
+  actor_rollout_ref.actor.fsdp_config.param_offload=True
+  actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
+  ```
+- Reduce `ppo_max_token_len_per_gpu` or `train_batch_size`.
+- Lower `rollout_gpu_mem_util` to reserve more memory for training.
+
+
+### Ray device detection issues in Docker
+
+If Ray cannot detect devices inside a `--privileged` Docker container:
+
+```bash
+export RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
+```
+
+This disables the override that prevents device detection when the visibility environment variable resolves to zero devices.
+
+### Platform not detected during import
+
+If `verl_hardware_plugin` imports but MetaX is not registered, check the logs:
+
+```bash
+export VERL_LOGGING_LEVEL=DEBUG
+python3 -c "import verl_hardware_plugin"
+```
+
+Look for messages like:
+- `Registered platform: metax (cuda)` means the platform registered successfully
+- `MetaX platform not registered: ` means the import failed; check the error message that follows
+
+### Verifying MetaX GPU accessibility
+
+```bash
+# Check torch CUDA access
+python3 -c "import torch; print(f'CUDA devices: {torch.cuda.device_count()}')"
+
+# Check mx-smi
+mx-smi -L
+
+# Check from within verl
+python3 -c "
+from verl.plugin.platform import get_platform
+p = get_platform()
+print(f'Platform: {p.vendor_name}')
+print(f'Device: {p.device_name}')
+print(f'Devices: {p.device_count()}')
+print(f'Available: {p.is_available()}')
+"
+```
diff --git a/docs/user_guide_metax/install_guidance.md b/docs/user_guide_metax/install_guidance.md
new file mode 100644
index 0000000..904aec3
--- /dev/null
+++ b/docs/user_guide_metax/install_guidance.md
@@ -0,0 +1,108 @@
+## Prerequisites
+
+- MetaX GPU hardware
+- Docker environment
+- Network access to pull images and download models
+
+## 1. Pull the Base Image
+
+Visit the MetaX Docker Hub page (e.g., https://developer.metax-tech.com/softnova/docker?chip_name=%E6%9B%A6%E4%BA%91C500%E7%B3%BB%E5%88%97&package_name=verl:0.7.1-maca.ai3.5.3.3-torch2.8-py310-ubuntu22.04-amd64).
+Copy the `docker pull` command and run it in your terminal.
+
+Start a container (e.g., verl:0.7.1-maca.ai3.5.3.3-torch2.8-py310-ubuntu22.04-amd64):
+
+```bash
+docker_image=verl:0.7.1-maca.ai3.5.3.3-torch2.8-py310-ubuntu22.04-amd64
+docker_name=verl_test
+sudo docker run -itd \
+    --name ${docker_name} \
+    --net=host \
+    --uts=host \
+    --ipc=host \
+    --privileged=true \
+    --group-add video \
+    --shm-size 100gb \
+    --ulimit memlock=-1 \
+    --security-opt seccomp=unconfined \
+    --security-opt apparmor=unconfined \
+    --device=/dev/dri \
+    --device=/dev/mxcd \
+    --device=/dev/infiniband \
+    ${docker_image} \
+    /bin/bash
+
+docker exec -it verl_test bash
+```
+
+> **Note:** `/dev/mxcd` is the MetaX compute device and `/dev/dri` provides GPU rendering access; both are required for MetaX GPU workloads. Ensure `mx-smi` is available inside the container for hardware auto-detection. Add `-v` mounts for your data and model directories as needed (e.g., `-v /data/share/:/data/share/`).
+
+## 2. Prepare Data and Models
+
+```bash
+cd /workspace
+
+# Download model (example: Qwen3-8B)
+modelscope download --model Qwen/Qwen3-8B --local_dir ./Qwen3-8B
+
+# Download dataset (example: GSM8K)
+mkdir gsm8k && cd gsm8k
+wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/rl/datasets/gsm8k/train.parquet"
+wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/rl/datasets/gsm8k/test.parquet"
+```
+
+## 3. Install verl and verl-hardware-plugin
+
+verl is the RL training framework. For detailed installation options, see: [verl Installation Guide](https://verl.readthedocs.io/en/latest/start/install.html).
+
+verl-hardware-plugin provides the MetaX hardware platform integration for verl. For detailed information, see: [verl-hardware-plugin](https://github.com/verl-project/verl-hardware-plugin).
+
+```bash
+cd /workspace
+git clone https://github.com/verl-project/verl verl_main
+cd verl_main
+pip install --no-build-isolation --no-dependencies -v -e .
+
+# Install verl-hardware-plugin
+git clone https://github.com/verl-project/verl-hardware-plugin.git
+cd verl-hardware-plugin
+pip install --no-build-isolation -v -e .
+```
+
+## 4. Platform Configuration
+
+Set the MetaX platform before launching training:
+
+```bash
+export VERL_PLATFORM=metax
+```
+
+Or let auto-detection handle it (requires `mx-smi` in PATH inside the container).
+
+## Verification
+
+After installation, verify the components are properly installed:
+
+```bash
+python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
+python3 -c "import vllm; print('vLLM OK')"
+python3 -c "import ray; ray.init(address='auto'); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"
+python3 -c "import transformer_engine; print('TransformerEngine OK')"
+python3 -c "import megatron.core; print('Megatron-LM OK')"
+python3 -c "import verl; print('verl OK')"
+python3 -c "from verl.plugin.platform import get_platform;p = get_platform();print(f'device: {p.device_name}');print(f'vendor: {p.vendor_name}');print(f'available: {p.is_available()}')"
+```
+
+Verify MetaX hardware detection:
+
+```bash
+mx-smi -L # List MetaX GPUs
+python3 -c "
+import subprocess, sys
+ret = subprocess.run(['mx-smi', '-L'], capture_output=True, text=True)
+if ret.returncode == 0:
+    print('MetaX platform detected')
+else:
+    print('MetaX platform not detected')
+    sys.exit(1)
+"
+```
diff --git a/docs/user_guide_metax/quick_start.md b/docs/user_guide_metax/quick_start.md
new file mode 100644
index 0000000..c6b3338
--- /dev/null
+++ b/docs/user_guide_metax/quick_start.md
@@ -0,0 +1,151 @@
+# MetaX Quick Start
+
+## Introduction
+This guide walks you through running a GRPO training job with verl on the MetaX platform. Make sure you have completed the [Installation Guide](./install_guidance.md) first.
+
+## Running a Training Script
+
+Save the following as `run_qwen3_0.6b_grpo_gsm8k_metax.sh`. You must adjust `DATA_DIR` and `MODEL_DIR` to your local paths.
+
+```bash
+#!/usr/bin/env bash
+# GRPO | Qwen3-0.6B | FSDP training | MetaX GPUs
+
+set -xeuo pipefail
+
+export PYTORCH_ENABLE_SAME_RANK_A100=1
+export SET_DEVICE_NUMA_PREFERRED=1
+export HYDRA_FULL_ERROR=1
+export CUDA_DEVICE_MAX_CONNECTIONS=1
+export MCPYTORCH_DISABLE_PRINT=1
+export MAX_JOBS=20
+unset PAGEABLE_MEMCPY_ASYNC
+unset PYTORCH_CUDA_ALLOC_CONF
+
+export MCCL_MAX_NCHANNELS=16
+export PYTHONUNBUFFERED=1
+export VERL_PLATFORM=metax
+export MACA_MPS_MODE=1
+
+########################### user-adjustable ###########################
+DEVICE=${DEVICE:-gpu}
+INFER_BACKEND=${INFER_BACKEND:-vllm}
+
+NNODES=${NNODES:-1}
+NGPUS_PER_NODE=${NGPUS_PER_NODE:-8}
+
+train_batch_size=${TRAIN_BATCH_SIZE:-64}
+ppo_mini_batch_size=${PPO_MINI_BATCH_SIZE:-16}
+max_prompt_length=${MAX_PROMPT_LENGTH:-1024}
+max_response_length=${MAX_RESPONSE_LENGTH:-1024}
+ppo_max_token_len_per_gpu=${PPO_MAX_TOKEN_LEN_PER_GPU:-24576}
+
+actor_lr=${ACTOR_LR:-1e-6}
+kl_loss_coef=${KL_LOSS_COEF:-0.001}
+entropy_coeff=${ENTROPY_COEFF:-0}
+
+rollout_tp=${ROLLOUT_TP:-2}
+rollout_gpu_mem_util=${ROLLOUT_GPU_MEM_UTIL:-0.3}
+rollout_n=${ROLLOUT_N:-5}
+
+total_epochs=${TOTAL_EPOCHS:-15}
+save_freq=${SAVE_FREQ:-20}
+test_freq=${TEST_FREQ:-5}
+
+PROJECT_NAME=${PROJECT_NAME:-verl_grpo_gsm8k_math}
+EXPERIMENT_NAME=${EXPERIMENT_NAME:-qwen3_0.6b_grpo_${INFER_BACKEND}_fsdp_$(date +%Y%m%d_%H%M)}
+########################### end user-adjustable ###########################
+
+########################### parameter arrays ###########################
+# Modify these paths to your actual data/model locations
+DATA_DIR=/data1/dh/gsm8k
+MODEL_DIR=/data1/dh/Qwen3-0.6B
+
+n_trainer_devices=$NGPUS_PER_NODE
+
+DATA=(
+    algorithm.adv_estimator=grpo
+    algorithm.use_kl_in_reward=False
+    data.train_files="['$DATA_DIR/train.parquet']"
+    data.val_files="['$DATA_DIR/test.parquet']"
+    data.train_batch_size=${train_batch_size}
+    data.max_prompt_length=${max_prompt_length}
+    data.max_response_length=${max_response_length}
+    data.filter_overlong_prompts=True
+    data.truncation='error'
+)
+
+MODEL=(
+    actor_rollout_ref.model.path="$MODEL_DIR"
+    actor_rollout_ref.model.use_remove_padding=True
+    actor_rollout_ref.model.enable_gradient_checkpointing=True
+)
+
+ACTOR=(
+    actor_rollout_ref.actor.optim.lr=${actor_lr}
+    actor_rollout_ref.actor.ppo_mini_batch_size=${ppo_mini_batch_size}
+    actor_rollout_ref.actor.use_dynamic_bsz=True
+    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${ppo_max_token_len_per_gpu}
+    actor_rollout_ref.actor.use_kl_loss=True
+    actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef}
+    actor_rollout_ref.actor.kl_loss_type=low_var_kl
+    actor_rollout_ref.actor.entropy_coeff=${entropy_coeff}
+    actor_rollout_ref.actor.fsdp_config.param_offload=True
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
+)
+
+ROLLOUT=(
+    actor_rollout_ref.rollout.name=${INFER_BACKEND}
+    actor_rollout_ref.rollout.tensor_model_parallel_size=${rollout_tp}
+    actor_rollout_ref.rollout.gpu_memory_utilization=${rollout_gpu_mem_util}
+    actor_rollout_ref.rollout.n=${rollout_n}
+    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=True
+    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${ppo_max_token_len_per_gpu}
+    +actor_rollout_ref.rollout.enable_sleep_mode=False
+    actor_rollout_ref.rollout.free_cache_engine=False
+)
+
+REF=(
+    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=True
+    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${ppo_max_token_len_per_gpu}
+    actor_rollout_ref.ref.fsdp_config.param_offload=True
+)
+
+TRAINER=(
+    trainer.balance_batch=True
+    trainer.logger=['console','swanlab']
+    trainer.project_name=${PROJECT_NAME}
+    trainer.experiment_name=${EXPERIMENT_NAME}
+    trainer.n_gpus_per_node=${n_trainer_devices}
+    trainer.nnodes=${NNODES}
+    trainer.save_freq=${save_freq}
+    trainer.test_freq=${test_freq}
+    trainer.total_epochs=${total_epochs}
+)
+
+HYDRA_FULL_ERROR=1
+########################### launch ###########################
+python3 -m verl.trainer.main_ppo \
+    "${DATA[@]}" \
+    "${MODEL[@]}" \
+    "${ACTOR[@]}" \
+    "${ROLLOUT[@]}" \
+    "${REF[@]}" \
+    "${TRAINER[@]}" \
+    "$@" \
+    2>&1 | tee -a "verl_demo.log"
+
+```
+Launch the training:
+
+```bash
+bash run_qwen3_0.6b_grpo_gsm8k_metax.sh
+```
+
+Training is running successfully if you see step-level progress output in the logs.
+
+**Training Log**: [verl_grpo_gsm8k_metax](https://swanlab.cn/@harward/verl_grpo_gsm8k_metax/runs/9xnvjmby/chart)
+
+## Next Steps
+
+- See the [FAQ](./faq.md) for troubleshooting common issues.
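
Note on the detection path documented in `faq.md`: the FAQ says that `VERL_PLATFORM` bypasses auto-detection and that verl otherwise tells MetaX from NVIDIA by running the vendor SMI command, since `torch.cuda.is_available()` is `True` on both. The following is a minimal Python sketch of that decision order, not the plugin's actual code. The names `detect_vendor` and `SMI_TOOLS`, the injectable `which`/`run` parameters, the probe order (MetaX first), and the use of `-L` for `nvidia-smi` as well as `mx-smi` are all assumptions made for illustration.

```python
import os
import shutil
import subprocess

# Hypothetical vendor -> SMI tool table, following the pairing given in faq.md.
SMI_TOOLS = (("metax", "mx-smi"), ("nvidia", "nvidia-smi"))


def detect_vendor(env=None, which=shutil.which, run=subprocess.run):
    """Return the override value, 'metax', 'nvidia', or None.

    Illustrative sketch only; the real plugin may order or gate checks differently.
    """
    env = os.environ if env is None else env

    # An explicit VERL_PLATFORM setting bypasses auto-detection entirely.
    override = env.get("VERL_PLATFORM", "").strip().lower()
    if override:
        return override

    # Probe each vendor's SMI tool in turn. torch.cuda cannot separate the
    # vendors because both expose the `cuda` device type.
    for vendor, tool in SMI_TOOLS:
        if which(tool) is None:
            continue  # tool is not on PATH
        try:
            result = run([tool, "-L"], capture_output=True, text=True, timeout=10)
        except (OSError, subprocess.SubprocessError):
            continue  # tool exists but could not be executed or timed out
        if result.returncode == 0:
            return vendor
    return None
```

Calling `detect_vendor()` with no arguments reads the real environment and `PATH`; passing `env`, `which`, or `run` lets the logic be exercised on a machine without any accelerator, which is the main reason the sketch takes them as parameters.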