[Metax] Add metax support #7
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,96 @@ | ||
| # MetaX FAQ and Troubleshooting | ||
|
|
||
| ## Common Questions | ||
|
|
||
| ### Q: How do I select the MetaX platform? | ||
|
|
||
| Set the environment variable: | ||
|
|
||
| ```bash | ||
| export VERL_PLATFORM=metax | ||
| ``` | ||
|
|
||
| Or let auto-detection handle it — ensure `mx-smi` is available, and verl will detect MetaX hardware automatically. | ||
|
|
||
| ### Q: What is the difference between `metax` and `nvidia` in CUDA-compatible mode? | ||
|
|
||
| Both MetaX and NVIDIA use `torch.cuda` underneath (device name: `cuda`). The `vendor_name` distinguishes them: | ||
| - `metax` → MetaX GPU hardware (detected via `mx-smi`) | ||
| - `nvidia` → NVIDIA GPU hardware (detected via `nvidia-smi`) | ||
|
|
||
| The vendor distinction ensures the correct platform-specific engine is selected during training. | ||
|
|
||
| ### Q: How does verl distinguish MetaX from NVIDIA? | ||
|
|
||
| During auto-detection, verl runs the SMI command (`mx-smi` for MetaX, `nvidia-smi` for NVIDIA). Since both are CUDA-compatible and `torch.cuda.is_available()` returns `True` on both, the SMI check is the only reliable way to distinguish them. | ||
|
|
||
| You can bypass auto-detection by setting `VERL_PLATFORM=metax` explicitly. | ||
|
|
||
| ### Q: What communication backend does MetaX use? | ||
|
|
||
| MetaX uses **MCCL** for distributed communication. | ||
|
|
||
| ### Q: Does MetaX support FSDP and Megatron training? | ||
|
|
||
| Yes. Both FSDP and Megatron engines are registered for MetaX: | ||
|
|
||
| - **FSDP**: `FSDPMetaXEngineWithLMHead` / `FSDPMetaXEngineWithValueHead` (backend: `fsdp`, `fsdp2`) | ||
| - **Megatron**: `MegatronMetaXEngineWithLMHead` (backend: `megatron`) | ||
|
|
||
| --- | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### CUDA out of memory during training | ||
|
|
||
| - Enable parameter offload and optimizer offload in FSDP config: | ||
| ``` | ||
| actor_rollout_ref.actor.fsdp_config.param_offload=True | ||
| actor_rollout_ref.actor.fsdp_config.optimizer_offload=True | ||
| ``` | ||
| - Reduce `ppo_max_token_len_per_gpu` or `train_batch_size`. | ||
| - Lower `rollout_gpu_mem_util` to reserve more memory for training. | ||
|
|
||
|
|
||
| ### Ray device detection issues in Docker | ||
|
|
||
| If Ray cannot detect devices inside a `--privileged` Docker container, set: | ||
|
|
||
| ```bash | ||
| export RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0 | ||
| ``` | ||
|
|
||
| This disables the override that prevents device detection when the visibility environment variable resolves to zero devices. | ||
|
|
||
| ### Platform not detected during import | ||
|
|
||
| If `verl_hardware_plugin` imports but MetaX is not registered, check the logs: | ||
|
|
||
| ```bash | ||
| export VERL_LOGGING_LEVEL=DEBUG | ||
| python3 -c "import verl_hardware_plugin" | ||
| ``` | ||
|
|
||
| Look for messages like: | ||
| - `Registered platform: metax (cuda)` — Platform registered successfully | ||
| - `MetaX platform not registered: <error>` — Import failed, check the error message | ||
|
|
||
| ### Verifying MetaX GPU accessibility | ||
|
|
||
| ```bash | ||
| # Check torch CUDA access | ||
| python3 -c "import torch; print(f'CUDA devices: {torch.cuda.device_count()}')" | ||
|
|
||
| # Check mx-smi | ||
| mx-smi -L | ||
|
|
||
| # Check from within verl | ||
| python3 -c " | ||
| from verl.plugin.platform import get_platform | ||
| p = get_platform() | ||
| print(f'Platform: {p.vendor_name}') | ||
| print(f'Device: {p.device_name}') | ||
| print(f'Devices: {p.device_count()}') | ||
| print(f'Available: {p.is_available()}') | ||
| " | ||
| ``` |
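
The selection order described in the FAQ above (an explicit `VERL_PLATFORM` wins, otherwise the vendor SMI tool decides) can be sketched roughly as follows. This is only an illustration, not verl's actual implementation: `detect_vendor` is a hypothetical helper, the probe order (MetaX before NVIDIA) is an assumption, and the sketch only checks that the SMI tool is on `PATH` rather than running it.

```python
import os
import shutil


def detect_vendor(env=None, which=shutil.which):
    """Hypothetical helper: return 'metax', 'nvidia', or None.

    Not part of verl; it only mirrors the order described in the FAQ.
    """
    env = os.environ if env is None else env

    # 1. An explicit VERL_PLATFORM setting bypasses auto-detection.
    explicit = env.get("VERL_PLATFORM", "").strip().lower()
    if explicit:
        return explicit

    # 2. Otherwise look for the vendor SMI tool on PATH.
    #    Assumption: MetaX is probed first, then NVIDIA.
    for vendor, smi_command in (("metax", "mx-smi"), ("nvidia", "nvidia-smi")):
        if which(smi_command) is not None:
            return vendor

    # 3. Neither tool was found, so no vendor can be inferred.
    return None


if __name__ == "__main__":
    print(f"Detected vendor: {detect_vendor()}")
```

Passing `env` and `which` as parameters keeps the sketch testable on machines without any GPU tooling installed.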
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,108 @@ | ||||||
| ## Prerequisites | ||||||
|
|
||||||
| - MetaX GPU hardware | ||||||
| - Docker environment | ||||||
| - Network access to pull images and download models | ||||||
|
|
||||||
| ## 1. Pull the Base Image | ||||||
|
|
||||||
| Visit the MetaX Docker Hub page (e.g., https://developer.metax-tech.com/softnova/docker?chip_name=%E6%9B%A6%E4%BA%91C500%E7%B3%BB%E5%88%97&package_name=verl:0.7.1-maca.ai3.5.3.3-torch2.8-py310-ubuntu22.04-amd64), copy the `docker pull` command shown there, and run it in your terminal. | ||||||
|
|
||||||
| Start a container (e.g., verl:0.7.1-maca.ai3.5.3.3-torch2.8-py310-ubuntu22.04-amd64): | ||||||
|
|
||||||
| ```bash | ||||||
| docker_image=verl:0.7.1-maca.ai3.5.3.3-torch2.8-py310-ubuntu22.04-amd64 | ||||||
| docker_name=verl_test | ||||||
| sudo docker run -itd \ | ||||||
| --name ${docker_name} \ | ||||||
| --net=host \ | ||||||
| --uts=host \ | ||||||
| --ipc=host \ | ||||||
| --privileged=true \ | ||||||
| --group-add video \ | ||||||
| --shm-size 100gb \ | ||||||
| --ulimit memlock=-1 \ | ||||||
| --security-opt seccomp=unconfined \ | ||||||
| --security-opt apparmor=unconfined \ | ||||||
| --device=/dev/dri \ | ||||||
| --device=/dev/mxcd \ | ||||||
| --device=/dev/infiniband \ | ||||||
| ${docker_image} \ | ||||||
| /bin/bash | ||||||
|
|
||||||
| sudo docker exec -it ${docker_name} bash | ||||||
| ``` | ||||||
|
|
||||||
| > **Note:** `/dev/mxcd` is the MetaX compute device and `/dev/dri` provides GPU rendering access — both are required for MetaX GPU workloads. Ensure `mx-smi` is available inside the container for hardware auto-detection. Add `-v` mounts for your data and model directories as needed (e.g., `-v /data/share/:/data/share/`). | ||||||
|
|
||||||
| ## 2. Prepare Data and Models | ||||||
|
|
||||||
| ```bash | ||||||
| cd /workspace | ||||||
|
|
||||||
| # Download model (example: Qwen3-8B) | ||||||
| modelscope download --model Qwen/Qwen3-8B --local_dir ./Qwen3-8B | ||||||
|
|
||||||
| # Download dataset (example: GSM8K) | ||||||
| mkdir -p gsm8k && cd gsm8k | ||||||
| wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/rl/datasets/gsm8k/train.parquet" | ||||||
| wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/rl/datasets/gsm8k/test.parquet" | ||||||
| ``` | ||||||
|
|
||||||
| ## 3. Install verl and verl-hardware-plugin | ||||||
|
|
||||||
| verl is the RL training framework. For detailed installation options, see: [verl Installation Guide](https://verl.readthedocs.io/en/latest/start/install.html). | ||||||
|
|
||||||
| verl-hardware-plugin provides the MetaX hardware platform integration for verl. For detailed information, see: [verl-hardware-plugin](https://github.com/verl-project/verl-hardware-plugin). | ||||||
|
|
||||||
| ```bash | ||||||
| cd /workspace | ||||||
| git clone https://github.com/verl-project/verl verl_main | ||||||
| cd verl_main | ||||||
| pip install --no-build-isolation --no-dependencies -v -e . | ||||||
|
|
||||||
| # Install verl-hardware-plugin (cloned next to verl, not inside its working tree) | ||||||
| cd /workspace | ||||||
| git clone https://github.com/verl-project/verl-hardware-plugin.git | ||||||
| cd verl-hardware-plugin | ||||||
| pip install --no-build-isolation -v -e . | ||||||
| ``` | ||||||
|
|
||||||
| ## 4. Platform Configuration | ||||||
|
|
||||||
| Set the MetaX platform before launching training: | ||||||
|
|
||||||
| ```bash | ||||||
| export VERL_PLATFORM=metax | ||||||
| ``` | ||||||
|
|
||||||
| Or let auto-detection handle it (requires `mx-smi` in PATH inside the container). | ||||||
|
|
||||||
| ## Verification | ||||||
|
|
||||||
| After installation, verify the components are properly installed: | ||||||
|
|
||||||
| ```bash | ||||||
| python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')" | ||||||
| python3 -c "import vllm; print('vLLM OK')" | ||||||
| python3 -c "import ray; ray.init(address='auto'); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()" | ||||||
|
Contributor
Execution Error: Running
Suggested change
|
||||||
| python3 -c "import transformer_engine; print('TransformerEngine OK')" | ||||||
| python3 -c "import megatron.core; print('Megatron-LM OK')" | ||||||
| python3 -c "import verl; print('verl OK')" | ||||||
| python3 -c "from verl.plugin.platform import get_platform;p = get_platform();print(f'device: {p.device_name}');print(f'vendor: {p.vendor_name}');print(f'available: {p.is_available()}')" | ||||||
| ``` | ||||||
|
|
||||||
| Verify MetaX hardware detection: | ||||||
|
|
||||||
| ```bash | ||||||
| mx-smi -L # List MetaX GPUs | ||||||
| python3 -c " | ||||||
| import subprocess, sys | ||||||
| ret = subprocess.run(['mx-smi', '-L'], capture_output=True, text=True) | ||||||
| if ret.returncode == 0: | ||||||
|     print('MetaX platform detected') | ||||||
| else: | ||||||
|     print('MetaX platform not detected') | ||||||
|     sys.exit(1) | ||||||
| " | ||||||
| ``` | ||||||
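
The one-line import checks in the Verification section above can also be gathered into a single report. The sketch below is a hedged illustration: `check_modules` is a hypothetical helper, and it only tests whether each package can be located with `importlib.util.find_spec`, which is weaker than actually importing it (a package that is present but broken still shows as `OK`).

```python
import importlib.util


def check_modules(names):
    """Hypothetical helper: map each module name to True if it can be located."""
    status = {}
    for name in names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # A missing parent package (e.g. 'megatron' for 'megatron.core')
            # raises ModuleNotFoundError, a subclass of ImportError.
            status[name] = False
    return status


if __name__ == "__main__":
    # Module names taken from the verification commands in the guide.
    wanted = ["torch", "vllm", "ray", "transformer_engine", "megatron.core", "verl"]
    for module, found in check_modules(wanted).items():
        print(f"{module}: {'OK' if found else 'MISSING'}")
```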
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,151 @@ | ||
| # MetaX Quick Start | ||
|
|
||
| ## Introduction | ||
| This guide walks you through running a GRPO training job with verl on the MetaX platform. Make sure you have completed the [Installation Guide](./install_guidance.md) first. | ||
|
|
||
| ## Running a Training Script | ||
|
|
||
| Save the following as `run_qwen3_0.6b_grpo_gsm8k_metax.sh`. You must adjust `DATA_DIR` and `MODEL_DIR` to your local paths. | ||
|
|
||
| ```bash | ||
| #!/usr/bin/env bash | ||
| # GRPO | Qwen3-0.6B | FSDP training | MetaX GPUs | ||
|
|
||
| set -xeuo pipefail | ||
|
|
||
| export PYTORCH_ENABLE_SAME_RANK_A100=1 | ||
| export SET_DEVICE_NUMA_PREFERRED=1 | ||
| export HYDRA_FULL_ERROR=1 | ||
| export CUDA_DEVICE_MAX_CONNECTIONS=1 | ||
| export MCPYTORCH_DISABLE_PRINT=1 | ||
| export MAX_JOBS=20 | ||
| unset PAGEABLE_MEMCPY_ASYNC | ||
| unset PYTORCH_CUDA_ALLOC_CONF | ||
|
|
||
| export MCCL_MAX_NCHANNELS=16 | ||
| export PYTHONUNBUFFERED=1 | ||
| export VERL_PLATFORM=metax | ||
| export MACA_MPS_MODE=1 | ||
|
|
||
| ########################### user-adjustable ########################### | ||
| DEVICE=${DEVICE:-gpu} | ||
| INFER_BACKEND=${INFER_BACKEND:-vllm} | ||
|
|
||
| NNODES=${NNODES:-1} | ||
| NGPUS_PER_NODE=${NGPUS_PER_NODE:-8} | ||
|
|
||
| train_batch_size=${TRAIN_BATCH_SIZE:-64} | ||
| ppo_mini_batch_size=${PPO_MINI_BATCH_SIZE:-16} | ||
| max_prompt_length=${MAX_PROMPT_LENGTH:-1024} | ||
| max_response_length=${MAX_RESPONSE_LENGTH:-1024} | ||
| ppo_max_token_len_per_gpu=${PPO_MAX_TOKEN_LEN_PER_GPU:-24576} | ||
|
|
||
| actor_lr=${ACTOR_LR:-1e-6} | ||
| kl_loss_coef=${KL_LOSS_COEF:-0.001} | ||
| entropy_coeff=${ENTROPY_COEFF:-0} | ||
|
|
||
| rollout_tp=${ROLLOUT_TP:-2} | ||
| rollout_gpu_mem_util=${ROLLOUT_GPU_MEM_UTIL:-0.3} | ||
| rollout_n=${ROLLOUT_N:-5} | ||
|
|
||
| total_epochs=${TOTAL_EPOCHS:-15} | ||
| save_freq=${SAVE_FREQ:-20} | ||
| test_freq=${TEST_FREQ:-5} | ||
|
|
||
| PROJECT_NAME=${PROJECT_NAME:-verl_grpo_gsm8k_math} | ||
| EXPERIMENT_NAME=${EXPERIMENT_NAME:-qwen3_0.6b_grpo_${INFER_BACKEND}_fsdp_$(date +%Y%m%d_%H%M)} | ||
| ########################### end user-adjustable ########################### | ||
|
|
||
| ########################### parameter arrays ########################### | ||
| # Modify these paths to your actual data/model locations | ||
| DATA_DIR=/data1/dh/gsm8k | ||
| MODEL_DIR=/data1/dh/Qwen3-0.6B | ||
|
|
||
| n_trainer_devices=$NGPUS_PER_NODE | ||
|
|
||
| DATA=( | ||
| algorithm.adv_estimator=grpo | ||
| algorithm.use_kl_in_reward=False | ||
| data.train_files="['$DATA_DIR/train.parquet']" | ||
| data.val_files="['$DATA_DIR/test.parquet']" | ||
| data.train_batch_size=${train_batch_size} | ||
| data.max_prompt_length=${max_prompt_length} | ||
| data.max_response_length=${max_response_length} | ||
| data.filter_overlong_prompts=True | ||
| data.truncation='error' | ||
| ) | ||
|
|
||
| MODEL=( | ||
| actor_rollout_ref.model.path="$MODEL_DIR" | ||
| actor_rollout_ref.model.use_remove_padding=True | ||
| actor_rollout_ref.model.enable_gradient_checkpointing=True | ||
| ) | ||
|
|
||
| ACTOR=( | ||
| actor_rollout_ref.actor.optim.lr=${actor_lr} | ||
| actor_rollout_ref.actor.ppo_mini_batch_size=${ppo_mini_batch_size} | ||
| actor_rollout_ref.actor.use_dynamic_bsz=True | ||
| actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${ppo_max_token_len_per_gpu} | ||
| actor_rollout_ref.actor.use_kl_loss=True | ||
| actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} | ||
| actor_rollout_ref.actor.kl_loss_type=low_var_kl | ||
| actor_rollout_ref.actor.entropy_coeff=${entropy_coeff} | ||
| actor_rollout_ref.actor.fsdp_config.param_offload=True | ||
| actor_rollout_ref.actor.fsdp_config.optimizer_offload=True | ||
| ) | ||
|
|
||
| ROLLOUT=( | ||
| actor_rollout_ref.rollout.name=${INFER_BACKEND} | ||
| actor_rollout_ref.rollout.tensor_model_parallel_size=${rollout_tp} | ||
| actor_rollout_ref.rollout.gpu_memory_utilization=${rollout_gpu_mem_util} | ||
| actor_rollout_ref.rollout.n=${rollout_n} | ||
| actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=True | ||
| actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${ppo_max_token_len_per_gpu} | ||
| +actor_rollout_ref.rollout.enable_sleep_mode=False | ||
| actor_rollout_ref.rollout.free_cache_engine=False | ||
| ) | ||
|
|
||
| REF=( | ||
| actor_rollout_ref.ref.log_prob_use_dynamic_bsz=True | ||
| actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${ppo_max_token_len_per_gpu} | ||
| actor_rollout_ref.ref.fsdp_config.param_offload=True | ||
| ) | ||
|
|
||
| TRAINER=( | ||
| trainer.balance_batch=True | ||
| trainer.logger="['console','swanlab']" | ||
| trainer.project_name=${PROJECT_NAME} | ||
| trainer.experiment_name=${EXPERIMENT_NAME} | ||
| trainer.n_gpus_per_node=${n_trainer_devices} | ||
| trainer.nnodes=${NNODES} | ||
| trainer.save_freq=${save_freq} | ||
| trainer.test_freq=${test_freq} | ||
| trainer.total_epochs=${total_epochs} | ||
| ) | ||
|
|
||
| HYDRA_FULL_ERROR=1 | ||
| ########################### launch ########################### | ||
| python3 -m verl.trainer.main_ppo \ | ||
| "${DATA[@]}" \ | ||
| "${MODEL[@]}" \ | ||
| "${ACTOR[@]}" \ | ||
| "${ROLLOUT[@]}" \ | ||
| "${REF[@]}" \ | ||
| "${TRAINER[@]}" \ | ||
| "$@" \ | ||
| 2>&1 | tee -a "verl_demo.log" | ||
|
|
||
| ``` | ||
| Launch the training: | ||
|
|
||
| ```bash | ||
| bash run_qwen3_0.6b_grpo_gsm8k_metax.sh | ||
| ``` | ||
|
|
||
| Training is running successfully if you see step-level progress output in the logs. | ||
|
|
||
| **Training Log**: [verl_grpo_gsm8k_metax](https://swanlab.cn/@harward/verl_grpo_gsm8k_metax/runs/9xnvjmby/chart) | ||
|
|
||
| ## Next Steps | ||
|
|
||
| - See the [FAQ](./faq.md) for troubleshooting common issues. |
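
To get a feel for the batch-related defaults in the training script above, the following sketch works through the arithmetic. It rests on two assumptions that are not stated in the guide: that each prompt is sampled `rollout_n` times per step, and that dynamic batching packs sequences into a micro-batch until `ppo_max_token_len_per_gpu` tokens are reached. `batch_budget` is a hypothetical helper, not part of verl.

```python
def batch_budget(
    train_batch_size=64,
    rollout_n=5,
    max_prompt_length=1024,
    max_response_length=1024,
    ppo_max_token_len_per_gpu=24576,
):
    """Hypothetical helper: rough per-step sizes from the script defaults."""
    # Assumption: every prompt in the batch is sampled rollout_n times.
    sequences_per_step = train_batch_size * rollout_n

    # Upper bound on the length of one prompt + response pair.
    max_tokens_per_sequence = max_prompt_length + max_response_length

    # Assumption: the token budget caps how many full-length sequences
    # fit into one micro-batch on a single GPU.
    full_length_sequences_per_gpu = ppo_max_token_len_per_gpu // max_tokens_per_sequence

    return {
        "sequences_per_step": sequences_per_step,
        "max_tokens_per_sequence": max_tokens_per_sequence,
        "full_length_sequences_per_gpu": full_length_sequences_per_gpu,
    }


if __name__ == "__main__":
    for key, value in batch_budget().items():
        print(f"{key}: {value}")
```

Under these assumptions, lowering `ppo_max_token_len_per_gpu` or `train_batch_size`, as the FAQ suggests for out-of-memory errors, shrinks the per-GPU micro-batch or the number of sequences per step respectively.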
There are two issues on this line:
- It names `MCCL` as the communication backend for MetaX, but both the FAQ (`docs/user_guide_metax/faq.md` line 31) and the platform implementation (`platform_cuda_metax.py` line 143) state that MetaX uses standard `NCCL`.
- `[User Guide](docs/user_guide_metax/README.md)` points to a non-existent `README.md` file in the `docs/user_guide_metax/` directory. It should point to `docs/user_guide_metax/install_guidance.md` instead.