[Metax] Add metax support #7
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,96 @@ | ||
| # MetaX FAQ and Troubleshooting | ||
|
|
||
| ## Common Questions | ||
|
|
||
| ### Q: How do I select the MetaX platform? | ||
|
|
||
| Set the environment variable: | ||
|
|
||
| ```bash | ||
| export VERL_PLATFORM=metax | ||
| ``` | ||
|
|
||
| Or let auto-detection handle it — ensure `mx-smi` is available, and verl will detect MetaX hardware automatically. | ||
|
|
||
| ### Q: What is the difference between `metax` and `nvidia` in CUDA-compatible mode? | ||
|
|
||
| Both MetaX and NVIDIA use `torch.cuda` underneath (device name: `cuda`). The `vendor_name` distinguishes them: | ||
| - `metax` → MetaX GPU hardware (detected via `mx-smi`) | ||
| - `nvidia` → NVIDIA GPU hardware (detected via `nvidia-smi`) | ||
|
|
||
| The vendor distinction ensures the correct platform-specific engine is selected during training. | ||
|
|
||
| ### Q: How does verl distinguish MetaX from NVIDIA? | ||
|
|
||
| During auto-detection, verl runs the SMI command (`mx-smi` for MetaX, `nvidia-smi` for NVIDIA). Since both are CUDA-compatible and `torch.cuda.is_available()` returns `True` on both, the SMI check is the only reliable way to distinguish them. | ||
|
|
||
| You can bypass auto-detection by setting `VERL_PLATFORM=metax` explicitly. | ||
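The detection order described above can be sketched roughly as follows. This is a minimal illustration, not verl's actual implementation: `smi_works` and `detect_vendor` are hypothetical helper names, and the probe order (MetaX first, then NVIDIA) is an assumption. Only `VERL_PLATFORM`, `mx-smi`, and `nvidia-smi` come from the FAQ text.

```python
import os
import shutil
import subprocess


def smi_works(cmd):
    """Return True if `cmd` is on PATH and `cmd -L` exits with status 0."""
    if shutil.which(cmd) is None:
        return False
    try:
        result = subprocess.run([cmd, "-L"], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0


def detect_vendor(env=None, probe=smi_works):
    """Hypothetical helper: honor the explicit override, then fall back to SMI probing."""
    env = os.environ if env is None else env
    override = env.get("VERL_PLATFORM", "").strip().lower()
    if override:
        return override
    # Assumed probe order. torch.cuda.is_available() is True on both vendors,
    # so it cannot be used to tell them apart.
    for vendor, cmd in (("metax", "mx-smi"), ("nvidia", "nvidia-smi")):
        if probe(cmd):
            return vendor
    return None


if __name__ == "__main__":
    print(detect_vendor())
```

Passing `probe` as a parameter keeps the sketch testable on machines without either SMI tool installed.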
|
|
||
| ### Q: What communication backend does MetaX use? | ||
|
|
||
| MetaX uses standard **NCCL** for distributed communication. No special communication library is required. | ||
|
|
||
| ### Q: Does MetaX support FSDP and Megatron training? | ||
|
|
||
| Yes. Both FSDP and Megatron engines are registered for MetaX: | ||
|
|
||
| - **FSDP**: `FSDPMetaXEngineWithLMHead` / `FSDPMetaXEngineWithValueHead` (backend: `fsdp`, `fsdp2`) | ||
| - **Megatron**: `MegatronMetaXEngineWithLMHead` (backend: `megatron`) | ||
|
|
||
| --- | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### CUDA out of memory during training | ||
|
|
||
| - Enable parameter offload and optimizer offload in FSDP config: | ||
| ``` | ||
| actor_rollout_ref.actor.fsdp_config.param_offload=True | ||
| actor_rollout_ref.actor.fsdp_config.optimizer_offload=True | ||
| ``` | ||
| - Reduce `ppo_max_token_len_per_gpu` or `train_batch_size`. | ||
| - Lower `rollout_gpu_mem_util` to reserve more memory for training. | ||
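If the training command is assembled from a launcher script, the two offload settings above can be generated as command-line overrides. The sketch below is illustrative only: `fsdp_offload_overrides` is a hypothetical helper, and it assumes the training entry point accepts `key=value` overrides in the form shown in the config snippet.

```python
def fsdp_offload_overrides(param_offload=True, optimizer_offload=True):
    """Build the two FSDP offload overrides named in the troubleshooting entry."""
    prefix = "actor_rollout_ref.actor.fsdp_config"
    return [
        f"{prefix}.param_offload={bool(param_offload)}",
        f"{prefix}.optimizer_offload={bool(optimizer_offload)}",
    ]


if __name__ == "__main__":
    # Append the printed overrides to the existing training command.
    print(" ".join(fsdp_offload_overrides()))
```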
|
|
||
|
|
||
| ### Ray device detection issues in Docker | ||
|
|
||
| If Ray cannot detect devices inside a `--privileged` Docker container: | ||
|
|
||
| ```bash | ||
| export RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0 | ||
| ``` | ||
|
|
||
| This disables the override that prevents device detection when the visibility environment variable resolves to zero devices. | ||
|
|
||
| ### Platform not detected during import | ||
|
|
||
| If `verl_hardware_plugin` imports but MetaX is not registered, check the logs: | ||
|
|
||
| ```bash | ||
| export VERL_LOGGING_LEVEL=DEBUG | ||
| python3 -c "import verl_hardware_plugin" | ||
| ``` | ||
|
|
||
| Look for messages like: | ||
| - `Registered platform: metax (cuda)` — Platform registered successfully | ||
| - `MetaX platform not registered: <error>` — Import failed, check the error message | ||
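When the debug output is long, the two messages above can be picked out with a small script. This is a sketch under the assumption that the log lines contain exactly the phrases quoted in this section; `summarize_plugin_log` is a hypothetical helper, not part of verl or the plugin.

```python
import re

# Patterns mirror the two log messages quoted in the troubleshooting entry.
REGISTERED = re.compile(r"Registered platform: (\w+) \((\w+)\)")
FAILED = re.compile(r"MetaX platform not registered: (.+)")


def summarize_plugin_log(log_text):
    """Return registered platforms (vendor -> device) and the MetaX error, if any."""
    failure = FAILED.search(log_text)
    return {
        "registered": dict(REGISTERED.findall(log_text)),
        "metax_error": failure.group(1).strip() if failure else None,
    }


if __name__ == "__main__":
    sample = "DEBUG Registered platform: metax (cuda)\n"
    print(summarize_plugin_log(sample))
```

To feed it real output, capture both streams from the import command (for example by redirecting with `2>&1`), since which stream the plugin logs to is not stated here.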
|
|
||
| ### Verifying MetaX GPU accessibility | ||
|
|
||
| ```bash | ||
| # Check torch CUDA access | ||
| python3 -c "import torch; print(f'CUDA devices: {torch.cuda.device_count()}')" | ||
|
|
||
| # Check mx-smi | ||
| mx-smi -L | ||
|
|
||
| # Check from within verl | ||
| python3 -c " | ||
| from verl.plugin.platform import get_platform | ||
| p = get_platform() | ||
| print(f'Platform: {p.vendor_name}') | ||
| print(f'Device: {p.device_name}') | ||
| print(f'Devices: {p.device_count()}') | ||
| print(f'Available: {p.is_available()}') | ||
| " | ||
| ``` |
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,107 @@ | ||||||
| ## Prerequisites | ||||||
|
|
||||||
| - MetaX GPU hardware | ||||||
| - Docker environment | ||||||
| - Network access to pull images and download models | ||||||
|
|
||||||
| ## 1. Pull the Base Image | ||||||
|
|
||||||
| Download the Docker image from https://developer.metax-tech.com/, or contact MetaX engineers to obtain it. | ||||||
|
Collaborator
At least provide one usable image name.
||||||
|
|
||||||
| Start a container (MetaX example): | ||||||
|
|
||||||
| ```bash | ||||||
| docker_image=<metaX-base-image> | ||||||
| docker_name=verl_test | ||||||
| sudo docker run -itd \ | ||||||
| --name ${docker_name} \ | ||||||
| --net=host \ | ||||||
| --uts=host \ | ||||||
| --ipc=host \ | ||||||
| --privileged=true \ | ||||||
| --group-add video \ | ||||||
| --shm-size 100gb \ | ||||||
| --ulimit memlock=-1 \ | ||||||
| --security-opt seccomp=unconfined \ | ||||||
| --security-opt apparmor=unconfined \ | ||||||
| --device=/dev/dri \ | ||||||
| --device=/dev/mxcd \ | ||||||
| --device=/dev/infiniband \ | ||||||
| ${docker_image} \ | ||||||
| /bin/bash | ||||||
|
|
||||||
| sudo docker exec -it ${docker_name} bash | ||||||
| ``` | ||||||
|
|
||||||
| > **Note:** `/dev/mxcd` is the MetaX compute device and `/dev/dri` provides GPU rendering access — both are required for MetaX GPU workloads. Ensure `mx-smi` is available inside the container for hardware auto-detection. Add `-v` mounts for your data and model directories as needed (e.g., `-v /data/share/:/data/share/`). | ||||||
|
|
||||||
| ## 2. Prepare Data and Models | ||||||
|
|
||||||
| ```bash | ||||||
| cd /workspace | ||||||
|
|
||||||
| # Download model (example: Qwen3-8B) | ||||||
| modelscope download --model Qwen/Qwen3-8B --local_dir ./Qwen3-8B | ||||||
|
|
||||||
| # Download dataset (example: GSM8K) | ||||||
| mkdir gsm8k && cd gsm8k | ||||||
| wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/rl/datasets/gsm8k/train.parquet" | ||||||
| wget "https://baai-flagscale.ks3-cn-beijing.ksyuncs.com/rl/datasets/gsm8k/test.parquet" | ||||||
| ``` | ||||||
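Before moving on, it can help to confirm the downloads landed where the later steps expect them. The sketch below assumes the layout produced by the commands above (`/workspace/Qwen3-8B`, `/workspace/gsm8k/train.parquet`, `/workspace/gsm8k/test.parquet`); `missing_assets` is a hypothetical helper and only checks that the paths exist and are non-empty, not that their contents are valid.

```python
from pathlib import Path


def missing_assets(workspace="/workspace"):
    """Return the expected model/dataset paths that are absent or empty."""
    root = Path(workspace)
    expected = [
        root / "Qwen3-8B",                 # model directory
        root / "gsm8k" / "train.parquet",  # training split
        root / "gsm8k" / "test.parquet",   # test split
    ]
    missing = []
    for path in expected:
        if path.is_dir():
            ok = any(path.iterdir())  # directory must contain at least one entry
        else:
            ok = path.is_file() and path.stat().st_size > 0
        if not ok:
            missing.append(str(path))
    return missing


if __name__ == "__main__":
    for item in missing_assets():
        print(f"missing or empty: {item}")
```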
|
|
||||||
| ## 3. Install verl and verl-hardware-plugin | ||||||
|
|
||||||
| verl is the RL training framework. For detailed installation options, see: [verl Installation Guide](https://verl.readthedocs.io/en/latest/start/install.html). | ||||||
|
|
||||||
| verl-hardware-plugin provides the MetaX hardware platform integration for verl. For detailed information, see: [verl-hardware-plugin](https://github.com/verl-project/verl-hardware-plugin). | ||||||
|
|
||||||
| ```bash | ||||||
| cd /workspace | ||||||
| git clone https://github.com/verl-project/verl | ||||||
| cd verl | ||||||
| pip install --no-build-isolation -v -e . | ||||||
|
|
||||||
| # Install verl-hardware-plugin next to verl, not inside the verl checkout | ||||||
| cd /workspace | ||||||
| git clone https://github.com/verl-project/verl-hardware-plugin.git | ||||||
| cd verl-hardware-plugin | ||||||
| pip install --no-build-isolation -v -e . | ||||||
| ``` | ||||||
|
|
||||||
| ## 4. Platform Configuration | ||||||
|
|
||||||
| Set the MetaX platform before launching training: | ||||||
|
|
||||||
| ```bash | ||||||
| export VERL_PLATFORM=metax | ||||||
| ``` | ||||||
|
|
||||||
| Or let auto-detection handle it (requires `mx-smi` in PATH inside the container). See [Environment Variables Reference](./env_reference.md) for full configuration options. | ||||||
|
Contributor
Broken Link: The link
Suggested change
|
||||||
|
|
||||||
| ## Verification | ||||||
|
|
||||||
| After installation, verify the components are properly installed: | ||||||
|
|
||||||
| ```bash | ||||||
| python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')" | ||||||
| python3 -c "import vllm; print('vLLM OK')" | ||||||
| python3 -c "import ray; ray.init(address='auto'); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()" | ||||||
|
Contributor
Execution Error: Running
Suggested change
|
||||||
| python3 -c "import transformer_engine; print('TransformerEngine OK')" | ||||||
| python3 -c "import megatron.core; print('Megatron-LM OK')" | ||||||
| python3 -c "import verl; print('verl OK')" | ||||||
| python3 -c "from verl.plugin.platform import get_platform;p = get_platform();print(f'device: {p.device_name}');print(f'vendor: {p.vendor_name}');print(f'available: {p.is_available()}')" | ||||||
| ``` | ||||||
|
|
||||||
| Verify MetaX hardware detection: | ||||||
|
|
||||||
| ```bash | ||||||
| mx-smi -L # List MetaX GPUs | ||||||
| python3 -c " | ||||||
| import subprocess, sys | ||||||
| ret = subprocess.run(['mx-smi', '-L'], capture_output=True, text=True) | ||||||
| if ret.returncode == 0: | ||||||
| print('MetaX platform detected') | ||||||
| else: | ||||||
| print('MetaX platform not detected') | ||||||
| sys.exit(1) | ||||||
| " | ||||||
| ``` | ||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| # MetaX Quick Start | ||
|
|
||
| ## Introduction | ||
| This guide walks you through running a GRPO training job with verl on the MetaX platform. Make sure you have completed the [Installation Guide](./install_guidance.md) first. | ||
|
|
||
| ## Running a Training Script | ||
|
|
||
| MetaX provides verl example training scripts under `/workspace/verl/examples`. | ||
|
|
||
| Launch the training (example for Qwen3-8B on 8 GPUs): | ||
|
|
||
| ```bash | ||
| cd /workspace/verl/examples/grpo_trainer/ | ||
| bash train_qwen3-8b-8gpus.sh | ||
|
||
| ``` | ||
|
|
||
| Training is running successfully if you see step-level progress output in the logs. | ||
|
|
||
| > Adjust environment variables in the script (model path, data path, batch size, etc.) to match your setup before running. | ||
|
|
||
| ## Next Steps | ||
|
|
||
| - See the [FAQ](./faq.md) for troubleshooting common issues. | ||
There are two issues on this line:
- The line lists `MCCL` as the communication backend for MetaX, but both the FAQ (`docs/user_guide_metax/faq.md` line 31) and the platform implementation (`platform_cuda_metax.py` line 143) state that MetaX uses standard `NCCL`.
- `[User Guide](docs/user_guide_metax/README.md)` points to a non-existent `README.md` file in the `docs/user_guide_metax/` directory. It should point to `docs/user_guide_metax/install_guidance.md` instead.