[Metax] Add metax support#7
Code Review
This pull request adds comprehensive documentation for the MetaX GPU platform integration, including installation guides, FAQs, and quick start instructions. The review feedback points out a few issues: an incorrect communication backend (MCCL instead of NCCL) and a broken link in the main README, a broken reference link in the installation guide, and an incorrect Ray initialization command in the verification script that would fail in a new container environment.
| | Intel XPU | Data Center GPU Max / Arc | xccl (oneCCL) | ✅ Example (requires vendor support) | TBD | | ||
| | Cambricon MLU | MLU370 / MLU590 | CNCL | ✅ Supported | [User Guide](docs/user_guide_mlu/README.md) | | ||
| | MetaX | MetaX GPUs (CUDA-compatible) | NCCL | ✅ Example (requires vendor support) | TBD | | ||
| | MetaX | MetaX GPUs (CUDA-compatible) | MCCL | ✅ Supported | [User Guide](docs/user_guide_metax/README.md) | |
There are two issues on this line:
- Inconsistent Communication Backend: The table lists `MCCL` as the communication backend for MetaX, but both the FAQ (`docs/user_guide_metax/faq.md` line 31) and the platform implementation (`platform_cuda_metax.py` line 143) state that MetaX uses standard `NCCL`.
- Broken Link: The link `[User Guide](docs/user_guide_metax/README.md)` points to a non-existent `README.md` file in the `docs/user_guide_metax/` directory. It should point to `docs/user_guide_metax/install_guidance.md` instead.
| | MetaX | MetaX GPUs (CUDA-compatible) | MCCL | ✅ Supported | [User Guide](docs/user_guide_metax/README.md) | | |
| | MetaX | MetaX GPUs (CUDA-compatible) | NCCL | ✅ Supported | [User Guide](docs/user_guide_metax/install_guidance.md) | |
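Broken relative links like the one flagged here can be caught before review with a small check. The following is a minimal sketch, not part of this PR: `find_broken_links` is a hypothetical helper that only handles inline `[text](target)` links and resolves each target relative to the Markdown file that contains it.

```python
import re
from pathlib import Path

# Matches inline Markdown links of the form [text](target).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def find_broken_links(md_file: Path) -> list[str]:
    """Return relative link targets in md_file that do not exist on disk."""
    broken = []
    text = md_file.read_text(encoding="utf-8")
    for target in LINK_RE.findall(text):
        # Skip external URLs and in-page anchors; only local paths are checked.
        if target.startswith(("http://", "https://", "mailto:", "#")):
            continue
        # Drop a trailing "#section" fragment before testing the path.
        path_part = target.split("#", 1)[0]
        if not (md_file.parent / path_part).exists():
            broken.append(target)
    return broken
```

Running it as `find_broken_links(Path("README.md"))` from the repository root would list every relative target that does not resolve, which could then be fixed or removed.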
| export VERL_PLATFORM=metax | ||
| ``` | ||
|
|
||
| Or let auto-detection handle it (requires `mx-smi` in PATH inside the container). See [Environment Variables Reference](./env_reference.md) for full configuration options. |
Broken Link: The link `[Environment Variables Reference](./env_reference.md)` points to `./env_reference.md`, which does not exist in the `docs/user_guide_metax/` directory. We should remove this reference to avoid a broken link.
| Or let auto-detection handle it (requires `mx-smi` in PATH inside the container). See [Environment Variables Reference](./env_reference.md) for full configuration options. | |
| Or let auto-detection handle it (requires mx-smi in PATH inside the container). |
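The selection order described in the guide (an explicit `VERL_PLATFORM` first, otherwise probing for `mx-smi` on `PATH`) can be illustrated roughly as follows. This is a sketch under assumptions, not verl's actual detection code: the function name `detect_platform` and the `"cuda"` fallback value are hypothetical.

```python
import os
import shutil


def detect_platform(default: str = "cuda") -> str:
    """Hypothetical helper: pick a platform name for the current environment.

    An explicit VERL_PLATFORM setting wins; otherwise the presence of the
    mx-smi tool on PATH is taken as a sign of a MetaX environment.
    The "cuda" default is an assumption made for this sketch.
    """
    explicit = os.environ.get("VERL_PLATFORM")
    if explicit:
        return explicit
    if shutil.which("mx-smi") is not None:
        return "metax"
    return default
```

If auto-detection picks the wrong platform inside a container, checking `shutil.which("mx-smi")` by hand is a quick way to confirm whether the tool is actually visible on `PATH`.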
| ```bash | ||
| python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')" | ||
| python3 -c "import vllm; print('vLLM OK')" | ||
| python3 -c "import ray; ray.init(address='auto'); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()" |
Execution Error: Running `ray.init(address='auto')` will attempt to connect to an existing Ray cluster. Since these instructions are for setting up a new container, no Ray cluster is running yet, and this command will fail. Using `ray.init()` instead will start a local Ray instance and correctly verify GPU detection.
| python3 -c "import ray; ray.init(address='auto'); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()" | |
| python3 -c "import ray; ray.init(); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()" |
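If the verification step should work both in a fresh container and on a node that already belongs to a cluster, one option is to try the existing cluster first and fall back to a local instance. The sketch below is an alternative to the one-liner, not what the PR ships; `check_ray_gpus` is a hypothetical name, and it assumes that `ray.init(address="auto")` raises `ConnectionError` when no running cluster is found, as recent Ray releases do.

```python
def check_ray_gpus() -> bool:
    """Hypothetical helper: return True if Ray reports at least one GPU."""
    import ray  # imported lazily so this file loads even without Ray installed

    try:
        # Prefer an already-running cluster if there is one.
        ray.init(address="auto")
    except ConnectionError:
        # Assumed failure mode when no cluster exists: start a local instance.
        ray.init()
    try:
        return ray.cluster_resources().get("GPU", 0) > 0
    finally:
        # Always release the connection or local instance.
        ray.shutdown()
```

Calling `print("Ray OK" if check_ray_gpus() else "Failed")` would then mirror the output of the original command.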
|
|
||
| ## 1. Pull the Base Image | ||
|
|
||
| Please download docker images from (https://developer.metax-tech.com/) or contact Metax engineers to get the docker images. |
At least provide one usable image name.
|
|
||
| ```bash | ||
| cd /workspace/verl/examples/grpo_trainer/ | ||
| bash train_qwen3-8b-8gpus.sh |
Summary
Add MetaX vendor support for verl hardware plugin
Motivation
MetaX platform code (`platform_cuda_metax.py`, `fsdp_metax.py`, `megatron_metax.py`) was already in place, but the user-facing documentation was completely missing. This PR completes the MetaX user guide with installation instructions, quick start, and FAQ.
Changes
- `docs/user_guide_metax/install_guidance.md`: Docker setup with MetaX-specific devices, SDK installation, and verification
- `docs/user_guide_metax/quick_start.md`: GRPO training quick start pointing to `verl/examples/grpo_trainer/`
- `docs/user_guide_metax/faq.md`: Platform selection, MetaX vs NVIDIA distinction, troubleshooting (OOM, Ray, mx-smi detection)
- `README.md`: MetaX row doc link changed from `TBD` to the actual guide; user guide section expanded with inline sub-doc links
Testing
- `pytest tests/ -v` passes
Checklist
- `pre-commit` checks