[Metax] Add metax support #7

Merged: heavyrain-lzy merged 4 commits into verl-project:main from dinghaodhd:main on Jun 29, 2026

@dinghaodhd dinghaodhd commented Jun 24, 2026

Contributor

Summary

Add MetaX vendor support for verl hardware plugin

Motivation

MetaX platform code (platform_cuda_metax.py, fsdp_metax.py, megatron_metax.py) was already in place, but the user-facing documentation was missing entirely. This PR completes the MetaX user guide with installation instructions, a quick start, and an FAQ.

Changes

  • Add docs/user_guide_metax/install_guidance.md: Docker setup with MetaX-specific devices, SDK installation, and verification
  • Add docs/user_guide_metax/quick_start.md: GRPO training quick start pointing to verl/examples/grpo_trainer/
  • Add docs/user_guide_metax/faq.md: platform selection, MetaX vs NVIDIA distinction, troubleshooting (OOM, Ray, mx-smi detection)
  • Update README.md: in the MetaX row, change the doc link from TBD to the actual guide, and expand the user guide section with inline sub-doc links

Testing

  • pytest tests/ -v passes
  • Manually verified on target hardware (if applicable)

Checklist

  • Code follows the project's style and passes pre-commit checks
  • Documentation updated (if applicable)
  • No secrets or credentials included

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds comprehensive documentation for the MetaX GPU platform integration, including installation guides, FAQs, and quick start instructions. The review feedback points out a few issues: an incorrect communication backend (MCCL instead of NCCL) and a broken link in the main README, a broken reference link in the installation guide, and an incorrect Ray initialization command in the verification script that would fail in a new container environment.

Comment thread README.md
| Intel XPU | Data Center GPU Max / Arc | xccl (oneCCL) | ✅ Example (requires vendor support) | TBD |
| Cambricon MLU | MLU370 / MLU590 | CNCL | ✅ Supported | [User Guide](docs/user_guide_mlu/README.md) |
| MetaX | MetaX GPUs (CUDA-compatible) | NCCL | ✅ Example (requires vendor support) | TBD |
| MetaX | MetaX GPUs (CUDA-compatible) | MCCL | ✅ Supported | [User Guide](docs/user_guide_metax/README.md) |

Contributor

medium

There are two issues on this line:

  1. Inconsistent Communication Backend: The table lists MCCL as the communication backend for MetaX, but both the FAQ (docs/user_guide_metax/faq.md line 31) and the platform implementation (platform_cuda_metax.py line 143) state that MetaX uses standard NCCL.
  2. Broken Link: The link [User Guide](docs/user_guide_metax/README.md) points to a non-existent README.md file in the docs/user_guide_metax/ directory. It should point to docs/user_guide_metax/install_guidance.md instead.
Suggested change
| MetaX | MetaX GPUs (CUDA-compatible) | MCCL | ✅ Supported | [User Guide](docs/user_guide_metax/README.md) |
| MetaX | MetaX GPUs (CUDA-compatible) | NCCL | ✅ Supported | [User Guide](docs/user_guide_metax/install_guidance.md) |
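
A broken relative link like the one flagged here can be caught mechanically before review. The following is a minimal sketch, not part of this PR: `find_broken_links` is a hypothetical helper name, and it assumes links are written inline as `[text](target)` and resolved relative to the repository root.

```python
import re
from pathlib import Path
from typing import List

# Matches inline Markdown links such as [User Guide](docs/user_guide_metax/README.md).
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def find_broken_links(markdown_text: str, base_dir: Path) -> List[str]:
    """Return relative link targets in markdown_text that do not exist under base_dir.

    Hypothetical helper for illustration; it is not part of verl.
    """
    broken = []
    for target in LINK_PATTERN.findall(markdown_text):
        # Skip external URLs, mail links, and in-page anchors.
        if target.startswith(("http://", "https://", "mailto:", "#")):
            continue
        path_part = target.split("#", 1)[0]  # drop any fragment identifier
        if not (base_dir / path_part).exists():
            broken.append(target)
    return broken
```

Calling `find_broken_links(Path("README.md").read_text(), Path("."))` from the repository root would list targets such as `docs/user_guide_metax/README.md` if that file is absent.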

```bash
export VERL_PLATFORM=metax
```

Or let auto-detection handle it (requires `mx-smi` in PATH inside the container). See [Environment Variables Reference](./env_reference.md) for full configuration options.

Contributor

medium

Broken Link: The link [Environment Variables Reference](./env_reference.md) points to a non-existent file ./env_reference.md in the docs/user_guide_metax/ directory. Since there is no such file, we should remove this reference to avoid a broken link.

Suggested change
Or let auto-detection handle it (requires `mx-smi` in PATH inside the container). See [Environment Variables Reference](./env_reference.md) for full configuration options.
Or let auto-detection handle it (requires `mx-smi` in PATH inside the container).
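
The selection order described in the guide (explicit `VERL_PLATFORM` first, otherwise detection through `mx-smi` on PATH) can be sketched as below. This is an illustration under assumptions: `resolve_platform` is a hypothetical name, and the actual detection logic in verl is not shown in this thread and may differ.

```python
import os
import shutil
from typing import Mapping, Optional


def resolve_platform(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Hypothetical sketch: an explicit VERL_PLATFORM wins, else probe for mx-smi."""
    env = os.environ if env is None else env
    explicit = env.get("VERL_PLATFORM", "").strip().lower()
    if explicit:
        return explicit
    # Auto-detection: mx-smi must be on PATH inside the container.
    if shutil.which("mx-smi", path=env.get("PATH")) is not None:
        return "metax"
    return None  # nothing set and no MetaX tooling found
```

A return value of `None` here would correspond to the case where the container lacks `mx-smi` on PATH and the variable is unset, which is the situation the FAQ's detection troubleshooting addresses.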

```bash
python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python3 -c "import vllm; print('vLLM OK')"
python3 -c "import ray; ray.init(address='auto'); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"
```

Contributor

medium

Execution Error: Running ray.init(address='auto') will attempt to connect to an existing Ray cluster. Since these instructions are for setting up a new container, no Ray cluster is running yet, and this command will fail. Using ray.init() instead will start a local Ray instance and correctly verify GPU detection.

Suggested change
python3 -c "import ray; ray.init(address='auto'); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"
python3 -c "import ray; ray.init(); res=ray.cluster_resources(); print('Ray OK' if 'GPU' in res and res['GPU']>0 else 'Failed'); ray.shutdown()"
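
For readability, the suggested one-liner can also be written as a short script. This is a sketch under the same assumption as the suggestion (no Ray cluster is running yet, so `ray.init()` starts a local instance); the function names `gpu_status` and `check_local_ray` are hypothetical.

```python
from typing import Mapping


def gpu_status(resources: Mapping[str, float]) -> str:
    """Mirror the one-liner's check: 'Ray OK' only if at least one GPU is reported."""
    return "Ray OK" if resources.get("GPU", 0) > 0 else "Failed"


def check_local_ray() -> str:
    """Start a local Ray instance, read its resources, and always shut it down."""
    import ray  # imported lazily so gpu_status stays usable without Ray installed

    ray.init()  # no address argument: start locally instead of joining a cluster
    try:
        return gpu_status(ray.cluster_resources())
    finally:
        ray.shutdown()
```

Running `print(check_local_ray())` inside the new container performs the same verification as the suggested command, with the shutdown guaranteed even if reading the resources raises.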


## 1. Pull the Base Image

Please download docker images from (https://developer.metax-tech.com/) or contact Metax engineers to get the docker images.

Collaborator

At least provide one usable image name.

Comment thread docs/user_guide_metax/quick_start.md Outdated

```bash
cd /workspace/verl/examples/grpo_trainer/
bash train_qwen3-8b-8gpus.sh
```

@heavyrain-lzy heavyrain-lzy left a comment

Collaborator

LGTM

@heavyrain-lzy heavyrain-lzy merged commit a7f2d79 into verl-project:main Jun 29, 2026
1 check passed