fix(verl/rlvr): Add EFA host mounts, fix data format bug, and add optimized GRPO recipe by paragao · Pull Request #1062 · awslabs/awsome-distributed-ai

paragao · 2026-04-09T09:43:56Z

Summary

Fix EFA networking on standard EKS: Add hostPath mounts for /opt/amazon/efa and /opt/amazon/ofi-nccl with an init container that creates NCCL symlinks. Required when the Docker image does not bundle EFA/OFI-NCCL libraries.
Fix data format bug in load_data_grpo.sh: Wrap the prompt column in chat message dicts ([{"role": "user", "content": "..."}]) instead of raw strings. Without this fix, verl's RLHFDataset._build_messages() tokenizes prompts to only 3 tokens and reward stays at 0.0 for the entire training run.
Add optimized GRPO recipe for p5en: New recipe/run_grpo_p5en_optimized.sh with dynamic batching, FSDP2, forward prefetch, and higher vLLM KV cache allocation.

Changes

`setup/raycluster.yaml`

Add EFA/OFI-NCCL hostPath volume mounts and nccl-symlinks init container for worker pods
Add /dev/shm emptyDir (Memory medium, 200Gi) for PyTorch shared memory and NCCL SHM transport
Add NCCL tuning env vars (NCCL_BUFFSIZE, NCCL_P2P_NET_CHUNKSIZE, NCCL_TUNER_PLUGIN) and LD_LIBRARY_PATH pointing to mounted EFA libs
Add VLLM_USE_V1, TOKENIZERS_PARALLELISM, PYTORCH_CUDA_ALLOC_CONF env vars
Remove hardcoded NCCL_PROTO=simple from workers (let the NCCL tuner plugin decide)
Templatize worker CPU/memory as PLACEHOLDER_WORKER_CPU/PLACEHOLDER_WORKER_MEMORY

`setup/env_vars.example`

Add WORKER_CPU and WORKER_MEMORY variables with guidance for p5en.48xlarge sizing

`setup/generate-kustomization.sh`

Add WORKER_CPU/WORKER_MEMORY substitutions in the kustomize worker resources patch

`setup/load_data_grpo.sh`

Bug fix: Change add_verl_columns() to wrap the question column in chat message dicts instead of passing raw strings to verl
Add -n ${RAY_NAMESPACE} to all kubectl commands

`ray-expose.sh`

Add -n ${RAY_NAMESPACE} to kubectl get service and kubectl port-forward

`recipe/run_grpo_p5en_optimized.sh` (new)

Optimized GRPO recipe for p5en.48xlarge with:
- Dynamic batching (use_dynamic_bsz=True, ppo_max_token_len_per_gpu=4096)
- FSDP2 (strategy=fsdp2)
- Forward prefetch (forward_prefetch=True)
- Higher vLLM KV cache (gpu_memory_utilization=0.7)
- PYTORCH_CUDA_ALLOC_CONF="" in runtime-env-json to prevent vLLM v1 conflict

Testing

Tested on 4x p5en.48xlarge (32 H200 GPUs) with Qwen3-8B on GSM8K using verl v0.6.1 and KubeRay v1.5.1.

…imized GRPO recipe - Add EFA/OFI-NCCL hostPath mounts and init container for NCCL symlink creation on worker pods. Required when the Docker image does not bundle EFA libraries (e.g. on standard EKS with p5/p5en nodes). - Add /dev/shm emptyDir (Memory medium, 200Gi) for PyTorch shared memory and NCCL SHM transport. - Add NCCL tuning env vars (NCCL_BUFFSIZE, NCCL_P2P_NET_CHUNKSIZE, NCCL_TUNER_PLUGIN) and LD_LIBRARY_PATH pointing to mounted EFA libs. - Add VLLM_USE_V1, TOKENIZERS_PARALLELISM, PYTORCH_CUDA_ALLOC_CONF env vars to worker pod spec. - Remove hardcoded NCCL_PROTO=simple from worker env (let tuner decide). - Templatize worker CPU/memory as PLACEHOLDER_WORKER_CPU/MEMORY with new WORKER_CPU and WORKER_MEMORY variables in env_vars.example. - Fix data format bug in load_data_grpo.sh: wrap prompt column in chat message dicts ([{role: user, content: ...}]) instead of raw strings. Without this fix, verl tokenizes prompts to 3 tokens and reward is 0.0. - Add -n $RAY_NAMESPACE to kubectl commands in load_data_grpo.sh and ray-expose.sh for namespace-aware operation. - Add WORKER_CPU/WORKER_MEMORY substitutions in generate-kustomization.sh. - Add new recipe/run_grpo_p5en_optimized.sh with dynamic batching, FSDP2, forward prefetch, and higher vLLM KV cache for p5en throughput.

mvinci12

Thanks for bundling these three fixes @paragao — the chat-message-dict wrapping for verl's RLHFDataset._build_messages(), the EFA hostPath + NCCL symlink init-container pattern, and the PYTORCH_CUDA_ALLOC_CONF="" workaround for vLLM v1 / CuMemAllocator are all hard-won bits of knowledge that plenty of folks hit. The inline "why" docstrings and comments are excellent, the namespace hygiene (-n "${RAY_NAMESPACE}" everywhere) is clean, and the test evidence (4x p5en.48xlarge / Qwen3-8B / GSM8K / verl v0.6.1 / KubeRay v1.5.1) is appreciated.

A few items before merge:

🟡 Variable ordering in `run_grpo_p5en_optimized.sh`

exp_name is set at the top of the script using ${MODEL_NAME}:

exp_name="GRPO-${MODEL_NAME}-optimized"

but MODEL_NAME's default isn't applied until much further down:

MODEL_NAME=${MODEL_NAME:-"Qwen3-8B"}

Because the script runs under set -u, if a caller invokes this without exporting MODEL_NAME first, the script will fail with "MODEL_NAME: unbound variable" before it reaches the default. Moving the env-var default block (MODEL_NAME, MODEL_PATH, RAY_DATA_HOME, etc.) above the exp_name assignment would fix this.

🟡 `busybox:latest` in the init container

initContainers:
- name: nccl-symlinks
  image: busybox:latest

Unpinned :latest is a reproducibility and supply-chain risk, and Docker Hub rate-limiting bites on large EKS clusters. Consider either pinning to a digest or switching to a pinned tag on ECR Public:

image: public.ecr.aws/docker/library/busybox:1.36

💡 Suggestion: consolidate worker CPU/memory config

WORKER_CPU / WORKER_MEMORY are introduced in two places — env_vars.example and the generate-kustomization.sh patch. It's a bit ambiguous which layer is the intended source of truth. Suggest keeping the kustomize patch as the sole definition (consistent with how the other PLACEHOLDER_* values flow) and dropping the entries from env_vars.example, or at minimum adding a comment in env_vars.example saying "these are consumed by generate-kustomization.sh and injected via the kustomize patch." Non-blocking.

❓ Relationship to `run_grpo_configurable.sh`?

Is the new run_grpo_p5en_optimized.sh intended to replace run_grpo_configurable.sh, or coexist? If coexist, a short note in the README describing when to pick each (e.g., "use configurable as the general-purpose entry point; use p5en_optimized on H200 nodes for max throughput") would help future users.

Overall a nice cleanup of real issues. Happy to re-review once the ordering fix and image pin are in.

mvinci12

see comments

paragao requested review from mvinci12 May 1, 2026 14:56

mvinci12 reviewed May 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(verl/rlvr): Add EFA host mounts, fix data format bug, and add optimized GRPO recipe#1062

fix(verl/rlvr): Add EFA host mounts, fix data format bug, and add optimized GRPO recipe#1062
paragao wants to merge 1 commit into
mainfrom
fix/rlvr-efa-nccl-data-format

paragao commented Apr 9, 2026

Uh oh!

mvinci12 left a comment

Uh oh!

mvinci12 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

paragao commented Apr 9, 2026

Summary

Changes

setup/raycluster.yaml

setup/env_vars.example

setup/generate-kustomization.sh

setup/load_data_grpo.sh

ray-expose.sh

recipe/run_grpo_p5en_optimized.sh (new)

Testing

Uh oh!

mvinci12 left a comment

Choose a reason for hiding this comment

🟡 Variable ordering in run_grpo_p5en_optimized.sh

🟡 busybox:latest in the init container

💡 Suggestion: consolidate worker CPU/memory config

❓ Relationship to run_grpo_configurable.sh?

Uh oh!

mvinci12 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`setup/raycluster.yaml`

`setup/env_vars.example`

`setup/generate-kustomization.sh`

`setup/load_data_grpo.sh`

`ray-expose.sh`

`recipe/run_grpo_p5en_optimized.sh` (new)

🟡 Variable ordering in `run_grpo_p5en_optimized.sh`

🟡 `busybox:latest` in the init container

❓ Relationship to `run_grpo_configurable.sh`?