Skip to content

fix(verl/rlvr): Add EFA host mounts, fix data format bug, and add optimized GRPO recipe#1062

Open
paragao wants to merge 1 commit into
mainfrom
fix/rlvr-efa-nccl-data-format
Open

fix(verl/rlvr): Add EFA host mounts, fix data format bug, and add optimized GRPO recipe#1062
paragao wants to merge 1 commit into
mainfrom
fix/rlvr-efa-nccl-data-format

Conversation

@paragao

@paragao paragao commented Apr 9, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Fix EFA networking on standard EKS: Add hostPath mounts for /opt/amazon/efa and /opt/amazon/ofi-nccl with an init container that creates NCCL symlinks. Required when the Docker image does not bundle EFA/OFI-NCCL libraries.
  • Fix data format bug in load_data_grpo.sh: Wrap the prompt column in chat message dicts ([{"role": "user", "content": "..."}]) instead of raw strings. Without this fix, verl's RLHFDataset._build_messages() tokenizes prompts to only 3 tokens and reward stays at 0.0 for the entire training run.
  • Add optimized GRPO recipe for p5en: New recipe/run_grpo_p5en_optimized.sh with dynamic batching, FSDP2, forward prefetch, and higher vLLM KV cache allocation.

Changes

setup/raycluster.yaml

  • Add EFA/OFI-NCCL hostPath volume mounts and nccl-symlinks init container for worker pods
  • Add /dev/shm emptyDir (Memory medium, 200Gi) for PyTorch shared memory and NCCL SHM transport
  • Add NCCL tuning env vars (NCCL_BUFFSIZE, NCCL_P2P_NET_CHUNKSIZE, NCCL_TUNER_PLUGIN) and LD_LIBRARY_PATH pointing to mounted EFA libs
  • Add VLLM_USE_V1, TOKENIZERS_PARALLELISM, PYTORCH_CUDA_ALLOC_CONF env vars
  • Remove hardcoded NCCL_PROTO=simple from workers (let the NCCL tuner plugin decide)
  • Templatize worker CPU/memory as PLACEHOLDER_WORKER_CPU/PLACEHOLDER_WORKER_MEMORY

setup/env_vars.example

  • Add WORKER_CPU and WORKER_MEMORY variables with guidance for p5en.48xlarge sizing

setup/generate-kustomization.sh

  • Add WORKER_CPU/WORKER_MEMORY substitutions in the kustomize worker resources patch

setup/load_data_grpo.sh

  • Bug fix: Change add_verl_columns() to wrap the question column in chat message dicts instead of passing raw strings to verl
  • Add -n ${RAY_NAMESPACE} to all kubectl commands

ray-expose.sh

  • Add -n ${RAY_NAMESPACE} to kubectl get service and kubectl port-forward

recipe/run_grpo_p5en_optimized.sh (new)

  • Optimized GRPO recipe for p5en.48xlarge with:
    • Dynamic batching (use_dynamic_bsz=True, ppo_max_token_len_per_gpu=4096)
    • FSDP2 (strategy=fsdp2)
    • Forward prefetch (forward_prefetch=True)
    • Higher vLLM KV cache (gpu_memory_utilization=0.7)
    • PYTORCH_CUDA_ALLOC_CONF="" in runtime-env-json to prevent vLLM v1 conflict

Testing

Tested on 4x p5en.48xlarge (32 H200 GPUs) with Qwen3-8B on GSM8K using verl v0.6.1 and KubeRay v1.5.1.

…imized GRPO recipe

- Add EFA/OFI-NCCL hostPath mounts and init container for NCCL symlink
  creation on worker pods. Required when the Docker image does not bundle
  EFA libraries (e.g. on standard EKS with p5/p5en nodes).
- Add /dev/shm emptyDir (Memory medium, 200Gi) for PyTorch shared memory
  and NCCL SHM transport.
- Add NCCL tuning env vars (NCCL_BUFFSIZE, NCCL_P2P_NET_CHUNKSIZE,
  NCCL_TUNER_PLUGIN) and LD_LIBRARY_PATH pointing to mounted EFA libs.
- Add VLLM_USE_V1, TOKENIZERS_PARALLELISM, PYTORCH_CUDA_ALLOC_CONF env
  vars to worker pod spec.
- Remove hardcoded NCCL_PROTO=simple from worker env (let tuner decide).
- Templatize worker CPU/memory as PLACEHOLDER_WORKER_CPU/MEMORY with new
  WORKER_CPU and WORKER_MEMORY variables in env_vars.example.
- Fix data format bug in load_data_grpo.sh: wrap prompt column in chat
  message dicts ([{role: user, content: ...}]) instead of raw strings.
  Without this fix, verl tokenizes prompts to 3 tokens and reward is 0.0.
- Add -n $RAY_NAMESPACE to kubectl commands in load_data_grpo.sh and
  ray-expose.sh for namespace-aware operation.
- Add WORKER_CPU/WORKER_MEMORY substitutions in generate-kustomization.sh.
- Add new recipe/run_grpo_p5en_optimized.sh with dynamic batching, FSDP2,
  forward prefetch, and higher vLLM KV cache for p5en throughput.
@paragao paragao requested review from mvinci12 May 1, 2026 14:56

@mvinci12 mvinci12 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bundling these three fixes @paragao — the chat-message-dict wrapping for verl's RLHFDataset._build_messages(), the EFA hostPath + NCCL symlink init-container pattern, and the PYTORCH_CUDA_ALLOC_CONF="" workaround for vLLM v1 / CuMemAllocator are all hard-won bits of knowledge that plenty of folks hit. The inline "why" docstrings and comments are excellent, the namespace hygiene (-n "${RAY_NAMESPACE}" everywhere) is clean, and the test evidence (4x p5en.48xlarge / Qwen3-8B / GSM8K / verl v0.6.1 / KubeRay v1.5.1) is appreciated.

A few items before merge:

🟡 Variable ordering in run_grpo_p5en_optimized.sh

exp_name is set at the top of the script using ${MODEL_NAME}:

exp_name="GRPO-${MODEL_NAME}-optimized"

but MODEL_NAME's default isn't applied until much further down:

MODEL_NAME=${MODEL_NAME:-"Qwen3-8B"}

Because the script runs under set -u, if a caller invokes this without exporting MODEL_NAME first, the script will fail with "MODEL_NAME: unbound variable" before it reaches the default. Moving the env-var default block (MODEL_NAME, MODEL_PATH, RAY_DATA_HOME, etc.) above the exp_name assignment would fix this.

🟡 busybox:latest in the init container

initContainers:
- name: nccl-symlinks
  image: busybox:latest

Unpinned :latest is a reproducibility and supply-chain risk, and Docker Hub rate-limiting bites on large EKS clusters. Consider either pinning to a digest or switching to a pinned tag on ECR Public:

image: public.ecr.aws/docker/library/busybox:1.36

💡 Suggestion: consolidate worker CPU/memory config

WORKER_CPU / WORKER_MEMORY are introduced in two places — env_vars.example and the generate-kustomization.sh patch. It's a bit ambiguous which layer is the intended source of truth. Suggest keeping the kustomize patch as the sole definition (consistent with how the other PLACEHOLDER_* values flow) and dropping the entries from env_vars.example, or at minimum adding a comment in env_vars.example saying "these are consumed by generate-kustomization.sh and injected via the kustomize patch." Non-blocking.

❓ Relationship to run_grpo_configurable.sh?

Is the new run_grpo_p5en_optimized.sh intended to replace run_grpo_configurable.sh, or coexist? If coexist, a short note in the README describing when to pick each (e.g., "use configurable as the general-purpose entry point; use p5en_optimized on H200 nodes for max throughput") would help future users.


Overall a nice cleanup of real issues. Happy to re-review once the ordering fix and image pin are in.

@mvinci12 mvinci12 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants