fix(verl/rlvr): Add EFA host mounts, fix data format bug, and add optimized GRPO recipe#1062
fix(verl/rlvr): Add EFA host mounts, fix data format bug, and add optimized GRPO recipe#1062paragao wants to merge 1 commit into
Conversation
…imized GRPO recipe
- Add EFA/OFI-NCCL hostPath mounts and init container for NCCL symlink
creation on worker pods. Required when the Docker image does not bundle
EFA libraries (e.g. on standard EKS with p5/p5en nodes).
- Add /dev/shm emptyDir (Memory medium, 200Gi) for PyTorch shared memory
and NCCL SHM transport.
- Add NCCL tuning env vars (NCCL_BUFFSIZE, NCCL_P2P_NET_CHUNKSIZE,
NCCL_TUNER_PLUGIN) and LD_LIBRARY_PATH pointing to mounted EFA libs.
- Add VLLM_USE_V1, TOKENIZERS_PARALLELISM, PYTORCH_CUDA_ALLOC_CONF env
vars to worker pod spec.
- Remove hardcoded NCCL_PROTO=simple from worker env (let tuner decide).
- Templatize worker CPU/memory as PLACEHOLDER_WORKER_CPU/MEMORY with new
WORKER_CPU and WORKER_MEMORY variables in env_vars.example.
- Fix data format bug in load_data_grpo.sh: wrap prompt column in chat
message dicts ([{role: user, content: ...}]) instead of raw strings.
Without this fix, verl tokenizes prompts to 3 tokens and reward is 0.0.
- Add -n $RAY_NAMESPACE to kubectl commands in load_data_grpo.sh and
ray-expose.sh for namespace-aware operation.
- Add WORKER_CPU/WORKER_MEMORY substitutions in generate-kustomization.sh.
- Add new recipe/run_grpo_p5en_optimized.sh with dynamic batching, FSDP2,
forward prefetch, and higher vLLM KV cache for p5en throughput.
mvinci12
left a comment
There was a problem hiding this comment.
Thanks for bundling these three fixes @paragao — the chat-message-dict wrapping for verl's RLHFDataset._build_messages(), the EFA hostPath + NCCL symlink init-container pattern, and the PYTORCH_CUDA_ALLOC_CONF="" workaround for vLLM v1 / CuMemAllocator are all hard-won bits of knowledge that plenty of folks hit. The inline "why" docstrings and comments are excellent, the namespace hygiene (-n "${RAY_NAMESPACE}" everywhere) is clean, and the test evidence (4x p5en.48xlarge / Qwen3-8B / GSM8K / verl v0.6.1 / KubeRay v1.5.1) is appreciated.
A few items before merge:
🟡 Variable ordering in run_grpo_p5en_optimized.sh
exp_name is set at the top of the script using ${MODEL_NAME}:
exp_name="GRPO-${MODEL_NAME}-optimized"but MODEL_NAME's default isn't applied until much further down:
MODEL_NAME=${MODEL_NAME:-"Qwen3-8B"}Because the script runs under set -u, if a caller invokes this without exporting MODEL_NAME first, the script will fail with "MODEL_NAME: unbound variable" before it reaches the default. Moving the env-var default block (MODEL_NAME, MODEL_PATH, RAY_DATA_HOME, etc.) above the exp_name assignment would fix this.
🟡 busybox:latest in the init container
initContainers:
- name: nccl-symlinks
image: busybox:latestUnpinned :latest is a reproducibility and supply-chain risk, and Docker Hub rate-limiting bites on large EKS clusters. Consider either pinning to a digest or switching to a pinned tag on ECR Public:
image: public.ecr.aws/docker/library/busybox:1.36💡 Suggestion: consolidate worker CPU/memory config
WORKER_CPU / WORKER_MEMORY are introduced in two places — env_vars.example and the generate-kustomization.sh patch. It's a bit ambiguous which layer is the intended source of truth. Suggest keeping the kustomize patch as the sole definition (consistent with how the other PLACEHOLDER_* values flow) and dropping the entries from env_vars.example, or at minimum adding a comment in env_vars.example saying "these are consumed by generate-kustomization.sh and injected via the kustomize patch." Non-blocking.
❓ Relationship to run_grpo_configurable.sh?
Is the new run_grpo_p5en_optimized.sh intended to replace run_grpo_configurable.sh, or coexist? If coexist, a short note in the README describing when to pick each (e.g., "use configurable as the general-purpose entry point; use p5en_optimized on H200 nodes for max throughput") would help future users.
Overall a nice cleanup of real issues. Happy to re-review once the ordering fix and image pin are in.
Summary
/opt/amazon/efaand/opt/amazon/ofi-ncclwith an init container that creates NCCL symlinks. Required when the Docker image does not bundle EFA/OFI-NCCL libraries.load_data_grpo.sh: Wrap the prompt column in chat message dicts ([{"role": "user", "content": "..."}]) instead of raw strings. Without this fix, verl'sRLHFDataset._build_messages()tokenizes prompts to only 3 tokens and reward stays at 0.0 for the entire training run.recipe/run_grpo_p5en_optimized.shwith dynamic batching, FSDP2, forward prefetch, and higher vLLM KV cache allocation.Changes
setup/raycluster.yamlnccl-symlinksinit container for worker pods/dev/shmemptyDir (Memory medium, 200Gi) for PyTorch shared memory and NCCL SHM transportNCCL_BUFFSIZE,NCCL_P2P_NET_CHUNKSIZE,NCCL_TUNER_PLUGIN) andLD_LIBRARY_PATHpointing to mounted EFA libsVLLM_USE_V1,TOKENIZERS_PARALLELISM,PYTORCH_CUDA_ALLOC_CONFenv varsNCCL_PROTO=simplefrom workers (let the NCCL tuner plugin decide)PLACEHOLDER_WORKER_CPU/PLACEHOLDER_WORKER_MEMORYsetup/env_vars.exampleWORKER_CPUandWORKER_MEMORYvariables with guidance for p5en.48xlarge sizingsetup/generate-kustomization.shWORKER_CPU/WORKER_MEMORYsubstitutions in the kustomize worker resources patchsetup/load_data_grpo.shadd_verl_columns()to wrap thequestioncolumn in chat message dicts instead of passing raw strings to verl-n ${RAY_NAMESPACE}to allkubectlcommandsray-expose.sh-n ${RAY_NAMESPACE}tokubectl get serviceandkubectl port-forwardrecipe/run_grpo_p5en_optimized.sh(new)use_dynamic_bsz=True,ppo_max_token_len_per_gpu=4096)strategy=fsdp2)forward_prefetch=True)gpu_memory_utilization=0.7)PYTORCH_CUDA_ALLOC_CONF=""in runtime-env-json to prevent vLLM v1 conflictTesting
Tested on 4x p5en.48xlarge (32 H200 GPUs) with Qwen3-8B on GSM8K using verl v0.6.1 and KubeRay v1.5.1.