feat: add Cosmos Reason vLLM inference test case #1111
Adds an online inference test case for the NVIDIA Cosmos Reason physical-reasoning VLM served by vLLM on Amazon EKS and SageMaker HyperPod EKS.
- Two parallel deployment paths: vanilla EKS (kubernetes/) and HyperPod Inference Operator (hyperpod-eks/)
- Three example clients: image VQA, video Q&A, SDG auto-labeling
- Default model: nvidia/Cosmos-Reason1-7B (Reason2 supported via MODEL_ID swap)
- Validated empirically on g5.8xlarge / A10G 24 GB
- download_samples.sh provides CC-licensed sample media for testing
- kubernetes/deployment.yaml: add --limit-mm-per-prompt and --mm-processor-kwargs to expand encoder cache for video workloads; bump MAX_MODEL_LEN default to 24576
- hyperpod-eks/endpoint.yaml: comment out --reasoning-parser qwen3 and SM_VLLM_REASONING_PARSER by default (incompatible with Reason1/Qwen2.5-VL; uncomment for Reason2 only)
- README.md: add Validation Status section with empirical benchmark table, [!IMPORTANT] callout for video encoder cache sizing, extended Reason1-vs-Reason2 parser guidance note, and four new Troubleshooting rows
Tested end-to-end on hp-cluster-mvincig-hypd-0223-d2cp (us-west-2):
  kubernetes/ + Reason1-7B (g5.8xlarge): PASS
  kubernetes/ + Reason2-8B (g6.12xlarge TP=4): PASS
  hyperpod-eks/ + Reason1-7B (ml.g5.8xlarge): PASS (image+label)
  hyperpod-eks/ + Reason2-8B (ml.g6.12xlarge TP=4): PASS (image+label)
…nor cleanups
Major fixes
- kubernetes/README.md: remove stale HorizontalPodAutoscaler from the file
table and intro. The deployment does not include an HPA; CPU-based scaling
is not a useful proxy for GPU-bound inference. KEDA on
vllm:num_requests_running is recommended in the body.
- README.md, hyperpod-eks/README.md: align on vllm/vllm-openai:v0.21.0 for
the kubernetes/ path. Clarify that the AWS DLC vllm:0.17-gpu-py312
corresponds to vLLM 0.17.1 (separate from the upstream image).
- env_vars.example, README.md: bump default MAX_MODEL_LEN to 24576 so the
sample video clip works out of the box on 24 GB GPUs. Document fallback to
8192 for tight VRAM.
Minor cleanups
- env_vars.example, kubernetes/deployment.yaml: remove dead REASONING_PARSER
and MEDIA_IO_KWARGS env-var pattern (envsubst inside YAML comments never
reaches the vLLM CLI). Replace with explicit manual-edit instructions for
Reason2 users.
- hyperpod-eks/hf-token-secret.yaml.example,
kubernetes/hf-token-secret.yaml.example: use ${NAMESPACE} for consistency
with all other manifests.
- examples/auto_label.py: drop unused base64 import.
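The point about envsubst inside YAML comments can be illustrated with a small sketch. It uses Python's `string.Template` as a stand-in for envsubst (both replace `${VAR}` textually), and the manifest fragment is hypothetical, not copied from kubernetes/deployment.yaml. The value is substituted into the comment, but the line stays a comment, so the flag never becomes an active argument.

```python
from string import Template

# Hypothetical manifest fragment: the optional flag lives inside YAML comments.
MANIFEST = """\
args:
  - "--model"
  - "${MODEL_ID}"
  # - "--reasoning-parser"
  # - "${REASONING_PARSER}"
"""


def render(text, env):
    """Substitute ${VAR} placeholders textually, as envsubst would."""
    return Template(text).safe_substitute(env)


def active_lines(rendered):
    """Return the non-empty lines that are not YAML comments."""
    stripped = (line.strip() for line in rendered.splitlines())
    return [line for line in stripped if line and not line.startswith("#")]


rendered = render(
    MANIFEST,
    {"MODEL_ID": "nvidia/Cosmos-Reason1-7B", "REASONING_PARSER": "qwen3"},
)

# The substituted value appears in the rendered text, but only inside a comment.
for line in active_lines(rendered):
    print(line)
```

This is why the commit replaces the env-var pattern with manual-edit instructions: setting the variable alone changes nothing until the lines are uncommented.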
bluecrayon52 left a comment
Reviewed the 14 files under cosmos-reason/. Solid test case — clean two-path structure, the validation matrix is genuinely useful, secrets handling is correct (kubectl create secret default, no committed tokens), and TLS defaults to verify-on. A few items below; only the port one is functional. Not blocking.
Heads-up: this branch's merge-base is behind current main — worth a rebase before merge.
- "--host"
- "0.0.0.0"
- "--port"
- "${INVOCATION_PORT}"
--port uses ${INVOCATION_PORT} but containerPort (93), both probes (111/119), and the Service (145-146) hardcode 8000. If anyone sets INVOCATION_PORT to a non-8000 value, vLLM moves but the probes/Service don't follow — readiness fails and traffic black-holes. Either thread ${INVOCATION_PORT} through all of them, or drop the variable here and hardcode 8000 consistently. The env default (8000) masks this in the happy path.
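A minimal sketch of the consistency check this comment implies, assuming the manifest has already been parsed into plain dictionaries. The field names (`containerPort`, `readinessProbe`, `livenessProbe`, `httpGet.port`, `targetPort`) are standard Kubernetes fields; the function name, the sample data, and the assumption that both probes use `httpGet` are illustrative and not taken from the PR.

```python
def find_port_mismatches(container, service, expected_port):
    """List every port setting that differs from the port vLLM listens on."""
    mismatches = []
    for entry in container.get("ports", []):
        if entry.get("containerPort") != expected_port:
            mismatches.append(f"containerPort={entry.get('containerPort')}")
    for probe_name in ("readinessProbe", "livenessProbe"):
        # Assumption: probes are httpGet probes with a numeric port.
        port = container.get(probe_name, {}).get("httpGet", {}).get("port")
        if port is not None and port != expected_port:
            mismatches.append(f"{probe_name}.httpGet.port={port}")
    for entry in service.get("ports", []):
        target = entry.get("targetPort")
        if target is not None and target != expected_port:
            mismatches.append(f"service.targetPort={target}")
    return mismatches


# Hypothetical parsed manifest with 8000 hardcoded everywhere except --port.
CONTAINER = {
    "ports": [{"containerPort": 8000}],
    "readinessProbe": {"httpGet": {"path": "/health", "port": 8000}},
    "livenessProbe": {"httpGet": {"path": "/health", "port": 8000}},
}
SERVICE = {"ports": [{"port": 8000, "targetPort": 8000}]}

# With INVOCATION_PORT=9000, every hardcoded 8000 is reported.
for problem in find_port_mismatches(CONTAINER, SERVICE, 9000):
    print(problem)
```

With the default of 8000 the function returns an empty list, which matches the observation that the env default masks the problem in the happy path.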
import sys
import time
from pathlib import Path
from typing import Optional, Tuple
Optional and Tuple are imported but unused here (the parser is reused from image_vqa). The repo's PR workflow runs flake8 on all Python files, so these will show up as F401 in the automated lint report — worth cleaning up, though it won't fail the build.
| `hyperpod-eks/` | Reason1-7B | ml.g5.8xlarge (A10G, TP=1) | 21.3 s | unsupported¹ | 18.1 s |
| `hyperpod-eks/` | Reason2-8B | ml.g6.12xlarge (4× L4, TP=4) | 13.8 s | unsupported¹ | 4.5 s |

¹ The AWS vLLM DLC `vllm:0.17-gpu-py312` (vLLM 0.17.1) does not expose
Small inconsistency on the DLC version: this tag is described as vLLM 0.17.1 here and at line 238, but as 0.17.0 in env_vars.example:30 and hyperpod-eks/README.md:146. Might be worth aligning these to a single version.
| Variable | Default | Purpose |
|----------|---------|---------|
| `MODEL_ID` | `nvidia/Cosmos-Reason1-7B` | HF model ID. Override to `Cosmos-Reason2-8B` on L40S/H100. |
| `IMAGE_TAG` (kubernetes) | `vllm/vllm-openai:v0.21.0` | Upstream vLLM container. Pin to a specific version, never `:latest`. |
The Configuration Reference lists IMAGE_TAG, but the actual env vars are VLLM_IMAGE_VANILLA and VLLM_IMAGE_AWS_DLC (env_vars.example:27/31). IMAGE_TAG doesn't exist anywhere — a user grepping for it finds nothing.
- HyperPod EKS cluster with at least one GPU node
- HyperPod Inference Operator installed:
  - Helm chart `hyperpod-inference-operator` v2.1.1, image `v3.1`, OR
Version references span multiple install paths: this says Helm chart v2.1.1, the parent README says image v3.1 / CLI v3.7.1+, and the EKS add-on you recommend at line 25 versions as vX.Y.Z-eksbuild.N (currently v1.2.1-eksbuild.1). Since the add-on is the recommended path, consider documenting its version scheme too, and labeling which number belongs to which path (add-on vs Helm vs CLI). Could you also confirm v2.1.1 / v3.1 / v3.7.1 are current for the Helm/CLI install paths?
Summary
Adds an online inference test case for NVIDIA's Cosmos Reason physical-reasoning Vision-Language Model, served by vLLM on Amazon EKS or SageMaker HyperPod EKS.
What this adds
3.test_cases/pytorch/vllm/cosmos-reason/: 14 files, ~1300 lines.
Two parallel deployment paths (both kubectl-only)
- kubernetes/: vllm/vllm-openai:v0.21.0
- hyperpod-eks/: vllm:0.17-gpu-py312
Three example use cases
- examples/image_vqa.py
- examples/video_qa.py
- examples/auto_label.py
Sample data
examples/download_samples.sh fetches CC-licensed media from Unsplash and Wikimedia Commons. Sample files are gitignored.
Validation
Validated end-to-end on a SageMaker HyperPod EKS cluster:
- Model: nvidia/Cosmos-Reason1-7B
- Image: vllm/vllm-openai:v0.21.0
Design decisions
- vllm:num_requests_running for production.
- --insecure flag available for self-signed certs
- auto_label.py includes --max-retries with exponential backoff for transient HTTP errors
- envsubst dry-run validation step documented in Quick Start
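For readers unfamiliar with the --max-retries behaviour mentioned above, here is a minimal sketch of retrying with exponential backoff. It is not the code in examples/auto_label.py: the function name, the set of status codes treated as transient, and the 1-second base delay are all assumptions chosen for illustration.

```python
import time

# Assumption: which HTTP status codes count as transient.
TRANSIENT_STATUS = {429, 500, 502, 503, 504}


def call_with_retries(send, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-transient status or retries run out.

    send is any callable returning a (status_code, body) tuple.
    The wait before retry n (0-based) is base_delay * 2**n seconds.
    """
    for attempt in range(max_retries + 1):
        status, body = send()
        if status not in TRANSIENT_STATUS:
            return status, body
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)
    # All attempts returned a transient status; hand back the last response.
    return status, body


# Usage with a fake sender that fails twice before succeeding.
responses = iter([(503, ""), (502, ""), (200, "ok")])
delays = []
result = call_with_retries(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a production client would typically also add jitter and cap the maximum delay.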