feat: add Cosmos Reason vLLM inference test case by mvinci12 · Pull Request #1111 · awslabs/awsome-distributed-ai

mvinci12 · 2026-05-27T19:45:38Z

Summary

Adds an online inference test case for NVIDIA's Cosmos Reason physical-reasoning Vision-Language Model, served by vLLM on Amazon EKS or SageMaker HyperPod EKS.

What this adds

3.test_cases/pytorch/vllm/cosmos-reason/ — 14 files, ~1300 lines.

Two parallel deployment paths (both `kubectl`-only)

Path	Container	Use case
`kubernetes/`	Upstream `vllm/vllm-openai:v0.21.0`	Plain EKS users, no HyperPod required
`hyperpod-eks/`	AWS-managed vLLM DLC (`vllm:0.17-gpu-py312`)	HyperPod EKS clusters using the Inference Operator — auto KEDA + Karpenter scale-to-zero, managed KV cache, intelligent routing

Three example use cases

Script	Pattern
`examples/image_vqa.py`	Single-image visual Q&A
`examples/video_qa.py`	Short video clip Q&A
`examples/auto_label.py`	SDG auto-labeling with structured JSON output + retry logic

Sample data

examples/download_samples.sh fetches CC-licensed media from Unsplash and Wikimedia Commons. Sample files are gitignored.

Validation

Validated end-to-end on a SageMaker HyperPod EKS cluster:

Hardware: g5.8xlarge (1× NVIDIA A10G 24 GB)
Model: nvidia/Cosmos-Reason1-7B
Container: vllm/vllm-openai:v0.21.0
Both deployment paths exercised

Design decisions

Default model is Cosmos-Reason1-7B (not Reason2) because Reason2 is HF-gated
No HPA included — inference is GPU-bound; CPU-based scaling is misleading. README points at KEDA + vllm:num_requests_running for production.
TLS verification defaults to ON; --insecure flag available for self-signed certs
auto_label.py includes --max-retries with exponential backoff for transient HTTP errors
Pre-flight envsubst dry-run validation step documented in Quick Start

Adds an online inference test case for NVIDIA Cosmos Reason physical-reasoning VLM served by vLLM on Amazon EKS and SageMaker HyperPod EKS. - Two parallel deployment paths: vanilla EKS (kubernetes/) and HyperPod Inference Operator (hyperpod-eks/) - Three example clients: image VQA, video Q&A, SDG auto-labeling - Default model: nvidia/Cosmos-Reason1-7B (Reason2 supported via MODEL_ID swap) - Validated empirically on g5.8xlarge / A10G 24 GB - download_samples.sh provides CC-licensed sample media for testing

- kubernetes/deployment.yaml: add --limit-mm-per-prompt and --mm-processor-kwargs to expand encoder cache for video workloads; bump MAX_MODEL_LEN default to 24576 - hyperpod-eks/endpoint.yaml: comment out --reasoning-parser qwen3 and SM_VLLM_REASONING_PARSER by default (incompatible with Reason1/Qwen2.5-VL; uncomment for Reason2 only) - README.md: add Validation Status section with empirical benchmark table, [!IMPORTANT] callout for video encoder cache sizing, extended Reason1-vs-Reason2 parser guidance note, and four new Troubleshooting rows Tested end-to-end on hp-cluster-mvincig-hypd-0223-d2cp (us-west-2): kubernetes/ + Reason1-7B (g5.8xlarge): PASS kubernetes/ + Reason2-8B (g6.12xlarge TP=4): PASS hyperpod-eks/ + Reason1-7B (ml.g5.8xlarge): PASS (image+label) hyperpod-eks/ + Reason2-8B (ml.g6.12xlarge TP=4): PASS (image+label)

…nor cleanups Major fixes - kubernetes/README.md: remove stale HorizontalPodAutoscaler from the file table and intro. The deployment does not include an HPA; CPU-based scaling is not a useful proxy for GPU-bound inference. KEDA on vllm:num_requests_running is recommended in the body. - README.md, hyperpod-eks/README.md: align on vllm/vllm-openai:v0.21.0 for the kubernetes/ path. Clarify that the AWS DLC vllm:0.17-gpu-py312 corresponds to vLLM 0.17.1 (separate from the upstream image). - env_vars.example, README.md: bump default MAX_MODEL_LEN to 24576 so the sample video clip works out of the box on 24 GB GPUs. Document fallback to 8192 for tight VRAM. Minor cleanups - env_vars.example, kubernetes/deployment.yaml: remove dead REASONING_PARSER and MEDIA_IO_KWARGS env-var pattern (envsubst inside YAML comments never reaches the vLLM CLI). Replace with explicit manual-edit instructions for Reason2 users. - hyperpod-eks/hf-token-secret.yaml.example, kubernetes/hf-token-secret.yaml.example: use ${NAMESPACE} for consistency with all other manifests. - examples/auto_label.py: drop unused base64 import.

bluecrayon52

Reviewed the 14 files under cosmos-reason/. Solid test case — clean two-path structure, the validation matrix is genuinely useful, secrets handling is correct (kubectl create secret default, no committed tokens), and TLS defaults to verify-on. A few items below; only the port one is functional. Not blocking.

Heads-up: this branch's merge-base is behind current main — worth a rebase before merge.

bluecrayon52 · 2026-06-05T20:42:26Z

+            - "--host"
+            - "0.0.0.0"
+            - "--port"
+            - "${INVOCATION_PORT}"


--port uses ${INVOCATION_PORT} but containerPort (93), both probes (111/119), and the Service (145-146) hardcode 8000. If anyone sets INVOCATION_PORT to a non-8000 value, vLLM moves but the probes/Service don't follow — readiness fails and traffic black-holes. Either thread ${INVOCATION_PORT} through all of them, or drop the variable here and hardcode 8000 consistently. The env default (8000) masks this in the happy path.

bluecrayon52 · 2026-06-05T20:42:26Z

+import sys
+import time
+from pathlib import Path
+from typing import Optional, Tuple


Optional and Tuple are imported but unused here (the parser is reused from image_vqa). The repo's PR workflow runs flake8 on all Python files, so these will show up as F401 in the automated lint report — worth cleaning up, though it won't fail the build.

bluecrayon52 · 2026-06-05T20:42:26Z

+| `hyperpod-eks/` | Reason1-7B | ml.g5.8xlarge (A10G, TP=1) | 21.3 s | unsupported¹ | 18.1 s |
+| `hyperpod-eks/` | Reason2-8B | ml.g6.12xlarge (4× L4, TP=4) | 13.8 s | unsupported¹ | 4.5 s |
+
+¹ The AWS vLLM DLC `vllm:0.17-gpu-py312` (vLLM 0.17.1) does not expose


Small inconsistency on the DLC version: this tag is described as vLLM 0.17.1 here and at line 238, but as 0.17.0 in env_vars.example:30 and hyperpod-eks/README.md:146. Might be worth aligning these to a single version.

bluecrayon52 · 2026-06-05T20:42:26Z

+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `MODEL_ID` | `nvidia/Cosmos-Reason1-7B` | HF model ID. Override to `Cosmos-Reason2-8B` on L40S/H100. |
+| `IMAGE_TAG` (kubernetes) | `vllm/vllm-openai:v0.21.0` | Upstream vLLM container. Pin to a specific version, never `:latest`. |


The Configuration Reference lists IMAGE_TAG, but the actual env vars are VLLM_IMAGE_VANILLA and VLLM_IMAGE_AWS_DLC (env_vars.example:27/31). IMAGE_TAG doesn't exist anywhere — a user grepping for it finds nothing.

bluecrayon52 · 2026-06-05T20:42:26Z

+
+- HyperPod EKS cluster with at least one GPU node
+- HyperPod Inference Operator installed:
+  - Helm chart `hyperpod-inference-operator` v2.1.1, image `v3.1`, OR


Version references span multiple install paths: this says Helm chart v2.1.1, the parent README says image v3.1 / CLI v3.7.1+, and the EKS add-on you recommend at line 25 versions as vX.Y.Z-eksbuild.N (currently v1.2.1-eksbuild.1). Since the add-on is the recommended path, consider documenting its version scheme too, and labeling which number belongs to which path (add-on vs Helm vs CLI). Could you also confirm v2.1.1 / v3.1 / v3.7.1 are current for the Helm/CLI install paths?

mvinci12 added 2 commits May 27, 2026 15:45

mvinci12 requested a review from bluecrayon52 May 27, 2026 23:15

mvinci12 marked this pull request as ready for review May 27, 2026 23:15

mvinci12 marked this pull request as draft May 27, 2026 23:16

mvinci12 marked this pull request as ready for review May 27, 2026 23:28

KeitaW requested a review from allela-roy May 28, 2026 08:52

bluecrayon52 reviewed Jun 5, 2026

View reviewed changes

bluecrayon52 requested changes Jun 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add Cosmos Reason vLLM inference test case#1111

feat: add Cosmos Reason vLLM inference test case#1111
mvinci12 wants to merge 3 commits into
awslabs:mainfrom
mvinci12:feat/cosmos-reason-vllm-test-case

mvinci12 commented May 27, 2026 •

edited

Loading

Uh oh!

bluecrayon52 left a comment

Uh oh!

bluecrayon52 Jun 5, 2026

Uh oh!

bluecrayon52 Jun 5, 2026

Uh oh!

bluecrayon52 Jun 5, 2026

Uh oh!

bluecrayon52 Jun 5, 2026

Uh oh!

bluecrayon52 Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mvinci12 commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What this adds

Two parallel deployment paths (both kubectl-only)

Three example use cases

Sample data

Validation

Design decisions

Uh oh!

bluecrayon52 left a comment

Choose a reason for hiding this comment

Uh oh!

bluecrayon52 Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

bluecrayon52 Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

bluecrayon52 Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

bluecrayon52 Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

bluecrayon52 Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mvinci12 commented May 27, 2026 •

edited

Loading

Two parallel deployment paths (both `kubectl`-only)