Skip to content

feat: add Cosmos Reason vLLM inference test case#1111

Open
mvinci12 wants to merge 3 commits into
awslabs:mainfrom
mvinci12:feat/cosmos-reason-vllm-test-case
Open

feat: add Cosmos Reason vLLM inference test case#1111
mvinci12 wants to merge 3 commits into
awslabs:mainfrom
mvinci12:feat/cosmos-reason-vllm-test-case

Conversation

@mvinci12

@mvinci12 mvinci12 commented May 27, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds an online inference test case for NVIDIA's Cosmos Reason physical-reasoning Vision-Language Model, served by vLLM on Amazon EKS or SageMaker HyperPod EKS.

What this adds

3.test_cases/pytorch/vllm/cosmos-reason/ — 14 files, ~1300 lines.

Two parallel deployment paths (both kubectl-only)

Path Container Use case
kubernetes/ Upstream vllm/vllm-openai:v0.21.0 Plain EKS users, no HyperPod required
hyperpod-eks/ AWS-managed vLLM DLC (vllm:0.17-gpu-py312) HyperPod EKS clusters using the Inference Operator — auto KEDA + Karpenter scale-to-zero, managed KV cache, intelligent routing

Three example use cases

Script Pattern
examples/image_vqa.py Single-image visual Q&A
examples/video_qa.py Short video clip Q&A
examples/auto_label.py SDG auto-labeling with structured JSON output + retry logic

Sample data

examples/download_samples.sh fetches CC-licensed media from Unsplash and Wikimedia Commons. Sample files are gitignored.

Validation

Validated end-to-end on a SageMaker HyperPod EKS cluster:

  • Hardware: g5.8xlarge (1× NVIDIA A10G 24 GB)
  • Model: nvidia/Cosmos-Reason1-7B
  • Container: vllm/vllm-openai:v0.21.0
  • Both deployment paths exercised

Design decisions

  • Default model is Cosmos-Reason1-7B (not Reason2) because Reason2 is HF-gated
  • No HPA included — inference is GPU-bound; CPU-based scaling is misleading. README points at KEDA + vllm:num_requests_running for production.
  • TLS verification defaults to ON; --insecure flag available for self-signed certs
  • auto_label.py includes --max-retries with exponential backoff for transient HTTP errors
  • Pre-flight envsubst dry-run validation step documented in Quick Start

mvinci12 added 2 commits May 27, 2026 15:45
Adds an online inference test case for NVIDIA Cosmos Reason physical-reasoning
VLM served by vLLM on Amazon EKS and SageMaker HyperPod EKS.

- Two parallel deployment paths: vanilla EKS (kubernetes/) and HyperPod
  Inference Operator (hyperpod-eks/)
- Three example clients: image VQA, video Q&A, SDG auto-labeling
- Default model: nvidia/Cosmos-Reason1-7B (Reason2 supported via MODEL_ID swap)
- Validated empirically on g5.8xlarge / A10G 24 GB
- download_samples.sh provides CC-licensed sample media for testing
- kubernetes/deployment.yaml: add --limit-mm-per-prompt and
  --mm-processor-kwargs to expand encoder cache for video workloads;
  bump MAX_MODEL_LEN default to 24576
- hyperpod-eks/endpoint.yaml: comment out --reasoning-parser qwen3 and
  SM_VLLM_REASONING_PARSER by default (incompatible with Reason1/Qwen2.5-VL;
  uncomment for Reason2 only)
- README.md: add Validation Status section with empirical benchmark table,
  [!IMPORTANT] callout for video encoder cache sizing, extended
  Reason1-vs-Reason2 parser guidance note, and four new Troubleshooting rows

Tested end-to-end on hp-cluster-mvincig-hypd-0223-d2cp (us-west-2):
  kubernetes/ + Reason1-7B (g5.8xlarge): PASS
  kubernetes/ + Reason2-8B (g6.12xlarge TP=4): PASS
  hyperpod-eks/ + Reason1-7B (ml.g5.8xlarge): PASS (image+label)
  hyperpod-eks/ + Reason2-8B (ml.g6.12xlarge TP=4): PASS (image+label)
@mvinci12 mvinci12 requested a review from bluecrayon52 May 27, 2026 23:15
@mvinci12 mvinci12 marked this pull request as ready for review May 27, 2026 23:15
@mvinci12 mvinci12 marked this pull request as draft May 27, 2026 23:16
…nor cleanups

Major fixes
- kubernetes/README.md: remove stale HorizontalPodAutoscaler from the file
  table and intro. The deployment does not include an HPA; CPU-based scaling
  is not a useful proxy for GPU-bound inference. KEDA on
  vllm:num_requests_running is recommended in the body.
- README.md, hyperpod-eks/README.md: align on vllm/vllm-openai:v0.21.0 for
  the kubernetes/ path. Clarify that the AWS DLC vllm:0.17-gpu-py312
  corresponds to vLLM 0.17.1 (separate from the upstream image).
- env_vars.example, README.md: bump default MAX_MODEL_LEN to 24576 so the
  sample video clip works out of the box on 24 GB GPUs. Document fallback to
  8192 for tight VRAM.

Minor cleanups
- env_vars.example, kubernetes/deployment.yaml: remove dead REASONING_PARSER
  and MEDIA_IO_KWARGS env-var pattern (envsubst inside YAML comments never
  reaches the vLLM CLI). Replace with explicit manual-edit instructions for
  Reason2 users.
- hyperpod-eks/hf-token-secret.yaml.example,
  kubernetes/hf-token-secret.yaml.example: use ${NAMESPACE} for consistency
  with all other manifests.
- examples/auto_label.py: drop unused base64 import.
@mvinci12 mvinci12 marked this pull request as ready for review May 27, 2026 23:28
@KeitaW KeitaW requested a review from allela-roy May 28, 2026 08:52

@bluecrayon52 bluecrayon52 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the 14 files under cosmos-reason/. Solid test case — clean two-path structure, the validation matrix is genuinely useful, secrets handling is correct (kubectl create secret default, no committed tokens), and TLS defaults to verify-on. A few items below; only the port one is functional. Not blocking.

Heads-up: this branch's merge-base is behind current main — worth a rebase before merge.

- "--host"
- "0.0.0.0"
- "--port"
- "${INVOCATION_PORT}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--port uses ${INVOCATION_PORT} but containerPort (93), both probes (111/119), and the Service (145-146) hardcode 8000. If anyone sets INVOCATION_PORT to a non-8000 value, vLLM moves but the probes/Service don't follow — readiness fails and traffic black-holes. Either thread ${INVOCATION_PORT} through all of them, or drop the variable here and hardcode 8000 consistently. The env default (8000) masks this in the happy path.

import sys
import time
from pathlib import Path
from typing import Optional, Tuple

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional and Tuple are imported but unused here (the parser is reused from image_vqa). The repo's PR workflow runs flake8 on all Python files, so these will show up as F401 in the automated lint report — worth cleaning up, though it won't fail the build.

| `hyperpod-eks/` | Reason1-7B | ml.g5.8xlarge (A10G, TP=1) | 21.3 s | unsupported¹ | 18.1 s |
| `hyperpod-eks/` | Reason2-8B | ml.g6.12xlarge (4× L4, TP=4) | 13.8 s | unsupported¹ | 4.5 s |

¹ The AWS vLLM DLC `vllm:0.17-gpu-py312` (vLLM 0.17.1) does not expose

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small inconsistency on the DLC version: this tag is described as vLLM 0.17.1 here and at line 238, but as 0.17.0 in env_vars.example:30 and hyperpod-eks/README.md:146. Might be worth aligning these to a single version.

| Variable | Default | Purpose |
|----------|---------|---------|
| `MODEL_ID` | `nvidia/Cosmos-Reason1-7B` | HF model ID. Override to `Cosmos-Reason2-8B` on L40S/H100. |
| `IMAGE_TAG` (kubernetes) | `vllm/vllm-openai:v0.21.0` | Upstream vLLM container. Pin to a specific version, never `:latest`. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Configuration Reference lists IMAGE_TAG, but the actual env vars are VLLM_IMAGE_VANILLA and VLLM_IMAGE_AWS_DLC (env_vars.example:27/31). IMAGE_TAG doesn't exist anywhere — a user grepping for it finds nothing.


- HyperPod EKS cluster with at least one GPU node
- HyperPod Inference Operator installed:
- Helm chart `hyperpod-inference-operator` v2.1.1, image `v3.1`, OR

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version references span multiple install paths: this says Helm chart v2.1.1, the parent README says image v3.1 / CLI v3.7.1+, and the EKS add-on you recommend at line 25 versions as vX.Y.Z-eksbuild.N (currently v1.2.1-eksbuild.1). Since the add-on is the recommended path, consider documenting its version scheme too, and labeling which number belongs to which path (add-on vs Helm vs CLI). Could you also confirm v2.1.1 / v3.1 / v3.7.1 are current for the Helm/CLI install paths?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants