Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
250 changes: 250 additions & 0 deletions 3.test_cases/pytorch/vllm/cosmos-reason/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,250 @@
# NVIDIA Cosmos Reason — vLLM Inference on Amazon EKS / HyperPod

Online inference reference for NVIDIA's [Cosmos Reason](https://www.nvidia.com/en-us/ai/cosmos/)
physical-reasoning Vision-Language Model, served by [vLLM](https://github.com/vllm-project/vllm)
on Amazon EKS or SageMaker HyperPod EKS.

Cosmos Reason is the physical-reasoning member of NVIDIA's Cosmos World Foundation Model
platform. It generates chain-of-thought reasoning traces about safety, causality, object
interactions, and spatiotemporal dynamics in videos and images — the building block for
auto-labeling Synthetic Data Generation (SDG) pipelines
([as adopted by Uber per NVIDIA SIGGRAPH 2025](https://blogs.nvidia.com/blog/nemotron-cosmos-reasoning-enterprise-physical-ai/)),
Reasoning VLA models, and video Q&A workloads.

## Production Use Cases

| Use Case | Pattern | Example |
|----------|---------|---------|
| AV training data labeling | SDG critic loop — auto-label scenes with structured metadata (objects, hazards, weather) | [AV data captioning (NVIDIA SIGGRAPH 2025)](https://blogs.nvidia.com/blog/nemotron-cosmos-reasoning-enterprise-physical-ai/) |
| Video scene understanding | Short clip Q&A with chain-of-thought physical reasoning | Dashcam review, warehouse safety monitoring |
| Image VQA with physical reasoning | Single-image spatial/causal analysis | Content moderation, quality inspection |
| SDG critic / verifier | Judge whether Cosmos Predict outputs are physically plausible before entering training sets | Synthetic data filtering |
| RL reward model | Score trajectories for physical coherence as a reward signal in RLHF or model-based RL | Robotics policy training, AV planning |
| Offline RL annotation | Label trajectory data with reasoning traces for offline RL training | Decision Transformer reward labels |

## Two Paths
This test case provides **two parallel deployment paths**, both `kubectl`-only:

```
Cosmos Reason on EKS
|
+---------------+---------------+
| |
kubernetes/ hyperpod-eks/
(vanilla EKS Deployment + (HyperPod Inference Operator
Service + HPA, upstream InferenceEndpointConfig CRD,
vllm/vllm-openai image) AWS-managed vLLM DLC image)
| |
+---------------+---------------+
|
OpenAI-compatible
/v1/chat/completions
|
examples/
image VQA / video VQA / SDG auto-label
```

### Why two paths?

| Path | Purpose |
|------|---------|
| `kubernetes/` | Plain EKS users who already have a vLLM deployment. No HyperPod required. Vendor `vllm/vllm-openai` image. |
| `hyperpod-eks/` | HyperPod EKS clusters using the [Inference Operator](https://aws.amazon.com/blogs/architecture/unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod/) — auto KEDA + Karpenter scale-to-zero, managed KV cache, intelligent routing. AWS-managed `vllm` Deep Learning Container. |

Both paths serve the **same** model with the **same** vLLM CLI args. Pick the path that
matches your platform.

## Prerequisites

| Requirement | Detail |
|-------------|--------|
| EKS cluster | Kubernetes ≥ 1.28, GPU-capable |
| GPU node | One of: g5.* (A10G 24 GB), g6.12xlarge (4× L4 24 GB), g6e.* (L40S 48 GB), p4d/p5/p5e (H100/H200) |
| NVIDIA device plugin | `nvidia-device-plugin` DaemonSet running on GPU nodes |
| Hugging Face token | Required — Cosmos Reason models are gated on HF (NVIDIA Open Model License acceptance). [Request access](https://huggingface.co/nvidia/Cosmos-Reason1-7B) on the model card first. |
| For `hyperpod-eks/` only | HyperPod Inference Operator installed (`hyperpod-inference-operator` Helm chart, image `v3.1`). Comes with `sagemaker-hyperpod-cli` v3.7.1+ or one-click EKS add-on. |

## Models

This test case is parameterized via `${MODEL_ID}` in `env_vars.example`. Supported models:

| Model | Backbone | Min GPU memory | Recommended `--max-model-len` | Reasoning parser |
|-------|----------|---------------|------------------------------|------------------|
| **`nvidia/Cosmos-Reason1-7B`** (default for sm_8x GPUs) | Qwen2.5-VL | 24 GB (A10G/L4 OK) | 24576 (24 GB GPUs) / 32768 (40+ GB) | None — `<think>` is inline in `content` |
| **`nvidia/Cosmos-Reason2-2B`** | Qwen3-VL | 24 GB | 16384 | `--reasoning-parser qwen3` (separates `<think>` into `reasoning_content`) |
| **`nvidia/Cosmos-Reason2-8B`** (default for sm_9x GPUs) | Qwen3-VL | 32 GB (L40S/H100/H200) | 16384 | `--reasoning-parser qwen3` |

> [!NOTE]
> NVIDIA validates Cosmos Reason on H100 / GB200 / DGX Spark / Jetson AGX Thor.
> A10G (sm_86) and L4 (sm_89) are not on NVIDIA's official validated list but work in
> practice; this test case has been empirically validated on g5.8xlarge (1× A10G 24 GB).

## Validation Status

The configurations below were tested end-to-end on a SageMaker HyperPod EKS cluster
in `us-west-2`. All three example clients (`image_vqa`, `video_qa`, `auto_label`)
were exercised against both deployment paths.

| Path | Model | Hardware | image_vqa | video_qa | auto_label |
|------|-------|----------|-----------|----------|------------|
| `kubernetes/` | Reason1-7B | g5.8xlarge (A10G, TP=1) | 18.3 s | 28.9 s | 20.3 s |
| `kubernetes/` | Reason2-8B | g6.12xlarge (4× L4, TP=4) | 8.2 s | 16.9 s | 3.4 s |
| `hyperpod-eks/` | Reason1-7B | ml.g5.8xlarge (A10G, TP=1) | 21.3 s | unsupported¹ | 18.1 s |
| `hyperpod-eks/` | Reason2-8B | ml.g6.12xlarge (4× L4, TP=4) | 13.8 s | unsupported¹ | 4.5 s |

¹ The AWS vLLM DLC `vllm:0.17-gpu-py312` (vLLM 0.17.1) does not expose

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small inconsistency on the DLC version: this tag is described as vLLM 0.17.1 here and at line 238, but as 0.17.0 in env_vars.example:30 and hyperpod-eks/README.md:146. Might be worth aligning these to a single version.

`--mm-processor-kwargs` through the Inference Operator. The sample 5.3 MB meteor
clip tokenizes to ~19K embedding tokens, exceeding the default pre-allocated
16384-token encoder cache. The `kubernetes/` path uses upstream vLLM v0.21.0 and
includes the necessary args by default — use it for video workloads.

### Empirical findings

- **Reason2-8B on g6.12xlarge (4× L4, TP=4) is 2–5× faster** than Reason1-7B on a
single A10G for image workloads. Speedup is driven by tensor parallelism and the
L4's higher fp16/bf16 throughput, not model size.
- **Video workloads need expanded encoder cache.** Default vLLM config (16384-token
cache) cannot handle the sample video; shipped `kubernetes/deployment.yaml` adds
`--limit-mm-per-prompt`, `--mm-processor-kwargs '{"max_pixels":20000000,"fps":1.0}'`,
and bumps `MAX_MODEL_LEN=24576` to make video work out of the box.
- **`--reasoning-parser qwen3` is Reason2-only.** Enabling it for Reason1
(Qwen2.5-VL backbone) causes `RuntimeError: Engine core initialization failed`.
`hyperpod-eks/endpoint.yaml` ships with it commented out by default — uncomment
for Reason2.

## Quick Start (vanilla EKS, vendor image)

> [!IMPORTANT]
> **Video workloads require expanded encoder cache.**
> `kubernetes/deployment.yaml` ships with `--limit-mm-per-prompt`,
> `--mm-processor-kwargs '{"max_pixels":20000000,"fps":1.0}'`, and `MAX_MODEL_LEN=24576`
> so the sample video works without modification. If you swap in a larger or
> higher-resolution clip and see `400 exceeds the pre-allocated encoder cache size`,
> increase `max_pixels`. The `hyperpod-eks/` DLC path does not currently expose
> these flags through the Inference Operator and is recommended for image workloads only.

```bash
# 1. Set environment
cp env_vars.example env_vars
# Edit env_vars — at minimum set HF_TOKEN
source env_vars

# 2. Validate required variables are set
for v in INSTANCE_TYPE MODEL_ID NAMESPACE TENSOR_PARALLEL_SIZE \
MAX_MODEL_LEN GPU_MEMORY_UTILIZATION DTYPE INVOCATION_PORT \
VLLM_IMAGE_VANILLA HF_TOKEN; do
[ -z "${!v}" ] && echo "ERROR: \$$v is unset" && exit 1
done

# 3. Create HF token Secret
kubectl create secret generic hf-token \
--from-literal=token=${HF_TOKEN}

# 4. Dry-run to confirm manifest renders correctly
envsubst < kubernetes/deployment.yaml | kubectl apply --dry-run=client -f -

# 5. Deploy
envsubst < kubernetes/deployment.yaml | kubectl apply -f -

# 6. Wait for /health (3-5 min for first launch — HF download + CUDA graph)
kubectl wait --for=condition=Ready pod -l app=cosmos-reason --timeout=10m

# 7. Port-forward and test
kubectl port-forward deploy/cosmos-reason 8000:8000 &
python3 examples/image_vqa.py --image examples/sample.jpg
```

> [!NOTE]
> The default `<think>/<answer>` system prompt and `--reasoning-parser qwen3` are
> mutually exclusive:
> - **Reason1 (Qwen2.5-VL):** Keep the default system prompt. Do NOT enable the
> reasoning parser. `<think>...</think>` appears inline in `content`.
> - **Reason2 (Qwen3-VL):** Omit the system prompt (`--system-prompt ""`) and
> enable `--reasoning-parser qwen3`. Reasoning is split into `reasoning_content`.
> Leaving the system prompt on can cause Reason2 to produce minimal answers.

## Quick Start (HyperPod Inference Operator)

```bash
# 1. Set environment (same env_vars as above)
source env_vars

# 2. Create HF token Secret in the same namespace as the InferenceEndpointConfig
kubectl create secret generic hf-token \
--from-literal=token=${HF_TOKEN}

# 3. Deploy via the operator
envsubst < hyperpod-eks/endpoint.yaml | kubectl apply -f -

# 4. Wait for the operator to mark the endpoint Ready
kubectl get inferenceendpointconfigs -w

# 5. Invoke through the operator-managed ALB (or via SageMaker SDK)
python3 examples/image_vqa.py \
--endpoint $(kubectl get inferenceendpointconfig cosmos-reason -o jsonpath='{.status.endpointUrl}') \
--image examples/sample.jpg
```

## Use Cases (verified examples in `examples/`)

| Script | What it does |
|--------|--------------|
| `examples/image_vqa.py` | Single-image visual question answering. Pattern: drive-recorder review, content moderation. |
| `examples/video_qa.py` | Short video clip Q&A. Pattern: AV scene understanding (Uber pattern). |
| `examples/auto_label.py` | Batch auto-labeling with `<think>...</think><answer>...</answer>` schema. Pattern: SDG critic loop, training data curation. |

## Configuration Reference

| Variable | Default | Purpose |
|----------|---------|---------|
| `MODEL_ID` | `nvidia/Cosmos-Reason1-7B` | HF model ID. Override to `Cosmos-Reason2-8B` on L40S/H100. |
| `IMAGE_TAG` (kubernetes) | `vllm/vllm-openai:v0.21.0` | Upstream vLLM container. Pin to a specific version, never `:latest`. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Configuration Reference lists IMAGE_TAG, but the actual env vars are VLLM_IMAGE_VANILLA and VLLM_IMAGE_AWS_DLC (env_vars.example:27/31). IMAGE_TAG doesn't exist anywhere — a user grepping for it finds nothing.

| `IMAGE_TAG` (hyperpod-eks) | `vllm:0.17-gpu-py312` (AWS DLC) | AWS-managed vLLM DLC. ECR path: `763104351884.dkr.ecr.${AWS_REGION}.amazonaws.com/vllm:0.17-gpu-py312` |
| `INSTANCE_TYPE` | `ml.g5.8xlarge` | A10G 24 GB. Other validated: `ml.g6.12xlarge` (4×L4), `ml.g6e.4xlarge` (1×L40S 48 GB). |
| `MAX_MODEL_LEN` | `24576` | Sized for video out-of-the-box on 24 GB GPUs. Reduce to `8192` if OOM during CUDA graph capture; increase to `32768` on 40 GB+ GPUs. Cosmos Reason native context is 256K — must reduce for non-H100 hardware. |
| `GPU_MEMORY_UTILIZATION` | `0.92` | vLLM target memory headroom. Reduce to `0.85` if OOM during CUDA graph capture. |
| `TENSOR_PARALLEL_SIZE` | `1` | Single GPU for 7B/2B. Set to `4` for 8B on g6.12xlarge (4× L4 24 GB). |
| `NAMESPACE` | `default` | Kubernetes namespace |
| `HF_TOKEN` | (none — required) | Hugging Face token with model access. Stored as a `Secret`. |

## Cleanup

```bash
# Vanilla EKS path
kubectl delete -f kubernetes/

# HyperPod Inference Operator path
kubectl delete -f hyperpod-eks/

# Both paths
kubectl delete secret hf-token
```

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| `GatedRepoError: 401` on first deploy | No `HF_TOKEN` provided | Set `HF_TOKEN` env var, recreate `hf-token` Secret |
| `GatedRepoError: 403` after providing token | Token valid but account not on access list | Visit the model card on HF and click "Request Access". NVIDIA-gated models require accepting the NVIDIA Open Model License. |
| Pod stuck in `ContainerCreating` for 5+ min | Image pull (12 GB vendor / 23 GB AWS DLC) | Normal on first deploy. Check `kubectl describe pod` for "Pulling image". |
| Pod `Running` but `/health` returns 404 | vLLM still loading model + compiling CUDA graphs | Wait. First-launch is 3-8 min on 7B-class models. With `--enforce-eager` skips CUDA graphs (faster startup, slower inference). |
| `OutOfMemoryError` during CUDA graph capture | `--gpu-memory-utilization` too high or `--max-model-len` too long | Drop `--gpu-memory-utilization` to `0.85`, drop `--max-model-len` to `4096`. |
| Flash-Attention `headdim not multiple of 32` error (Reason2 / Qwen3-VL only) | vLLM internal fork of FA rejects Qwen3-VL ViT head dims | Do NOT set `VLLM_ATTENTION_BACKEND=FLASH_ATTN`. Let vLLM auto-pick. Issue [#27562](https://github.com/vllm-project/vllm/issues/27562) closed Apr 2026; v0.21.0+ is fixed. |
| Reason2 returns `<think>` text inline in `content` (not separate `reasoning_content`) | Missing `--reasoning-parser qwen3` | Add `--reasoning-parser qwen3` to vLLM args. Required for Reason2; not applicable to Reason1. |
| Latency low / TTFT high | `--enforce-eager` skips CUDA graphs | Remove `--enforce-eager` to enable graph compilation. Adds ~2 min to startup but ~30% throughput improvement. |
| `400 The decoder prompt contains a(n) video item with X embedding tokens, which exceeds the pre-allocated encoder cache size` | Default encoder cache too small for the input video (16384 for Reason1, ~5000 for Reason2-8B) | On the `kubernetes/` path, raise `--mm-processor-kwargs '{"max_pixels":...,"fps":1.0}'` until cache > video tokens. On `hyperpod-eks/`, use a shorter clip (≤5 s @ 480p) or switch to `kubernetes/`. |
| `RuntimeError: Engine core initialization failed` after model load on `hyperpod-eks/` with Reason1 | `--reasoning-parser qwen3` enabled but Reason1 uses Qwen2.5-VL backbone (parser is Qwen3-only) | Comment out `--reasoning-parser qwen3` and `SM_VLLM_REASONING_PARSER=qwen3` in `hyperpod-eks/endpoint.yaml`. Re-enable only for Reason2. |
| `kubectl get secret hf-token -o jsonpath='{.data.token}' \| base64 -d` returns `REPLACE_WITH_HF_TOKEN` | Applied `hf-token-secret.yaml.example` directly without replacing the literal placeholder (it is not an envsubst template) | Use `kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN` per Quick Start. The `.example` YAML is reference-only. |
| Video request 400 on `hyperpod-eks/` even with short clip | DLC v0.17.1 does not surface `--mm-processor-kwargs` or `--limit-mm-per-prompt` to the Inference Operator | Use the `kubernetes/` path for video workloads. The DLC ships with a fixed encoder-cache budget; expanding it requires a custom DLC image or a newer DLC tag when available. |

## References

- NVIDIA Cosmos: https://www.nvidia.com/en-us/ai/cosmos/
- Cosmos Reason1-7B model card: https://huggingface.co/nvidia/Cosmos-Reason1-7B
- Cosmos Reason2-8B model card: https://huggingface.co/nvidia/Cosmos-Reason2-8B
- Cosmos Reason2 repo (NVIDIA): https://github.com/nvidia-cosmos/cosmos-reason2
- vLLM: https://github.com/vllm-project/vllm
- vLLM Qwen3-VL recipe: https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-VL.html
- AWS vLLM DLC repo: https://github.com/aws/deep-learning-containers
- HyperPod Inference Operator setup blog: https://aws.amazon.com/blogs/architecture/unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod/
- HyperPod Inference Operator best practices: https://aws.amazon.com/blogs/machine-learning/best-practices-to-run-inference-on-amazon-sagemaker-hyperpod/
63 changes: 63 additions & 0 deletions 3.test_cases/pytorch/vllm/cosmos-reason/env_vars.example
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
#
# Source this file with: source env_vars
# All variables here are consumed by the kubernetes/ and hyperpod-eks/ manifests
# (rendered with `envsubst`) and by the examples/ scripts.

# ---- AWS / cluster context ----
export AWS_REGION="us-west-2"
export AWS_ACCOUNT_ID="123456789012"
export NAMESPACE="default"

# ---- Model selection ----
# Default: Cosmos-Reason1-7B (Qwen2.5-VL backbone). Fits on A10G 24 GB / L4 24 GB / L40S 48 GB.
# Alternates:
# nvidia/Cosmos-Reason2-2B (Qwen3-VL, 2B params, also fits 24 GB GPUs)
# nvidia/Cosmos-Reason2-8B (Qwen3-VL, 8B params, requires ≥32 GB GPU)
export MODEL_ID="nvidia/Cosmos-Reason1-7B"

# Cosmos Reason models are gated on Hugging Face — accept terms on the model card first,
# then create a token at https://huggingface.co/settings/tokens with read access.
# DO NOT commit this value. Pass it via your shell or the CI secret store.
export HF_TOKEN=""

# ---- vLLM container ----
# kubernetes/ (vanilla EKS) — upstream image
export VLLM_IMAGE_VANILLA="vllm/vllm-openai:v0.21.0"

# hyperpod-eks/ (Inference Operator) — AWS-managed vLLM DLC
# Tags: vllm:0.17-gpu-py312 (vLLM 0.17.0) | vllm:server-sagemaker-cuda-v1 (vLLM 0.19.1)
export VLLM_IMAGE_AWS_DLC="763104351884.dkr.ecr.${AWS_REGION}.amazonaws.com/vllm:0.17-gpu-py312"

# Additionally for HyperPod path: set the TLS bucket. You can find this in the HyperPod console within the `Inference` tab called `S3 bucket for TLS certificates`
export TLS_CERT_S3_URI="s3://hyperpod-tls-<id>/certs"

# ---- Hardware sizing ----
# Validated combinations (model | GPU | TP | max-model-len):
# Cosmos-Reason1-7B | A10G 24G | TP=1 | 24576 (reduce to 8192 if OOM during CUDA graph capture)
# Cosmos-Reason1-7B | L4 24G | TP=1 | 24576 (reduce to 8192 if OOM during CUDA graph capture)
# Cosmos-Reason1-7B | L40S 48G | TP=1 | 32768
# Cosmos-Reason2-2B | A10G/L4 | TP=1 | 16384
# Cosmos-Reason2-8B | L40S 48G | TP=1 | 16384
# Cosmos-Reason2-8B | g6.12xl | TP=4 | 16384 (4× L4, PCIe-only — slower than NVLink)
# Cosmos-Reason2-8B | H100 80G | TP=1 | 32768
export INSTANCE_TYPE="g5.8xlarge"
export HYPERPOD_INSTANCE_TYPE="ml.${INSTANCE_TYPE}"
export TENSOR_PARALLEL_SIZE="1"
export MAX_MODEL_LEN="24576"
export GPU_MEMORY_UTILIZATION="0.92"

# ---- vLLM serving args ----
# DTYPE: bfloat16 is the only NVIDIA-tested precision for Cosmos Reason
export DTYPE="bfloat16"

# ---- Reason2 (Qwen3-VL) vLLM args ----
# If deploying Cosmos-Reason2-*, you must manually edit the deployment manifest:
# kubernetes/deployment.yaml: uncomment the --reasoning-parser and --media-io-kwargs lines
# hyperpod-eks/endpoint.yaml: uncomment the --reasoning-parser and SM_VLLM_REASONING_PARSER lines
# See the Troubleshooting section in README.md for details.

# ---- Endpoint defaults ----
export ENDPOINT_NAME="cosmos-reason"
export INVOCATION_PORT="8000"
8 changes: 8 additions & 0 deletions 3.test_cases/pytorch/vllm/cosmos-reason/examples/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
# Downloaded sample media (fetched by download_samples.sh)
sample.jpg
sample_meteor.webm
*.webm
*.mp4

# JSONL output from auto_label.py
*.jsonl
Loading