diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/README.md b/3.test_cases/pytorch/vllm/cosmos-reason/README.md
new file mode 100644
index 000000000..629e33db8
--- /dev/null
+++ b/3.test_cases/pytorch/vllm/cosmos-reason/README.md
@@ -0,0 +1,250 @@
+# NVIDIA Cosmos Reason — vLLM Inference on Amazon EKS / HyperPod
+
+Online inference reference for NVIDIA's [Cosmos Reason](https://www.nvidia.com/en-us/ai/cosmos/)
+physical-reasoning Vision-Language Model, served by [vLLM](https://github.com/vllm-project/vllm)
+on Amazon EKS or SageMaker HyperPod EKS.
+
+Cosmos Reason is the physical-reasoning member of NVIDIA's Cosmos World Foundation Model
+platform. It generates chain-of-thought reasoning traces about safety, causality, object
+interactions, and spatiotemporal dynamics in videos and images — the building block for
+auto-labeling Synthetic Data Generation (SDG) pipelines
+([as adopted by Uber per NVIDIA SIGGRAPH 2025](https://blogs.nvidia.com/blog/nemotron-cosmos-reasoning-enterprise-physical-ai/)),
+Reasoning VLA models, and video Q&A workloads.
+
+## Production Use Cases
+
+| Use Case | Pattern | Example |
+|----------|---------|---------|
+| AV training data labeling | SDG critic loop — auto-label scenes with structured metadata (objects, hazards, weather) | [AV data captioning (NVIDIA SIGGRAPH 2025)](https://blogs.nvidia.com/blog/nemotron-cosmos-reasoning-enterprise-physical-ai/) |
+| Video scene understanding | Short clip Q&A with chain-of-thought physical reasoning | Dashcam review, warehouse safety monitoring |
+| Image VQA with physical reasoning | Single-image spatial/causal analysis | Content moderation, quality inspection |
+| SDG critic / verifier | Judge whether Cosmos Predict outputs are physically plausible before entering training sets | Synthetic data filtering |
+| RL reward model | Score trajectories for physical coherence as a reward signal in RLHF or model-based RL | Robotics policy training, AV planning |
+| Offline RL annotation | Label trajectory data with reasoning traces for offline RL training | Decision Transformer reward labels |
+
+## Two Paths
+This test case provides **two parallel deployment paths**, both `kubectl`-only:
+
+```
+                      Cosmos Reason on EKS
+                              |
+              +---------------+---------------+
+              |                               |
+         kubernetes/                    hyperpod-eks/
+   (vanilla EKS Deployment +     (HyperPod Inference Operator
+    Service + HPA, upstream       InferenceEndpointConfig CRD,
+    vllm/vllm-openai image)       AWS-managed vLLM DLC image)
+              |                               |
+              +---------------+---------------+
+                              |
+                       OpenAI-compatible
+                       /v1/chat/completions
+                              |
+                          examples/
+                  image VQA / video VQA / SDG auto-label
+```
+
+### Why two paths?
+
+| Path | Purpose |
+|------|---------|
+| `kubernetes/` | Plain EKS users who already have a vLLM deployment. No HyperPod required. Vendor `vllm/vllm-openai` image. |
+| `hyperpod-eks/` | HyperPod EKS clusters using the [Inference Operator](https://aws.amazon.com/blogs/architecture/unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod/) — auto KEDA + Karpenter scale-to-zero, managed KV cache, intelligent routing. AWS-managed `vllm` Deep Learning Container. |
+
+Both paths serve the **same** model with the **same** vLLM CLI args. Pick the path that
+matches your platform.
+
+## Prerequisites
+
+| Requirement | Detail |
+|-------------|--------|
+| EKS cluster | Kubernetes ≥ 1.28, GPU-capable |
+| GPU node | One of: g5.* (A10G 24 GB), g6.12xlarge (4× L4 24 GB), g6e.* (L40S 48 GB), p4d/p5/p5e (A100/H100/H200) |
+| NVIDIA device plugin | `nvidia-device-plugin` DaemonSet running on GPU nodes |
+| Hugging Face token | Required — Cosmos Reason models are gated on HF (NVIDIA Open Model License acceptance). [Request access](https://huggingface.co/nvidia/Cosmos-Reason1-7B) on the model card first. |
+| For `hyperpod-eks/` only | HyperPod Inference Operator installed. Recommended: EKS add-on `amazon-sagemaker-hyperpod-inference` (`v1.3.0-eksbuild.1`). Alternatives: `sagemaker-hyperpod-cli` v3.7.0+ (`hyp install`), or Helm chart `hyperpod-inference-operator` v2.1.0 (operator image `v3.1`). See [hyperpod-eks/README.md](hyperpod-eks/README.md#prerequisites). |
+
+## Models
+
+This test case is parameterized via `${MODEL_ID}` in `env_vars.example`. Supported models:
+
+| Model | Backbone | Min GPU memory | Recommended `--max-model-len` | Reasoning parser |
+|-------|----------|---------------|------------------------------|------------------|
+| **`nvidia/Cosmos-Reason1-7B`** (default for sm_8x GPUs) | Qwen2.5-VL | 24 GB (A10G/L4 OK) | 24576 (24 GB GPUs) / 32768 (40+ GB) | None — `<think>` is inline in `content` |
+| **`nvidia/Cosmos-Reason2-2B`** | Qwen3-VL | 24 GB | 16384 | `--reasoning-parser qwen3` (separates `<think>` into `reasoning_content`) |
+| **`nvidia/Cosmos-Reason2-8B`** (default for sm_9x GPUs) | Qwen3-VL | 32 GB (L40S/H100/H200) | 16384 | `--reasoning-parser qwen3` |
+
+> [!NOTE]
+> NVIDIA validates Cosmos Reason on H100 / GB200 / DGX Spark / Jetson AGX Thor.
+> A10G (sm_86) and L4 (sm_89) are not on NVIDIA's official validated list but work in
+> practice; this test case has been empirically validated on g5.8xlarge (1× A10G 24 GB).
+
+## Validation Status
+
+The configurations below were tested end-to-end on a SageMaker HyperPod EKS cluster
+in `us-west-2`. All three example clients (`image_vqa`, `video_qa`, `auto_label`)
+were exercised against both deployment paths.
+
+| Path | Model | Hardware | image_vqa | video_qa | auto_label |
+|------|-------|----------|-----------|----------|------------|
+| `kubernetes/` | Reason1-7B | g5.8xlarge (A10G, TP=1) | 18.3 s | 28.9 s | 20.3 s |
+| `kubernetes/` | Reason2-8B | g6.12xlarge (4× L4, TP=4) | 8.2 s | 16.9 s | 3.4 s |
+| `hyperpod-eks/` | Reason1-7B | ml.g5.8xlarge (A10G, TP=1) | 21.3 s | unsupported¹ | 18.1 s |
+| `hyperpod-eks/` | Reason2-8B | ml.g6.12xlarge (4× L4, TP=4) | 13.8 s | unsupported¹ | 4.5 s |
+
+¹ The AWS vLLM DLC `vllm:0.17-gpu-py312` (vLLM 0.17.0) does not expose
+`--mm-processor-kwargs` through the Inference Operator. The sample 5.3 MB meteor
+clip tokenizes to ~19K embedding tokens, exceeding the default pre-allocated
+16384-token encoder cache. The `kubernetes/` path uses upstream vLLM v0.21.0 and
+includes the necessary args by default — use it for video workloads.
+
+### Empirical findings
+
+- **Reason2-8B on g6.12xlarge (4× L4, TP=4) is roughly 1.5–6× faster** than Reason1-7B on a
+  single A10G for image workloads (`image_vqa` and `auto_label` in the table above). The
+  speedup is driven by tensor parallelism and the L4's higher fp16/bf16 throughput, not model size.
+- **Video workloads need expanded encoder cache.** Default vLLM config (16384-token
+  cache) cannot handle the sample video; shipped `kubernetes/deployment.yaml` adds
+  `--limit-mm-per-prompt`, `--mm-processor-kwargs '{"max_pixels":20000000,"fps":1.0}'`,
+  and bumps `MAX_MODEL_LEN=24576` to make video work out of the box.
+- **`--reasoning-parser qwen3` is Reason2-only.** Enabling it for Reason1
+  (Qwen2.5-VL backbone) causes `RuntimeError: Engine core initialization failed`.
+  `hyperpod-eks/endpoint.yaml` ships with it commented out by default — uncomment
+  for Reason2.
+
+## Quick Start (vanilla EKS, vendor image)
+
+> [!IMPORTANT]
+> **Video workloads require expanded encoder cache.**
+> `kubernetes/deployment.yaml` ships with `--limit-mm-per-prompt`,
+> `--mm-processor-kwargs '{"max_pixels":20000000,"fps":1.0}'`, and `MAX_MODEL_LEN=24576`
+> so the sample video works without modification. If you swap in a larger or
+> higher-resolution clip and see `400 exceeds the pre-allocated encoder cache size`,
+> increase `max_pixels`. The `hyperpod-eks/` DLC path does not currently expose
+> these flags through the Inference Operator and is recommended for image workloads only.
+
+```bash
+# 1. Set environment
+cp env_vars.example env_vars
+# Edit env_vars — at minimum set HF_TOKEN
+source env_vars
+
+# 2. Validate required variables are set
+for v in INSTANCE_TYPE MODEL_ID NAMESPACE TENSOR_PARALLEL_SIZE \
+         MAX_MODEL_LEN GPU_MEMORY_UTILIZATION DTYPE INVOCATION_PORT \
+         VLLM_IMAGE_VANILLA HF_TOKEN; do
+  [ -z "${!v}" ] && echo "ERROR: \$$v is unset" && exit 1
+done
+
+# 3. Create HF token Secret
+kubectl create secret generic hf-token \
+  --from-literal=token=${HF_TOKEN}
+
+# 4. Dry-run to confirm manifest renders correctly
+envsubst < kubernetes/deployment.yaml | kubectl apply --dry-run=client -f -
+
+# 5. Deploy
+envsubst < kubernetes/deployment.yaml | kubectl apply -f -
+
+# 6. Wait for /health (3-5 min for first launch — HF download + CUDA graph)
+kubectl wait --for=condition=Ready pod -l app=cosmos-reason --timeout=10m
+
+# 7. Port-forward and test
+kubectl port-forward deploy/cosmos-reason 8000:8000 &
+python3 examples/image_vqa.py --image examples/sample.jpg
+```
+
+> [!NOTE]
+> The default `<think>/<answer>` system prompt and `--reasoning-parser qwen3` are
+> mutually exclusive:
+> - **Reason1 (Qwen2.5-VL):** Keep the default system prompt. Do NOT enable the
+>   reasoning parser. `<think>...</think>` appears inline in `content`.
+> - **Reason2 (Qwen3-VL):** Omit the system prompt (`--system-prompt ""`) and
+>   enable `--reasoning-parser qwen3`. Reasoning is split into `reasoning_content`.
+>   Leaving the system prompt on can cause Reason2 to produce minimal answers.
+
+## Quick Start (HyperPod Inference Operator)
+
+```bash
+# 1. Set environment (same env_vars as above)
+source env_vars
+
+# 2. Create HF token Secret in the same namespace as the InferenceEndpointConfig
+kubectl create secret generic hf-token \
+  --from-literal=token=${HF_TOKEN}
+
+# 3. Deploy via the operator
+envsubst < hyperpod-eks/endpoint.yaml | kubectl apply -f -
+
+# 4. Wait for the operator to mark the endpoint Ready
+kubectl get inferenceendpointconfigs -w
+
+# 5. Invoke through the operator-managed ALB (or via SageMaker SDK)
+python3 examples/image_vqa.py \
+  --endpoint $(kubectl get inferenceendpointconfig cosmos-reason -o jsonpath='{.status.endpointUrl}') \
+  --image examples/sample.jpg
+```
+
+## Use Cases (verified examples in `examples/`)
+
+| Script | What it does |
+|--------|--------------|
+| `examples/image_vqa.py` | Single-image visual question answering. Pattern: dashcam review, content moderation. |
+| `examples/video_qa.py` | Short video clip Q&A. Pattern: AV scene understanding (Uber pattern). |
+| `examples/auto_label.py` | Batch auto-labeling with `<think>...</think><answer>...</answer>` schema. Pattern: SDG critic loop, training data curation. |
+
+## Configuration Reference
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `MODEL_ID` | `nvidia/Cosmos-Reason1-7B` | HF model ID. Override to `nvidia/Cosmos-Reason2-8B` on L40S/H100. |
+| `VLLM_IMAGE_VANILLA` (kubernetes) | `vllm/vllm-openai:v0.21.0` | Upstream vLLM container. Pin to a specific version, never `:latest`. |
+| `VLLM_IMAGE_AWS_DLC` (hyperpod-eks) | `vllm:0.17-gpu-py312` (AWS DLC) | AWS-managed vLLM DLC. ECR path: `763104351884.dkr.ecr.${AWS_REGION}.amazonaws.com/vllm:0.17-gpu-py312` |
+| `INSTANCE_TYPE` | `g5.8xlarge` | A10G 24 GB. Other validated: `g6.12xlarge` (4×L4), `g6e.4xlarge` (1×L40S 48 GB). `env_vars.example` derives `HYPERPOD_INSTANCE_TYPE` by prefixing `ml.`. |
+| `MAX_MODEL_LEN` | `24576` | Sized for video out-of-the-box on 24 GB GPUs. Reduce to `8192` if OOM during CUDA graph capture; increase to `32768` on 40 GB+ GPUs. Cosmos Reason native context is 256K — must reduce for non-H100 hardware. |
+| `GPU_MEMORY_UTILIZATION` | `0.92` | Fraction of each GPU's memory that vLLM may use (`--gpu-memory-utilization`). Reduce to `0.85` if OOM during CUDA graph capture. |
+| `TENSOR_PARALLEL_SIZE` | `1` | Single GPU for 7B/2B. Set to `4` for 8B on g6.12xlarge (4× L4 24 GB). |
+| `NAMESPACE` | `default` | Kubernetes namespace |
+| `HF_TOKEN` | (none — required) | Hugging Face token with model access. Stored as a `Secret`. |
+
+## Cleanup
+
+```bash
+# Vanilla EKS path
+envsubst < kubernetes/deployment.yaml | kubectl delete -f -
+
+# HyperPod Inference Operator path
+envsubst < hyperpod-eks/endpoint.yaml | kubectl delete -f -
+
+# Both paths
+kubectl delete secret hf-token
+```
+
+## Troubleshooting
+
+| Symptom | Cause | Fix |
+|---------|-------|-----|
+| `GatedRepoError: 401` on first deploy | No `HF_TOKEN` provided | Set `HF_TOKEN` env var, recreate `hf-token` Secret |
+| `GatedRepoError: 403` after providing token | Token valid but account not on access list | Visit the model card on HF and click "Request Access". NVIDIA-gated models require accepting the NVIDIA Open Model License. |
+| Pod stuck in `ContainerCreating` for 5+ min | Image pull (12 GB vendor / 23 GB AWS DLC) | Normal on first deploy. Check `kubectl describe pod` for "Pulling image". |
+| Pod `Running` but `/health` returns 404 | vLLM still loading model + compiling CUDA graphs | Wait. First launch takes 3-8 min on 7B-class models. Adding `--enforce-eager` skips CUDA graphs (faster startup, slower inference). |
+| `OutOfMemoryError` during CUDA graph capture | `--gpu-memory-utilization` too high or `--max-model-len` too long | Drop `--gpu-memory-utilization` to `0.85` and `--max-model-len` to `8192`. |
+| Flash-Attention `headdim not multiple of 32` error (Reason2 / Qwen3-VL only) | vLLM internal fork of FA rejects Qwen3-VL ViT head dims | Do NOT set `VLLM_ATTENTION_BACKEND=FLASH_ATTN`. Let vLLM auto-pick. Issue [#27562](https://github.com/vllm-project/vllm/issues/27562) closed Apr 2026; v0.21.0+ is fixed. |
+| Reason2 returns `<think>` text inline in `content` (not separate `reasoning_content`) | Missing `--reasoning-parser qwen3` | Add `--reasoning-parser qwen3` to vLLM args. Required for Reason2; not applicable to Reason1. |
+| Throughput low / TTFT high | `--enforce-eager` skips CUDA graphs | Remove `--enforce-eager` to enable CUDA graph capture. Adds ~2 min to startup but improves throughput by ~30%. |
+| `400 The decoder prompt contains a(n) video item with X embedding tokens, which exceeds the pre-allocated encoder cache size` | Default encoder cache too small for the input video (16384 for Reason1, ~5000 for Reason2-8B) | On the `kubernetes/` path, raise `--mm-processor-kwargs '{"max_pixels":...,"fps":1.0}'` until cache > video tokens. On `hyperpod-eks/`, use a shorter clip (≤5 s @ 480p) or switch to `kubernetes/`. |
+| `RuntimeError: Engine core initialization failed` after model load on `hyperpod-eks/` with Reason1 | `--reasoning-parser qwen3` enabled but Reason1 uses Qwen2.5-VL backbone (parser is Qwen3-only) | Comment out `--reasoning-parser qwen3` and `SM_VLLM_REASONING_PARSER=qwen3` in `hyperpod-eks/endpoint.yaml`. Re-enable only for Reason2. |
+| `kubectl get secret hf-token -o jsonpath='{.data.token}' \| base64 -d` returns `REPLACE_WITH_HF_TOKEN` | Applied `hf-token-secret.yaml.example` directly without replacing the literal placeholder (it is not an envsubst template) | Use `kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN` per Quick Start. The `.example` YAML is reference-only. |
+| Video request 400 on `hyperpod-eks/` even with short clip | DLC v0.17.0 does not surface `--mm-processor-kwargs` or `--limit-mm-per-prompt` to the Inference Operator | Use the `kubernetes/` path for video workloads. The DLC ships with a fixed encoder-cache budget; expanding it requires a custom DLC image or a newer DLC tag when available. |
+
+## References
+
+- NVIDIA Cosmos: https://www.nvidia.com/en-us/ai/cosmos/
+- Cosmos Reason1-7B model card: https://huggingface.co/nvidia/Cosmos-Reason1-7B
+- Cosmos Reason2-8B model card: https://huggingface.co/nvidia/Cosmos-Reason2-8B
+- Cosmos Reason2 repo (NVIDIA): https://github.com/nvidia-cosmos/cosmos-reason2
+- vLLM: https://github.com/vllm-project/vllm
+- vLLM Qwen3-VL recipe: https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-VL.html
+- AWS vLLM DLC repo: https://github.com/aws/deep-learning-containers
+- HyperPod Inference Operator setup blog: https://aws.amazon.com/blogs/architecture/unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod/
+- HyperPod Inference Operator best practices: https://aws.amazon.com/blogs/machine-learning/best-practices-to-run-inference-on-amazon-sagemaker-hyperpod/
diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/env_vars.example b/3.test_cases/pytorch/vllm/cosmos-reason/env_vars.example
new file mode 100644
index 000000000..62ebc16d3
--- /dev/null
+++ b/3.test_cases/pytorch/vllm/cosmos-reason/env_vars.example
@@ -0,0 +1,63 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+#
+# Copy this file to env_vars, edit it, then run: source env_vars
+# All variables here are consumed by the kubernetes/ and hyperpod-eks/ manifests
+# (rendered with `envsubst`) and by the examples/ scripts.
+
+# ---- AWS / cluster context ----
+export AWS_REGION="us-west-2"
+export AWS_ACCOUNT_ID="123456789012"
+export NAMESPACE="default"
+
+# ---- Model selection ----
+# Default: Cosmos-Reason1-7B (Qwen2.5-VL backbone). Fits on A10G 24 GB / L4 24 GB / L40S 48 GB.
+# Alternates:
+#   nvidia/Cosmos-Reason2-2B  (Qwen3-VL, 2B params, also fits 24 GB GPUs)
+#   nvidia/Cosmos-Reason2-8B  (Qwen3-VL, 8B params, requires ≥32 GB GPU)
+export MODEL_ID="nvidia/Cosmos-Reason1-7B"
+
+# Cosmos Reason models are gated on Hugging Face — accept terms on the model card first,
+# then create a token at https://huggingface.co/settings/tokens with read access.
+# DO NOT commit this value. Pass it via your shell or the CI secret store.
+export HF_TOKEN=""
+
+# ---- vLLM container ----
+# kubernetes/ (vanilla EKS) — upstream image
+export VLLM_IMAGE_VANILLA="vllm/vllm-openai:v0.21.0"
+
+# hyperpod-eks/ (Inference Operator) — AWS-managed vLLM DLC
+# Tags: vllm:0.17-gpu-py312 (vLLM 0.17.0)  |  vllm:server-sagemaker-cuda-v1 (vLLM 0.19.1)
+export VLLM_IMAGE_AWS_DLC="763104351884.dkr.ecr.${AWS_REGION}.amazonaws.com/vllm:0.17-gpu-py312"
+
+# HyperPod path only: S3 URI for TLS certificates. Find it in the HyperPod console, `Inference` tab, under `S3 bucket for TLS certificates`.
+export TLS_CERT_S3_URI="s3://hyperpod-tls-<id>/certs"
+
+# ---- Hardware sizing ----
+# Validated combinations (model | GPU | TP | max-model-len):
+#   Cosmos-Reason1-7B  | A10G 24G   | TP=1 | 24576 (reduce to 8192 if OOM during CUDA graph capture)
+#   Cosmos-Reason1-7B  | L4  24G    | TP=1 | 24576 (reduce to 8192 if OOM during CUDA graph capture)
+#   Cosmos-Reason1-7B  | L40S 48G   | TP=1 | 32768
+#   Cosmos-Reason2-2B  | A10G/L4    | TP=1 | 16384
+#   Cosmos-Reason2-8B  | L40S 48G   | TP=1 | 16384
+#   Cosmos-Reason2-8B  | g6.12xl    | TP=4 | 16384  (4× L4, PCIe-only — slower than NVLink)
+#   Cosmos-Reason2-8B  | H100 80G   | TP=1 | 32768
+export INSTANCE_TYPE="g5.8xlarge"
+export HYPERPOD_INSTANCE_TYPE="ml.${INSTANCE_TYPE}"
+export TENSOR_PARALLEL_SIZE="1"
+export MAX_MODEL_LEN="24576"
+export GPU_MEMORY_UTILIZATION="0.92"
+
+# ---- vLLM serving args ----
+# DTYPE: bfloat16 is the only NVIDIA-tested precision for Cosmos Reason
+export DTYPE="bfloat16"
+
+# ---- Reason2 (Qwen3-VL) vLLM args ----
+# If deploying Cosmos-Reason2-*, you must manually edit the deployment manifest:
+#   kubernetes/deployment.yaml: uncomment the --reasoning-parser and --media-io-kwargs lines
+#   hyperpod-eks/endpoint.yaml: uncomment the --reasoning-parser and SM_VLLM_REASONING_PARSER lines
+# See the Troubleshooting section in README.md for details.
+
+# ---- Endpoint defaults ----
+export ENDPOINT_NAME="cosmos-reason"
+export INVOCATION_PORT="8000"
diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/examples/.gitignore b/3.test_cases/pytorch/vllm/cosmos-reason/examples/.gitignore
new file mode 100644
index 000000000..2980dad9d
--- /dev/null
+++ b/3.test_cases/pytorch/vllm/cosmos-reason/examples/.gitignore
@@ -0,0 +1,8 @@
+# Downloaded sample media (fetched by download_samples.sh)
+sample.jpg
+sample_meteor.webm
+*.webm
+*.mp4
+
+# JSONL output from auto_label.py
+*.jsonl
diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/examples/README.md b/3.test_cases/pytorch/vllm/cosmos-reason/examples/README.md
new file mode 100644
index 000000000..859de6291
--- /dev/null
+++ b/3.test_cases/pytorch/vllm/cosmos-reason/examples/README.md
@@ -0,0 +1,58 @@
+# Cosmos Reason — Client Examples
+
+Reference Python clients exercising three Cosmos Reason use cases against an OpenAI-compatible
+vLLM endpoint.
+
+| Script | Use case | Latency target |
+|--------|----------|---------------|
+| `image_vqa.py` | Single-image visual Q&A | < 1 s for short reply |
+| `video_qa.py` | Short video clip Q&A | 5-15 s |
+| `auto_label.py` | SDG critic loop — `<think>` reasoning + structured `<answer>` JSON | 10-30 s |
+
+## Setup
+
+```bash
+pip install requests urllib3
+
+# Download sample media (image + video)
+./download_samples.sh
+
+# If the endpoint is only reachable in-cluster, port-forward first:
+kubectl port-forward svc/cosmos-reason 8000:8000 &
+
+# OR set the operator-managed endpoint URL:
+export ENDPOINT="https://cosmos-reason-<id>.elb.<region>.amazonaws.com"
+```
+
+By default all scripts hit `http://localhost:8000`. Override with `--endpoint` or
+`$ENDPOINT`. Use `--insecure` if the endpoint uses a self-signed TLS certificate
+(e.g., operator-managed ALB).
+
+## Examples
+
+```bash
+# Single image
+python3 image_vqa.py --image sample.jpg \
+  --prompt "What is the safety risk in this scene?"
+
+# Short video clip
+python3 video_qa.py --video sample_meteor.webm \
+  --prompt "Describe what is happening in this video."
+
+# Batch SDG auto-labeling (with retry on transient errors)
+python3 auto_label.py --image-dir . --output labels.jsonl --limit 1
+
+# With self-signed cert (operator-managed ALB)
+python3 image_vqa.py --endpoint https://cosmos-reason.elb.us-west-2.amazonaws.com \
+  --image sample.jpg --insecure
+```
+
+## Notes
+
+- Cosmos-Reason1 (Qwen2.5-VL) emits `<think>...</think><answer>...</answer>` inline
+  in the `content` field. The scripts here parse those tags.
+- Cosmos-Reason2 (Qwen3-VL) with `--reasoning-parser qwen3` separates `<think>` into
+  the response's `reasoning_content` field. The scripts handle both formats.
+- The model name defaults to `$MODEL_ID`, falling back to `nvidia/Cosmos-Reason1-7B`; override with `--model`.
+- `auto_label.py` supports `--max-retries N` (default 3) for transient HTTP errors
+  (429, 502, 503, 504) with exponential backoff.
diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/examples/auto_label.py b/3.test_cases/pytorch/vllm/cosmos-reason/examples/auto_label.py
new file mode 100755
index 000000000..3e93d6f84
--- /dev/null
+++ b/3.test_cases/pytorch/vllm/cosmos-reason/examples/auto_label.py
@@ -0,0 +1,187 @@
+#!/usr/bin/env python3
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+"""
+auto_label.py — Synthetic Data Generation (SDG) auto-labeling using Cosmos Reason.
+
+Pattern: AV training-data captioning, as adopted by Uber
+(https://blogs.nvidia.com/blog/nemotron-cosmos-reasoning-enterprise-physical-ai/).
+Each input image gets a structured JSON label plus the model's chain-of-thought
+reasoning trace. Useful for filtering implausible Cosmos-Predict outputs in an
+SDG critic loop, or for bootstrapping training labels.
+
+Output: one JSON object per line (JSONL).
+
+Example:
+    python3 auto_label.py --image-dir ./scenes/ --output labels.jsonl
+    python3 auto_label.py --image-dir ./scenes/ --schema custom_schema.json
+"""
+
+import argparse
+import json
+import os
+import re
+import sys
+import time
+from pathlib import Path
+from typing import Optional
+
+import requests
+import urllib3
+from requests.adapters import HTTPAdapter
+from urllib3.util.retry import Retry
+
+from image_vqa import encode_image, parse_reasoning_response  # reuse helpers
+
+
+DEFAULT_SCHEMA = {
+    "scene": "string — short description",
+    "objects": "list[string] — primary visible objects",
+    "hazards": "list[string] — identified safety concerns",
+    "weather": "string — clear / rain / snow / fog / cloudy / unknown",
+    "time_of_day": "string — dawn / day / dusk / night / unknown",
+}
+
+
+def make_session(max_retries: int) -> requests.Session:
+    """Build a requests.Session with retry logic for transient HTTP errors."""
+    session = requests.Session()
+    retry = Retry(
+        total=max_retries,
+        backoff_factor=1.0,
+        status_forcelist=[429, 502, 503, 504],
+        allowed_methods=["POST"],
+    )
+    session.mount("http://", HTTPAdapter(max_retries=retry))
+    session.mount("https://", HTTPAdapter(max_retries=retry))
+    return session
+
+
+def extract_json_from_answer(answer: str) -> Optional[dict]:
+    """Try hard to pull a JSON object out of the model's <answer> block."""
+    if not answer:
+        return None
+    # JSON inside ```json ... ``` fence
+    fence = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", answer, re.DOTALL)
+    if fence:
+        try:
+            return json.loads(fence.group(1))
+        except json.JSONDecodeError:
+            pass
+    # Bare JSON object
+    bare = re.search(r"\{.*\}", answer, re.DOTALL)
+    if bare:
+        try:
+            return json.loads(bare.group(0))
+        except json.JSONDecodeError:
+            pass
+    return None
+
+
+def label_image(image_path: Path, endpoint: str, model: str, schema: dict,
+                max_tokens: int, session: requests.Session, verify_tls: bool) -> dict:
+    image_url = encode_image(str(image_path))
+
+    system = (
+        "You are auto-labeling driving scenes for AV training data. "
+        "Output your reasoning in <think>...</think>, then output a JSON label "
+        f"in <answer>...</answer> matching this schema: {json.dumps(schema)}"
+    )
+
+    payload = {
+        "model": model,
+        "messages": [
+            {"role": "system", "content": system},
+            {"role": "user", "content": [
+                {"type": "image_url", "image_url": {"url": image_url}},
+                {"type": "text", "text": "Label this scene."},
+            ]},
+        ],
+        "max_tokens": max_tokens,
+        "temperature": 0.4,
+    }
+
+    start = time.monotonic()
+    r = session.post(f"{endpoint}/v1/chat/completions",
+                     headers={"Content-Type": "application/json"},
+                     json=payload,
+                     verify=verify_tls,
+                     timeout=300)
+    elapsed_ms = int((time.monotonic() - start) * 1000)
+    r.raise_for_status()
+    data = r.json()
+
+    msg = data["choices"][0]["message"]
+    reasoning, answer = parse_reasoning_response(msg)
+    label = extract_json_from_answer(answer)
+
+    return {
+        "image": str(image_path),
+        "elapsed_ms": elapsed_ms,
+        "completion_tokens": data["usage"]["completion_tokens"],
+        "label": label,
+        "reasoning": reasoning,
+        "raw_answer": answer if not label else None,
+    }
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--endpoint", default=os.environ.get("ENDPOINT", "http://localhost:8000"))
+    parser.add_argument("--model", default=os.environ.get("MODEL_ID", "nvidia/Cosmos-Reason1-7B"))
+    parser.add_argument("--image-dir", required=True, help="Directory containing images to label")
+    parser.add_argument("--output", default="labels.jsonl", help="JSONL output path")
+    parser.add_argument("--schema", help="Path to a JSON file with the label schema (overrides default)")
+    parser.add_argument("--max-tokens", type=int, default=800)
+    parser.add_argument("--limit", type=int, default=0,
+                        help="Process at most N images (0 = unlimited)")
+    parser.add_argument("--max-retries", type=int, default=3,
+                        help="Max retries per image on transient HTTP errors (429/502/503/504)")
+    parser.add_argument("--insecure", action="store_true",
+                        help="Disable TLS certificate verification (for self-signed certs)")
+    args = parser.parse_args()
+
+    verify_tls = not args.insecure
+    if args.insecure:
+        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
+
+    session = make_session(args.max_retries)
+
+    schema = DEFAULT_SCHEMA
+    if args.schema:
+        with open(args.schema) as f:
+            schema = json.load(f)
+
+    image_dir = Path(args.image_dir)
+    images = sorted([p for p in image_dir.iterdir()
+                     if p.suffix.lower() in {".jpg", ".jpeg", ".png", ".webp"}])
+    if args.limit > 0:
+        images = images[:args.limit]
+
+    if not images:
+        print(f"No images found in {image_dir}", file=sys.stderr)
+        return 1
+
+    print(f"Labeling {len(images)} images against {args.endpoint} ({args.model})...")
+
+    with open(args.output, "w") as out:
+        for i, img in enumerate(images, 1):
+            try:
+                result = label_image(img, args.endpoint, args.model, schema,
+                                     args.max_tokens, session, verify_tls)
+                out.write(json.dumps(result) + "\n")
+                out.flush()
+                ok = "OK" if result["label"] else "PARSE_FAILED"
+                print(f"  [{i}/{len(images)}] {img.name} {ok} ({result['elapsed_ms']} ms)")
+            except Exception as exc:  # noqa: BLE001
+                err = {"image": str(img), "error": str(exc)}
+                out.write(json.dumps(err) + "\n")
+                out.flush()
+                print(f"  [{i}/{len(images)}] {img.name} ERROR: {exc}", file=sys.stderr)
+
+    print(f"\nWrote {args.output}")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/examples/download_samples.sh b/3.test_cases/pytorch/vllm/cosmos-reason/examples/download_samples.sh
new file mode 100755
index 000000000..40fd0f753
--- /dev/null
+++ b/3.test_cases/pytorch/vllm/cosmos-reason/examples/download_samples.sh
@@ -0,0 +1,28 @@
+#!/usr/bin/env bash
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+#
+# Download sample media for the Cosmos Reason example clients.
+#
+# Image: Unsplash (Unsplash License — free for commercial and non-commercial use)
+# Video: Wikimedia Commons (CC BY 3.0)
+set -euo pipefail
+
+cd "$(dirname "$0")"
+
+echo "Downloading sample.jpg (urban street scene from Unsplash)..."
+curl -fL -o sample.jpg \
+  "https://images.unsplash.com/photo-1449824913935-59a10b8d2000?w=640"
+
+echo "Downloading sample video from Wikimedia Commons..."
+curl -fL -o sample_meteor.webm \
+  "https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file/2013_Russian_meteor_event_(Magnitogorsk).webm"
+
+echo ""
+echo "Downloaded:"
+ls -lh sample.jpg sample_meteor.webm
+echo ""
+echo "Run examples:"
+echo "  python3 image_vqa.py --image sample.jpg"
+echo "  python3 video_qa.py --video sample_meteor.webm"
+echo "  python3 auto_label.py --image-dir . --limit 1"
diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/examples/image_vqa.py b/3.test_cases/pytorch/vllm/cosmos-reason/examples/image_vqa.py
new file mode 100755
index 000000000..03732e0b6
--- /dev/null
+++ b/3.test_cases/pytorch/vllm/cosmos-reason/examples/image_vqa.py
@@ -0,0 +1,119 @@
+#!/usr/bin/env python3
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+"""
+image_vqa.py — Single-image visual question answering against a Cosmos Reason vLLM endpoint.
+
+Use cases: dashcam review, content moderation, scene understanding.
+
+Example:
+    python3 image_vqa.py --image sample.jpg \
+        --prompt "What is happening in this scene? Reason about the visible cues."
+"""
+
+import argparse
+import base64
+import json
+import os
+import re
+import sys
+import time
+from pathlib import Path
+from typing import Optional, Tuple
+
+import requests
+import urllib3
+
+
+def encode_image(path: str) -> str:
+    suffix = Path(path).suffix.lstrip(".").lower() or "jpeg"
+    if suffix == "jpg":
+        suffix = "jpeg"
+    with open(path, "rb") as f:
+        b64 = base64.b64encode(f.read()).decode("ascii")
+    return f"data:image/{suffix};base64,{b64}"
+
+
+def parse_reasoning_response(message: dict) -> Tuple[Optional[str], str]:
+    """Return (reasoning_trace, answer) for both Reason1 (inline <think>) and Reason2
+    (separate reasoning_content) response shapes."""
+    reasoning = message.get("reasoning_content") or message.get("reasoning")
+    content = message.get("content") or ""
+
+    if reasoning:
+        return reasoning.strip(), content.strip()
+
+    # Reason1 path — <think>...</think> inline in content
+    think_match = re.search(r"<think>\s*(.*?)\s*</think>", content, re.DOTALL)
+    answer_match = re.search(r"<answer>\s*(.*?)\s*</answer>", content, re.DOTALL)
+    if think_match:
+        trace = think_match.group(1).strip()
+        if answer_match:
+            return trace, answer_match.group(1).strip()
+        # No <answer> tag — return the rest of content after </think>
+        rest = content[think_match.end():].strip()
+        return trace, rest
+    return None, content.strip()
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--endpoint", default=os.environ.get("ENDPOINT", "http://localhost:8000"))
+    parser.add_argument("--model", default=os.environ.get("MODEL_ID", "nvidia/Cosmos-Reason1-7B"))
+    parser.add_argument("--image", required=True, help="Path to local image file")
+    parser.add_argument("--prompt", default="What is in this image, and what is happening? Reason about visible cues.")
+    parser.add_argument("--max-tokens", type=int, default=512)
+    parser.add_argument("--temperature", type=float, default=0.6)
+    parser.add_argument("--system-prompt",
+                        default="Answer in <think>your reasoning</think><answer>your answer</answer> format.")
+    parser.add_argument("--insecure", action="store_true",
+                        help="Disable TLS certificate verification (for self-signed certs)")
+    args = parser.parse_args()
+
+    verify_tls = not args.insecure
+    if args.insecure:
+        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
+
+    image_url = encode_image(args.image)
+    payload = {
+        "model": args.model,
+        "messages": [
+            {"role": "system", "content": args.system_prompt},
+            {"role": "user", "content": [
+                {"type": "image_url", "image_url": {"url": image_url}},
+                {"type": "text", "text": args.prompt},
+            ]},
+        ],
+        "max_tokens": args.max_tokens,
+        "temperature": args.temperature,
+    }
+
+    start = time.monotonic()
+    r = requests.post(f"{args.endpoint}/v1/chat/completions",
+                      headers={"Content-Type": "application/json"},
+                      json=payload,
+                      verify=verify_tls,
+                      timeout=300)
+    elapsed_ms = int((time.monotonic() - start) * 1000)
+    r.raise_for_status()
+    data = r.json()
+
+    msg = data["choices"][0]["message"]
+    reasoning, answer = parse_reasoning_response(msg)
+
+    print(f"=== Response ({elapsed_ms} ms, {data['usage']['completion_tokens']} tokens) ===")
+    print()
+    if reasoning:
+        print("--- Reasoning ---")
+        print(reasoning)
+        print()
+    print("--- Answer ---")
+    print(answer)
+    print()
+    print("--- Usage ---")
+    print(json.dumps(data["usage"], indent=2))
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
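
The two response shapes handled by `parse_reasoning_response` can be checked without a running endpoint. The sketch below copies the parser logic from `image_vqa.py` and feeds it hand-written messages; the sample messages are assumptions made for illustration, not captured model output.

```python
import re
from typing import Optional, Tuple


def parse_reasoning_response(message: dict) -> Tuple[Optional[str], str]:
    """Same logic as image_vqa.py: return (reasoning_trace, answer)."""
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    content = message.get("content") or ""
    if reasoning:
        # Reason2 shape: the server already split reasoning from the answer.
        return reasoning.strip(), content.strip()
    # Reason1 shape: <think>...</think> (and optionally <answer>) inline in content.
    think = re.search(r"<think>\s*(.*?)\s*</think>", content, re.DOTALL)
    answer = re.search(r"<answer>\s*(.*?)\s*</answer>", content, re.DOTALL)
    if think:
        trace = think.group(1).strip()
        if answer:
            return trace, answer.group(1).strip()
        return trace, content[think.end():].strip()
    return None, content.strip()


# Hand-written sample messages (assumptions for illustration only).
inline = {"content": "<think>\nWet road, brake lights.\n</think>\n"
                     "<answer>Traffic is slowing.</answer>"}
separate = {"reasoning_content": "Wet road, brake lights.",
            "content": "Traffic is slowing."}
plain = {"content": "Traffic is slowing."}

for msg in (inline, separate, plain):
    print(parse_reasoning_response(msg))
```

The first two messages yield the same `(trace, answer)` pair, and the third yields `None` for the trace, which is why the client prints the "Reasoning" section only when a trace is present.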
diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/examples/video_qa.py b/3.test_cases/pytorch/vllm/cosmos-reason/examples/video_qa.py
new file mode 100755
index 000000000..150490e77
--- /dev/null
+++ b/3.test_cases/pytorch/vllm/cosmos-reason/examples/video_qa.py
@@ -0,0 +1,102 @@
+#!/usr/bin/env python3
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+"""
+video_qa.py — Short video clip Q&A against a Cosmos Reason vLLM endpoint.
+
+Use cases: AV scene understanding (Uber pattern), dashcam review, video moderation.
+
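+These are server-side vLLM flags, set in the deployment manifest rather than by this client.
+Cosmos-Reason2 (Qwen3-VL) is video-native and is served with
+`--media-io-kwargs '{"video":{"num_frames":-1}}'`.
+Cosmos-Reason1 (Qwen2.5-VL) is served with `--limit-mm-per-prompt`, which
+`kubernetes/deployment.yaml` sets to `'{"video":2,"image":10}'`.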
+
+Example:
+    python3 video_qa.py --video clip.mp4 \
+        --prompt "Describe the trajectory of the vehicle in this clip."
+"""
+
+import argparse
+import base64
+import json
+import os
+import sys
+import time
+from pathlib import Path
+
+import requests
+import urllib3
+
+from image_vqa import parse_reasoning_response  # reuse the parser
+
+
+def encode_video(path: str) -> str:
+    suffix = Path(path).suffix.lstrip(".").lower() or "mp4"
+    with open(path, "rb") as f:
+        b64 = base64.b64encode(f.read()).decode("ascii")
+    return f"data:video/{suffix};base64,{b64}"
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--endpoint", default=os.environ.get("ENDPOINT", "http://localhost:8000"))
+    parser.add_argument("--model", default=os.environ.get("MODEL_ID", "nvidia/Cosmos-Reason1-7B"))
+    parser.add_argument("--video", required=True, help="Path to local video file (mp4 / webm)")
+    parser.add_argument("--prompt", default="Describe what is happening in this video. Reason about the temporal cues.")
+    parser.add_argument("--max-tokens", type=int, default=800)
+    parser.add_argument("--temperature", type=float, default=0.5)
+    parser.add_argument("--system-prompt",
+                        default="Answer in <think>your reasoning</think><answer>your answer</answer> format.")
+    parser.add_argument("--insecure", action="store_true",
+                        help="Disable TLS certificate verification (for self-signed certs)")
+    args = parser.parse_args()
+
+    verify_tls = not args.insecure
+    if args.insecure:
+        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
+
+    if not os.path.exists(args.video):
+        print(f"ERROR: video not found at {args.video}", file=sys.stderr)
+        return 2
+
+    video_url = encode_video(args.video)
+    payload = {
+        "model": args.model,
+        "messages": [
+            {"role": "system", "content": args.system_prompt},
+            {"role": "user", "content": [
+                {"type": "video_url", "video_url": {"url": video_url}},
+                {"type": "text", "text": args.prompt},
+            ]},
+        ],
+        "max_tokens": args.max_tokens,
+        "temperature": args.temperature,
+    }
+
+    start = time.monotonic()
+    r = requests.post(f"{args.endpoint}/v1/chat/completions",
+                      headers={"Content-Type": "application/json"},
+                      json=payload,
+                      verify=verify_tls,
+                      timeout=300)
+    elapsed_ms = int((time.monotonic() - start) * 1000)
+    r.raise_for_status()
+    data = r.json()
+
+    msg = data["choices"][0]["message"]
+    reasoning, answer = parse_reasoning_response(msg)
+
+    print(f"=== Response ({elapsed_ms} ms, {data['usage']['completion_tokens']} tokens) ===")
+    print()
+    if reasoning:
+        print("--- Reasoning ---")
+        print(reasoning)
+        print()
+    print("--- Answer ---")
+    print(answer)
+    print()
+    print("--- Usage ---")
+    print(json.dumps(data["usage"], indent=2))
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
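
`encode_video` builds the data URL subtype straight from the file extension, so a `.mov` file becomes `video/mov` rather than the registered `video/quicktime`. Whether the server inspects the subtype is not established here. The sketch below shows one way to map extensions to registered types; `VIDEO_MIME` and `video_data_url` are hypothetical names, and the function takes the bytes as an argument so it can be tried without a file on disk.

```python
import base64
from pathlib import Path

# Small explicit map (an assumption for this sketch); unknown extensions
# fall back to the original behaviour of using the suffix as the subtype.
VIDEO_MIME = {
    "mp4": "video/mp4",
    "webm": "video/webm",
    "mov": "video/quicktime",
    "mkv": "video/x-matroska",
}


def video_data_url(path: str, data: bytes) -> str:
    """Build a base64 data URL for video bytes, keyed on the file extension."""
    suffix = Path(path).suffix.lstrip(".").lower() or "mp4"
    mime = VIDEO_MIME.get(suffix, f"video/{suffix}")
    b64 = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{b64}"


print(video_data_url("clip.MOV", b"abc"))
```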
diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/README.md b/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/README.md
new file mode 100644
index 000000000..bddb81d01
--- /dev/null
+++ b/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/README.md
@@ -0,0 +1,167 @@
+# Cosmos Reason on the SageMaker HyperPod Inference Operator
+
+`InferenceEndpointConfig` CRD reference for serving Cosmos Reason on a SageMaker HyperPod
+EKS cluster using the [HyperPod Inference Operator](https://aws.amazon.com/blogs/architecture/unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod/).
+
+This path uses the **AWS-managed vLLM Deep Learning Container** (`vllm:0.17-gpu-py312`)
+with EFA, NCCL, and security patches pre-baked. The DLC tag `0.17-gpu-py312` corresponds
+to vLLM 0.17.0; the [`../kubernetes/`](../kubernetes/) path uses upstream
+`vllm/vllm-openai:v0.21.0` directly.
+
+For plain EKS clusters without HyperPod, or HyperPod clusters without the Inference Operator, use the [`../kubernetes/`](../kubernetes/) path.
+
+## What's here
+
+| File | Purpose |
+|------|---------|
+| `endpoint.yaml` | `InferenceEndpointConfig` CRD spec |
+| `hf-token-secret.yaml.example` | Reference for HF token Secret (recommended path is `kubectl create`) |
+
+## Prerequisites
+
+- HyperPod EKS cluster with at least one GPU node
+- HyperPod Inference Operator installed. Three install paths, each with its own version scheme:
+  - **EKS Add-on (recommended)** — add-on `amazon-sagemaker-hyperpod-inference`, versioned
+    as `vX.Y.Z-eksbuild.N` (latest `v1.3.0-eksbuild.1` as of this writing). Install via the
+    HyperPod console (Add-ons → Inference Operator) or
+    `aws eks create-addon --addon-name amazon-sagemaker-hyperpod-inference`. Confirm the
+    current version for your region with
+    `aws eks describe-addon-versions --addon-name amazon-sagemaker-hyperpod-inference`.
+  - **`sagemaker-hyperpod-cli`** — CLI `v3.7.0+` with `hyp install`.
+  - **Helm chart** — `hyperpod-inference-operator` subchart `v2.1.0`, operator image `v3.1`.
+    Helm install may be deprecated in a future release in favor of the EKS Add-on.
+- The Inference Operator's prerequisite IRSA roles must be configured at install time
+  (the operator does NOT need per-endpoint IAM)
+- A TLS certificate output S3 bucket for endpoint certificate management
+  (auto-created at install time as `sagemaker-<HP_CLUSTER>-<ID>-tls-<ID>`)
+- `HF_TOKEN` with access to the model — see [parent README](../README.md#prerequisites)
+
+## Deploy
+
+```bash
+# 1. Source environment
+cd ..
+cp env_vars.example env_vars
+# Edit env_vars — set HF_TOKEN, INSTANCE_TYPE, etc.
+source env_vars
+
+# 2. Create the HF token Secret
+kubectl create secret generic hf-token \
+  --namespace=${NAMESPACE} \
+  --from-literal=token=${HF_TOKEN}
+
+# 3. Render and apply
+cd hyperpod-eks/
+envsubst < endpoint.yaml | kubectl apply -f -
+
+# 4. Watch the operator drive the deployment
+kubectl get inferenceendpointconfig cosmos-reason -n ${NAMESPACE} -w
+
+# Once status reports Ready, the endpoint URL is available:
+kubectl get inferenceendpointconfig cosmos-reason -n ${NAMESPACE} \
+  -o jsonpath='{.status.endpointUrl}'
+```
+
+First launch takes **5-10 minutes**: image pull (~23 GB AWS DLC) + HF model download +
+vLLM init. `endpoint.yaml` already raises `maxDeployTimeInSeconds` from the default `1800`
+to `3600` for headroom.
+
+The endpoint is ready to test once the SageMaker endpoint reports `InService` as its `EndpointStatus`. Check with:
+
+```bash
+aws sagemaker describe-endpoint --endpoint-name cosmos-reason --region $AWS_REGION
+```
+
+## Test
+
+There are three ways to reach the deployed model:
+
+### Option 1: Port-forward (simplest, works from any machine with `kubectl` access)
+
+```bash
+kubectl port-forward -n ${NAMESPACE} deploy/cosmos-reason 8000:8080 &
+
+# Health check
+curl -s http://localhost:8000/health
+
+# Run an example
+cd ../examples
+python3 image_vqa.py --endpoint http://localhost:8000 --image sample.jpg --model "${MODEL_ID}" --system-prompt ""
+
+# Batch auto-label a directory of images
+python3 auto_label.py --endpoint http://localhost:8000 --image-dir ./scenes/ --model "${MODEL_ID}" --output labels.jsonl
+```
+
+### Option 2: Operator-managed ALB (in-VPC by default)
+
+The operator provisions an ALB with TLS. By default the ALB is internal (VPC-only), but it
+can be configured as internet-facing. If exposing publicly, ensure you have authentication
+and access controls in place (e.g., WAF, Cognito, or mutual TLS).
+
+```bash
+ENDPOINT=$(kubectl get inferenceendpointconfig cosmos-reason -n ${NAMESPACE} \
+  -o jsonpath='{.status.endpointUrl}')
+
+# Requires VPC connectivity; -k for self-signed cert
+curl -k "${ENDPOINT}/health"
+
+cd ../examples
+# --insecure skips TLS verification, matching curl -k for the self-signed cert
+python3 image_vqa.py --endpoint "${ENDPOINT}" --image sample.jpg --model "${MODEL_ID}" --system-prompt "" --insecure
+
+# Batch auto-label
+python3 auto_label.py --endpoint "${ENDPOINT}" --image-dir ./scenes/ --model "${MODEL_ID}" --output labels.jsonl
+```
+
+> [!NOTE]
+> If `.status.endpointUrl` is empty, the operator's cert-manager integration may not have
+> completed. Verify with `kubectl get ingress -A` and `kubectl get pods -n cert-manager`.
+
+### Option 3: SageMaker Runtime API (works from anywhere, uses IAM auth)
+
+Invoke via the SageMaker runtime with AWS SigV4 signing — no VPC connectivity required.
+
+```bash
+echo '{"model":"'"${MODEL_ID}"'","messages":[{"role":"user","content":"What is happening in this scene? Reason about the visible cues."}],"max_tokens":64}' > /tmp/payload.json
+
+aws sagemaker-runtime invoke-endpoint \
+  --endpoint-name cosmos-reason \
+  --region ${AWS_REGION} \
+  --content-type application/json \
+  --body fileb:///tmp/payload.json \
+  /dev/stdout
+```
+
+For batch auto-labeling via the SageMaker Runtime, you would need to call `invoke-endpoint`
+per image with the appropriate payload. The example scripts (`image_vqa.py`, `video_qa.py`,
+`auto_label.py`) use plain HTTP requests and do not support SigV4 signing — use Option 1
+or 2 with those scripts.
+
+## Cleanup
+
+```bash
+envsubst < endpoint.yaml | kubectl delete -f -
+kubectl delete secret hf-token -n ${NAMESPACE}
+```
+
+## Operational notes
+
+- **First reference of the AWS vLLM DLC in this repo.** AWS launched a standalone vLLM
+  Deep Learning Container in late 2025 (separate from DJL-LMI). Image lives at
+  `763104351884.dkr.ecr.<region>.amazonaws.com/vllm:<tag>`. Tags:
+  - `vllm:0.17-gpu-py312` — vLLM 0.17.0
+  - `vllm:server-sagemaker-cuda-v1` — vLLM 0.19.1 (newer "server" tag with `SM_VLLM_*`
+    env-var auto-translation to CLI args)
+- **`maxDeployTimeInSeconds: 3600`** — default is 1800s (30 min) which is risky for
+  first deploys. Vendor image pull + model download + CUDA graph compile can hit 8 min.
+- **`invocationEndpoint: v1/chat/completions`** — overrides the legacy default of
+  `invocations`. Required for vLLM's OpenAI-compatible API.
+- **`modelInvocationPort.containerPort: 8080`** — matches the AWS DLC default port.
+  The upstream `vllm/vllm-openai` image uses 8000; the AWS DLC uses 8080.
+- **`tokenSecretRef`** under `huggingFaceModel` — the operator passes the secret to the
+  worker pod. Secret key MUST be `token` (not `HF_TOKEN`).
+- **No `JumpStartModel` path available** — Cosmos Reason is not in `SageMakerPublicHub`,
+  so we use the `InferenceEndpointConfig` BYO container CRD.
+- **Autoscaling** — `replicas: 1` here for simplicity. The operator has dual-layer
+  autoscaling (KEDA pod-level + Karpenter node-level) configurable via `autoScaling.*`
+  fields. See [operator docs](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment.html).
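
Option 3 in the HyperPod README notes that the example scripts do not sign requests. A minimal sketch of the same call through boto3, which handles SigV4 itself, might look like the following. `build_payload` and `invoke_chat` are hypothetical helpers, the image is assumed to be a JPEG, and the region in the usage comment is a placeholder. `invoke_endpoint` with `EndpointName`, `ContentType`, and `Body` is the standard SageMaker Runtime client call.

```python
import base64
import json


def build_payload(model_id: str, prompt: str, image_path: str = "",
                  max_tokens: int = 64) -> bytes:
    """Build an OpenAI-style chat payload, optionally with one inline image."""
    content = []
    if image_path:
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("ascii")
        # Assumption: the image is a JPEG.
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    content.append({"type": "text", "text": prompt})
    body = {"model": model_id,
            "messages": [{"role": "user", "content": content}],
            "max_tokens": max_tokens}
    return json.dumps(body).encode("utf-8")


def invoke_chat(client, endpoint_name: str, payload: bytes) -> dict:
    """Send the payload through a SageMaker Runtime client and parse the JSON reply."""
    resp = client.invoke_endpoint(EndpointName=endpoint_name,
                                  ContentType="application/json",
                                  Body=payload)
    return json.loads(resp["Body"].read())


# Usage (needs boto3 and AWS credentials; not executed here):
#   import boto3
#   client = boto3.client("sagemaker-runtime", region_name="us-west-2")  # placeholder region
#   payload = build_payload("nvidia/Cosmos-Reason1-7B", "Describe the scene.", "sample.jpg")
#   data = invoke_chat(client, "cosmos-reason", payload)
#   print(data["choices"][0]["message"]["content"])
```

Passing the client in as an argument keeps the helper testable with a stub and leaves credential and region handling to the caller.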
diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/endpoint.yaml b/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/endpoint.yaml
new file mode 100644
index 000000000..4f64cf0cc
--- /dev/null
+++ b/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/endpoint.yaml
@@ -0,0 +1,82 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+---
+apiVersion: inference.sagemaker.aws.amazon.com/v1
+kind: InferenceEndpointConfig
+metadata:
+  name: cosmos-reason
+  namespace: ${NAMESPACE}
+spec:
+  endpointName: cosmos-reason
+  modelName: ${ENDPOINT_NAME}
+  instanceType: ${HYPERPOD_INSTANCE_TYPE}
+  invocationEndpoint: v1/chat/completions
+  # First deploy can take 5-10 min (23 GB AWS DLC pull + HF download + vLLM init).
+  # Default maxDeployTimeInSeconds is 1800; bump for headroom.
+  maxDeployTimeInSeconds: 3600
+  modelSourceConfig:
+    modelSourceType: huggingface
+    huggingFaceModel:
+      modelId: ${MODEL_ID}
+      tokenSecretRef:
+        name: hf-token
+        key: token
+  worker:
+    image: ${VLLM_IMAGE_AWS_DLC}
+    modelVolumeMount:
+      name: model-store
+      mountPath: /tmp/model
+    modelInvocationPort:
+      containerPort: 8080
+      name: http
+    resources:
+      requests:
+        cpu: "8"
+        memory: "64Gi"
+        nvidia.com/gpu: "${TENSOR_PARALLEL_SIZE}"
+      limits:
+        nvidia.com/gpu: "${TENSOR_PARALLEL_SIZE}"
+    args:
+      - "--model"
+      - "${MODEL_ID}"
+      - "--max-model-len"
+      - "${MAX_MODEL_LEN}"
+      - "--tensor-parallel-size"
+      - "${TENSOR_PARALLEL_SIZE}"
+      - "--gpu-memory-utilization"
+      - "${GPU_MEMORY_UTILIZATION}"
+      - "--dtype"
+      - "${DTYPE}"
+      # For Cosmos-Reason2 (Qwen3-VL), uncomment the next two args:
+      # - "--reasoning-parser"
+      # - "qwen3"
+      # For Reason2 video reasoning, uncomment:
+      # - "--media-io-kwargs"
+      # - '{"video":{"num_frames":-1}}'
+    environmentVariables:
+      - name: OMP_NUM_THREADS
+        value: "1"
+      - name: VLLM_USAGE_SOURCE
+        value: "awsome-distributed-ai-cosmos-reason"
+      # AWS DLC reads SM_VLLM_* env vars — these override the entrypoint defaults.
+      # CLI args above may be ignored by the DLC entrypoint; env vars are the
+      # reliable path.
+      - name: SM_VLLM_MAX_MODEL_LEN
+        value: "${MAX_MODEL_LEN}"
+      - name: SM_VLLM_TENSOR_PARALLEL_SIZE
+        value: "${TENSOR_PARALLEL_SIZE}"
+      - name: SM_VLLM_GPU_MEMORY_UTILIZATION
+        value: "${GPU_MEMORY_UTILIZATION}"
+      - name: SM_VLLM_DTYPE
+        value: "${DTYPE}"
+      # For Cosmos-Reason2 (Qwen3-VL), uncomment:
+      # - name: SM_VLLM_REASONING_PARSER
+      #   value: "qwen3"
+  # NOTE: nodeAffinity cannot be used when instanceType is set — the operator
+  # handles scheduling. Deep-health-check affinity is applied automatically.
+  # TLS for the operator-managed ALB. Bucket is created at operator install time.
+  # Override TLS_CERT_S3_URI in env_vars.
+  tlsConfig:
+    tlsCertificateOutputS3Uri: ${TLS_CERT_S3_URI}
+  loadBalancer:
+    healthCheckPath: /health
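
`envsubst` replaces an unset variable with an empty string, so a missing entry in `env_vars` produces a manifest that still applies but carries blank fields. A small pre-flight check can list the placeholders that have no value. This is a sketch under assumptions: `missing_vars` is a hypothetical helper, it matches only the `${VAR}` form used in these manifests, it also scans commented lines, and `SAMPLE` is an excerpt of `endpoint.yaml`.

```python
import os
import re

# Matches ${NAME} placeholders as written in the manifests.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

# Excerpt of endpoint.yaml, inlined so the sketch is self-contained.
SAMPLE = """\
metadata:
  name: cosmos-reason
  namespace: ${NAMESPACE}
spec:
  instanceType: ${HYPERPOD_INSTANCE_TYPE}
  modelSourceConfig:
    huggingFaceModel:
      modelId: ${MODEL_ID}
"""


def missing_vars(manifest_text: str, env: dict) -> list:
    """Return placeholder names that are unset or empty in env, sorted."""
    names = sorted(set(PLACEHOLDER.findall(manifest_text)))
    return [name for name in names if not env.get(name)]


# Report against the current shell environment.
print(missing_vars(SAMPLE, dict(os.environ)))
```

Running the check after `source env_vars` and before `envsubst < endpoint.yaml | kubectl apply -f -` surfaces a forgotten variable before the operator sees the manifest.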
diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/hf-token-secret.yaml.example b/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/hf-token-secret.yaml.example
new file mode 100644
index 000000000..0995538d8
--- /dev/null
+++ b/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/hf-token-secret.yaml.example
@@ -0,0 +1,19 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+#
+# Reference manifest. Recommended approach is `kubectl create secret`:
+#
+#   kubectl create secret generic hf-token \
+#     --namespace=${NAMESPACE} \
+#     --from-literal=token=${HF_TOKEN}
+#
+# Use this manifest only with sealed-secret / external-secret-operator.
+---
+apiVersion: v1
+kind: Secret
+metadata:
+  name: hf-token
+  namespace: ${NAMESPACE}
+type: Opaque
+stringData:
+  token: REPLACE_WITH_HF_TOKEN
diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/README.md b/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/README.md
new file mode 100644
index 000000000..3bce73e6e
--- /dev/null
+++ b/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/README.md
@@ -0,0 +1,99 @@
+# Cosmos Reason on Vanilla EKS
+
+Plain Kubernetes `Deployment` + `Service` for any EKS cluster with GPU nodes.
+Uses the upstream `vllm/vllm-openai` container image directly.
+
+If you have HyperPod EKS and want managed scale-to-zero, KV cache, and intelligent
+routing, use the [`../hyperpod-eks/`](../hyperpod-eks/) path instead.
+
+## What's here
+
+| File | Purpose |
+|------|---------|
+| `deployment.yaml` | `Deployment` (pod spec) + `Service` (ClusterIP) |
+| `hf-token-secret.yaml.example` | Reference manifest — recommended path is `kubectl create secret` rather than apply (so the token never lands in version control) |
+
+## Prerequisites
+
+- EKS cluster with at least one GPU node
+- NVIDIA device plugin DaemonSet installed (Karpenter provisions GPU nodes but does not install the plugin itself)
+- `kubectl` configured with cluster context
+- `envsubst` (provided by `gettext`)
+- `HF_TOKEN` with access to the model — see [parent README](../README.md#prerequisites)
+
+## Deploy
+
+```bash
+# 1. Source environment
+cd ..
+cp env_vars.example env_vars
+# Edit env_vars — set HF_TOKEN at minimum
+source env_vars
+
+# 2. Create the HF token Secret
+kubectl create secret generic hf-token \
+  --namespace=${NAMESPACE} \
+  --from-literal=token=${HF_TOKEN}
+
+# 3. Render and apply the manifests
+cd kubernetes/
+envsubst < deployment.yaml | kubectl apply -f -
+
+# 4. Wait for the Pod to become Ready
+kubectl wait --for=condition=Ready pod \
+  -l app=cosmos-reason \
+  --namespace=${NAMESPACE} \
+  --timeout=10m
+
+# 5. Verify
+kubectl logs -l app=cosmos-reason -n ${NAMESPACE} --tail=20
+```
+
+First launch takes **3-8 minutes** (image pull + HF model download + vLLM init).
+
+## Test
+
+```bash
+# Port-forward to localhost
+kubectl port-forward -n ${NAMESPACE} svc/cosmos-reason 8000:8000 &
+
+# Hit /health
+curl -s http://localhost:8000/health
+
+# List the loaded model
+curl -s http://localhost:8000/v1/models | jq
+
+# Try an example
+# This asks the model "What is in this image, and what is happening? Reason about visible cues."
+cd ../examples
+python3 image_vqa.py --image sample.jpg --model "${MODEL_ID}"
+```
+
+You can also customize the question:
+```bash
+python3 image_vqa.py --image sample.jpg --prompt "How many vehicles are visible and what types are they?"
+```
+
+Test the auto-labeling use case:
+```bash
+python3 auto_label.py --image-dir ./scenes/ --output labels.jsonl --limit 5
+```
+
+## Cleanup
+
+```bash
+envsubst < deployment.yaml | kubectl delete -f -
+kubectl delete secret hf-token -n ${NAMESPACE}
+```
+
+## Notes
+
+- The `Service` is `ClusterIP`. To expose externally, add an `Ingress` (ALB recommended)
+  or change to `LoadBalancer`.
+- **Autoscaling:** No HPA is included by default. Inference is GPU-bound and CPU-based
+  scaling is not a useful proxy for queue depth. For production, configure
+  [KEDA](https://keda.sh/) on the vLLM Prometheus metric `vllm:num_requests_running`,
+  or pair with Karpenter for node-level scale-out.
+- `/dev/shm` is mounted via `emptyDir { medium: Memory }` (per the
+  [`awsome-distributed-ai`](https://github.com/awslabs/awsome-distributed-ai) review
+  conventions — never `hostPath: /dev/shm`).
diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/deployment.yaml b/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/deployment.yaml
new file mode 100644
index 000000000..693730a52
--- /dev/null
+++ b/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/deployment.yaml
@@ -0,0 +1,147 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: cosmos-reason
+  namespace: ${NAMESPACE}
+  labels:
+    app: cosmos-reason
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: cosmos-reason
+  strategy:
+    type: Recreate
+  template:
+    metadata:
+      labels:
+        app: cosmos-reason
+    spec:
+      restartPolicy: Always
+      affinity:
+        nodeAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            nodeSelectorTerms:
+              - matchExpressions:
+                  - key: node.kubernetes.io/instance-type
+                    operator: In
+                    values:
+                      - ${INSTANCE_TYPE}
+                      - ${HYPERPOD_INSTANCE_TYPE}
+          # Optional: prefer nodes that passed HyperPod deep-health checks.
+          # Harmless on plain EKS (label simply won't match).
+          preferredDuringSchedulingIgnoredDuringExecution:
+            - weight: 100
+              preference:
+                matchExpressions:
+                  - key: sagemaker.amazonaws.com/deep-health-check-status
+                    operator: In
+                    values:
+                      - Passed
+      tolerations:
+        # HyperPod nodes carry this taint; harmless on plain EKS.
+        - key: sagemaker.amazonaws.com/node-health-status
+          operator: Equal
+          value: Schedulable
+          effect: NoSchedule
+      containers:
+        - name: vllm
+          image: ${VLLM_IMAGE_VANILLA}
+          imagePullPolicy: IfNotPresent
+          args:
+            - "--model"
+            - "${MODEL_ID}"
+            - "--host"
+            - "0.0.0.0"
+            - "--port"
+            - "${INVOCATION_PORT}"
+            - "--max-model-len"
+            - "${MAX_MODEL_LEN}"
+            - "--tensor-parallel-size"
+            - "${TENSOR_PARALLEL_SIZE}"
+            - "--gpu-memory-utilization"
+            - "${GPU_MEMORY_UTILIZATION}"
+            - "--dtype"
+            - "${DTYPE}"
+            - "--limit-mm-per-prompt"
+            - '{"video":2,"image":10}'
+            - "--mm-processor-kwargs"
+            - '{"max_pixels":20000000,"fps":1.0}'
+            # For Cosmos-Reason2 (Qwen3-VL), uncomment the following two args:
+            # - "--reasoning-parser"
+            # - "qwen3"
+            # For Reason2 video reasoning, uncomment:
+            # - "--media-io-kwargs"
+            # - '{"video":{"num_frames":-1}}'
+          env:
+            - name: HF_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: hf-token
+                  key: token
+            - name: HUGGING_FACE_HUB_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: hf-token
+                  key: token
+            - name: VLLM_USAGE_SOURCE
+              value: "awsome-distributed-ai-cosmos-reason"
+          ports:
+            - containerPort: ${INVOCATION_PORT}
+              name: http
+              protocol: TCP
+          resources:
+            requests:
+              nvidia.com/gpu: ${TENSOR_PARALLEL_SIZE}
+              cpu: "4"
+              memory: "32Gi"
+            limits:
+              nvidia.com/gpu: ${TENSOR_PARALLEL_SIZE}
+          volumeMounts:
+            - name: shmem
+              mountPath: /dev/shm
+            - name: hf-cache
+              mountPath: /root/.cache/huggingface
+          readinessProbe:
+            httpGet:
+              path: /health
+              port: ${INVOCATION_PORT}
+            initialDelaySeconds: 240
+            periodSeconds: 15
+            timeoutSeconds: 5
+            failureThreshold: 80
+          livenessProbe:
+            httpGet:
+              path: /health
+              port: ${INVOCATION_PORT}
+            initialDelaySeconds: 900
+            periodSeconds: 60
+            timeoutSeconds: 10
+      volumes:
+        - name: shmem
+          emptyDir:
+            medium: Memory
+            sizeLimit: 8Gi
+        - name: hf-cache
+          emptyDir:
+            sizeLimit: 50Gi
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: cosmos-reason
+  namespace: ${NAMESPACE}
+  labels:
+    app: cosmos-reason
+spec:
+  type: ClusterIP
+  selector:
+    app: cosmos-reason
+  ports:
+    - name: http
+      port: ${INVOCATION_PORT}
+      targetPort: ${INVOCATION_PORT}
+      protocol: TCP
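
The probe settings in `deployment.yaml` imply a startup budget that can be worked out by hand. The arithmetic below is rough: it ignores probe timeouts and kubelet jitter, and it assumes the Kubernetes default `failureThreshold` of 3 for the liveness probe, since the manifest does not set one. `last_failure_time` is a hypothetical helper.

```python
def last_failure_time(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Seconds after container start at which the Nth consecutive failed
    probe fires, assuming every probe fails (approximate)."""
    return initial_delay + (failure_threshold - 1) * period


# readinessProbe: initialDelaySeconds 240, periodSeconds 15, failureThreshold 80
readiness_last = last_failure_time(240, 15, 80)   # 240 + 79 * 15 = 1425

# livenessProbe: initialDelaySeconds 900, periodSeconds 60, default failureThreshold 3
liveness_last = last_failure_time(900, 60, 3)     # 900 + 2 * 60 = 1020

print(f"readiness window ends near {readiness_last} s")
print(f"liveness restarts the container near {liveness_last} s")
```

Under these assumptions the liveness probe, not the readiness probe, bounds a slow first start: a container that is still loading at roughly 17 minutes is restarted, while a failing readiness probe only keeps the Pod out of the Service endpoints.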
diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/hf-token-secret.yaml.example b/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/hf-token-secret.yaml.example
new file mode 100644
index 000000000..cb3427006
--- /dev/null
+++ b/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/hf-token-secret.yaml.example
@@ -0,0 +1,23 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+#
+# This is a REFERENCE manifest. The recommended way to create the secret is:
+#
+#   kubectl create secret generic hf-token \
+#     --namespace=${NAMESPACE} \
+#     --from-literal=token=${HF_TOKEN}
+#
+# Using `kubectl create` keeps the token out of any version-controlled file. Apply this
+# manifest only if your operational model requires GitOps-managed secrets — and in that
+# case, replace the placeholder below with a sealed-secret / external-secret reference.
+---
+apiVersion: v1
+kind: Secret
+metadata:
+  name: hf-token
+  namespace: ${NAMESPACE}
+type: Opaque
+stringData:
+  # DO NOT commit this with a real token. Use a sealed-secret / external-secret-operator
+  # reference, or create the secret imperatively with `kubectl create secret`.
+  token: REPLACE_WITH_HF_TOKEN