diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/README.md b/3.test_cases/pytorch/vllm/cosmos-reason/README.md new file mode 100644 index 000000000..629e33db8 --- /dev/null +++ b/3.test_cases/pytorch/vllm/cosmos-reason/README.md @@ -0,0 +1,250 @@ +# NVIDIA Cosmos Reason — vLLM Inference on Amazon EKS / HyperPod + +Online inference reference for NVIDIA's [Cosmos Reason](https://www.nvidia.com/en-us/ai/cosmos/) +physical-reasoning Vision-Language Model, served by [vLLM](https://github.com/vllm-project/vllm) +on Amazon EKS or SageMaker HyperPod EKS. + +Cosmos Reason is the physical-reasoning member of NVIDIA's Cosmos World Foundation Model +platform. It generates chain-of-thought reasoning traces about safety, causality, object +interactions, and spatiotemporal dynamics in videos and images — the building block for +auto-labeling Synthetic Data Generation (SDG) pipelines +([as adopted by Uber per NVIDIA SIGGRAPH 2025](https://blogs.nvidia.com/blog/nemotron-cosmos-reasoning-enterprise-physical-ai/)), +Reasoning VLA models, and video Q&A workloads. 
+ +## Production Use Cases + +| Use Case | Pattern | Example | +|----------|---------|---------| +| AV training data labeling | SDG critic loop — auto-label scenes with structured metadata (objects, hazards, weather) | [AV data captioning (NVIDIA SIGGRAPH 2025)](https://blogs.nvidia.com/blog/nemotron-cosmos-reasoning-enterprise-physical-ai/) | +| Video scene understanding | Short clip Q&A with chain-of-thought physical reasoning | Dashcam review, warehouse safety monitoring | +| Image VQA with physical reasoning | Single-image spatial/causal analysis | Content moderation, quality inspection | +| SDG critic / verifier | Judge whether Cosmos Predict outputs are physically plausible before entering training sets | Synthetic data filtering | +| RL reward model | Score trajectories for physical coherence as a reward signal in RLHF or model-based RL | Robotics policy training, AV planning | +| Offline RL annotation | Label trajectory data with reasoning traces for offline RL training | Decision Transformer reward labels | + +## Two Paths +This test case provides **two parallel deployment paths**, both `kubectl`-only: + +``` + Cosmos Reason on EKS + | + +---------------+---------------+ + | | + kubernetes/ hyperpod-eks/ + (vanilla EKS Deployment + (HyperPod Inference Operator + Service + HPA, upstream InferenceEndpointConfig CRD, + vllm/vllm-openai image) AWS-managed vLLM DLC image) + | | + +---------------+---------------+ + | + OpenAI-compatible + /v1/chat/completions + | + examples/ + image VQA / video VQA / SDG auto-label +``` + +### Why two paths? + +| Path | Purpose | +|------|---------| +| `kubernetes/` | Plain EKS users who already have a vLLM deployment. No HyperPod required. Vendor `vllm/vllm-openai` image. 
| +| `hyperpod-eks/` | HyperPod EKS clusters using the [Inference Operator](https://aws.amazon.com/blogs/architecture/unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod/) — auto KEDA + Karpenter scale-to-zero, managed KV cache, intelligent routing. AWS-managed `vllm` Deep Learning Container. | + +Both paths serve the **same** model with the **same** vLLM CLI args. Pick the path that +matches your platform. + +## Prerequisites + +| Requirement | Detail | +|-------------|--------| +| EKS cluster | Kubernetes ≥ 1.28, GPU-capable | +| GPU node | One of: g5.* (A10G 24 GB), g6.12xlarge (4× L4 24 GB), g6e.* (L40S 48 GB), p4d/p5/p5e (H100/H200) | +| NVIDIA device plugin | `nvidia-device-plugin` DaemonSet running on GPU nodes | +| Hugging Face token | Required — Cosmos Reason models are gated on HF (NVIDIA Open Model License acceptance). [Request access](https://huggingface.co/nvidia/Cosmos-Reason1-7B) on the model card first. | +| For `hyperpod-eks/` only | HyperPod Inference Operator installed. Recommended: EKS add-on `amazon-sagemaker-hyperpod-inference` (`v1.3.0-eksbuild.1`). Alternatives: `sagemaker-hyperpod-cli` v3.7.0+ (`hyp install`), or Helm chart `hyperpod-inference-operator` v2.1.0 (operator image `v3.1`). See [hyperpod-eks/README.md](hyperpod-eks/README.md#prerequisites). | + +## Models + +This test case is parameterized via `${MODEL_ID}` in `env_vars.example`. 
Supported models: + +| Model | Backbone | Min GPU memory | Recommended `--max-model-len` | Reasoning parser | +|-------|----------|---------------|------------------------------|------------------| +| **`nvidia/Cosmos-Reason1-7B`** (default for sm_8x GPUs) | Qwen2.5-VL | 24 GB (A10G/L4 OK) | 24576 (24 GB GPUs) / 32768 (40+ GB) | None — `<think>` is inline in `content` | +| **`nvidia/Cosmos-Reason2-2B`** | Qwen3-VL | 24 GB | 16384 | `--reasoning-parser qwen3` (separates `<think>` into `reasoning_content`) | +| **`nvidia/Cosmos-Reason2-8B`** (default for sm_9x GPUs) | Qwen3-VL | 32 GB (L40S/H100/H200) | 16384 | `--reasoning-parser qwen3` | + +> [!NOTE] +> NVIDIA validates Cosmos Reason on H100 / GB200 / DGX Spark / Jetson AGX Thor. +> A10G (sm_86) and L4 (sm_89) are not on NVIDIA's official validated list but work in +> practice; this test case has been empirically validated on g5.8xlarge (1× A10G 24 GB). + +## Validation Status + +The configurations below were tested end-to-end on a SageMaker HyperPod EKS cluster +in `us-west-2`. All three example clients (`image_vqa`, `video_qa`, `auto_label`) +were exercised against both deployment paths. + +| Path | Model | Hardware | image_vqa | video_qa | auto_label | +|------|-------|----------|-----------|----------|------------| +| `kubernetes/` | Reason1-7B | g5.8xlarge (A10G, TP=1) | 18.3 s | 28.9 s | 20.3 s | +| `kubernetes/` | Reason2-8B | g6.12xlarge (4× L4, TP=4) | 8.2 s | 16.9 s | 3.4 s | +| `hyperpod-eks/` | Reason1-7B | ml.g5.8xlarge (A10G, TP=1) | 21.3 s | unsupported¹ | 18.1 s | +| `hyperpod-eks/` | Reason2-8B | ml.g6.12xlarge (4× L4, TP=4) | 13.8 s | unsupported¹ | 4.5 s | + +¹ The AWS vLLM DLC `vllm:0.17-gpu-py312` (vLLM 0.17.0) does not expose +`--mm-processor-kwargs` through the Inference Operator. The sample 5.3 MB meteor +clip tokenizes to ~19K embedding tokens, exceeding the default pre-allocated +16384-token encoder cache.
The `kubernetes/` path uses upstream vLLM v0.21.0 and +includes the necessary args by default — use it for video workloads. + +### Empirical findings + +- **Reason2-8B on g6.12xlarge (4× L4, TP=4) is 2–5× faster** than Reason1-7B on a + single A10G for image workloads. Speedup is driven by tensor parallelism and the + L4's higher fp16/bf16 throughput, not model size. +- **Video workloads need expanded encoder cache.** Default vLLM config (16384-token + cache) cannot handle the sample video; shipped `kubernetes/deployment.yaml` adds + `--limit-mm-per-prompt`, `--mm-processor-kwargs '{"max_pixels":20000000,"fps":1.0}'`, + and bumps `MAX_MODEL_LEN=24576` to make video work out of the box. +- **`--reasoning-parser qwen3` is Reason2-only.** Enabling it for Reason1 + (Qwen2.5-VL backbone) causes `RuntimeError: Engine core initialization failed`. + `hyperpod-eks/endpoint.yaml` ships with it commented out by default — uncomment + for Reason2. + +## Quick Start (vanilla EKS, vendor image) + +> [!IMPORTANT] +> **Video workloads require expanded encoder cache.** +> `kubernetes/deployment.yaml` ships with `--limit-mm-per-prompt`, +> `--mm-processor-kwargs '{"max_pixels":20000000,"fps":1.0}'`, and `MAX_MODEL_LEN=24576` +> so the sample video works without modification. If you swap in a larger or +> higher-resolution clip and see `400 exceeds the pre-allocated encoder cache size`, +> increase `max_pixels`. The `hyperpod-eks/` DLC path does not currently expose +> these flags through the Inference Operator and is recommended for image workloads only. + +```bash +# 1. Set environment +cp env_vars.example env_vars +# Edit env_vars — at minimum set HF_TOKEN +source env_vars + +# 2. Validate required variables are set +for v in INSTANCE_TYPE MODEL_ID NAMESPACE TENSOR_PARALLEL_SIZE \ + MAX_MODEL_LEN GPU_MEMORY_UTILIZATION DTYPE INVOCATION_PORT \ + VLLM_IMAGE_VANILLA HF_TOKEN; do + [ -z "${!v}" ] && echo "ERROR: \$$v is unset" && exit 1 +done + +# 3. 
Create HF token Secret +kubectl create secret generic hf-token \ + --from-literal=token=${HF_TOKEN} + +# 4. Dry-run to confirm manifest renders correctly +envsubst < kubernetes/deployment.yaml | kubectl apply --dry-run=client -f - + +# 5. Deploy +envsubst < kubernetes/deployment.yaml | kubectl apply -f - + +# 6. Wait for /health (3-5 min for first launch — HF download + CUDA graph) +kubectl wait --for=condition=Ready pod -l app=cosmos-reason --timeout=10m + +# 7. Port-forward and test +kubectl port-forward deploy/cosmos-reason 8000:8000 & +python3 examples/image_vqa.py --image examples/sample.jpg +``` + +> [!NOTE] +> The default `<think>/<answer>` system prompt and `--reasoning-parser qwen3` are +> mutually exclusive: +> - **Reason1 (Qwen2.5-VL):** Keep the default system prompt. Do NOT enable the +> reasoning parser. `<think>...</think>` appears inline in `content`. +> - **Reason2 (Qwen3-VL):** Omit the system prompt (`--system-prompt ""`) and +> enable `--reasoning-parser qwen3`. Reasoning is split into `reasoning_content`. +> Leaving the system prompt on can cause Reason2 to produce minimal answers. + +## Quick Start (HyperPod Inference Operator) + +```bash +# 1. Set environment (same env_vars as above) +source env_vars + +# 2. Create HF token Secret in the same namespace as the InferenceEndpointConfig +kubectl create secret generic hf-token \ + --from-literal=token=${HF_TOKEN} + +# 3. Deploy via the operator +envsubst < hyperpod-eks/endpoint.yaml | kubectl apply -f - + +# 4. Wait for the operator to mark the endpoint Ready +kubectl get inferenceendpointconfigs -w + +# 5. Invoke through the operator-managed ALB (or via SageMaker SDK) +python3 examples/image_vqa.py \ + --endpoint $(kubectl get inferenceendpointconfig cosmos-reason -o jsonpath='{.status.endpointUrl}') \ + --image examples/sample.jpg +``` + +## Use Cases (verified examples in `examples/`) + +| Script | What it does | +|--------|--------------| +| `examples/image_vqa.py` | Single-image visual question answering.
Pattern: drive-recorder review, content moderation. | +| `examples/video_qa.py` | Short video clip Q&A. Pattern: AV scene understanding (Uber pattern). | +| `examples/auto_label.py` | Batch auto-labeling with `<think>...</think><answer>...</answer>` schema. Pattern: SDG critic loop, training data curation. | + +## Configuration Reference + +| Variable | Default | Purpose | +|----------|---------|---------| +| `MODEL_ID` | `nvidia/Cosmos-Reason1-7B` | HF model ID. Override to `Cosmos-Reason2-8B` on L40S/H100. | +| `VLLM_IMAGE_VANILLA` (kubernetes) | `vllm/vllm-openai:v0.21.0` | Upstream vLLM container. Pin to a specific version, never `:latest`. | +| `VLLM_IMAGE_AWS_DLC` (hyperpod-eks) | `vllm:0.17-gpu-py312` (AWS DLC) | AWS-managed vLLM DLC. ECR path: `763104351884.dkr.ecr.${AWS_REGION}.amazonaws.com/vllm:0.17-gpu-py312` | +| `INSTANCE_TYPE` | `g5.8xlarge` | A10G 24 GB. Other validated: `g6.12xlarge` (4×L4), `g6e.4xlarge` (1×L40S 48 GB). `env_vars.example` derives `HYPERPOD_INSTANCE_TYPE` as `ml.${INSTANCE_TYPE}` for the HyperPod path. | +| `MAX_MODEL_LEN` | `24576` | Sized for video out-of-the-box on 24 GB GPUs. Reduce to `8192` if OOM during CUDA graph capture; increase to `32768` on 40 GB+ GPUs. Cosmos Reason native context is 256K — must reduce for non-H100 hardware. | +| `GPU_MEMORY_UTILIZATION` | `0.92` | Fraction of GPU memory vLLM may allocate. Reduce to `0.85` if OOM during CUDA graph capture. | +| `TENSOR_PARALLEL_SIZE` | `1` | Single GPU for 7B/2B. Set to `4` for 8B on g6.12xlarge (4× L4 24 GB). | +| `NAMESPACE` | `default` | Kubernetes namespace | +| `HF_TOKEN` | (none — required) | Hugging Face token with model access. Stored as a `Secret`.
| + +## Cleanup + +```bash +# Vanilla EKS path +kubectl delete -f kubernetes/ + +# HyperPod Inference Operator path +kubectl delete -f hyperpod-eks/ + +# Both paths +kubectl delete secret hf-token +``` + +## Troubleshooting + +| Symptom | Cause | Fix | +|---------|-------|-----| +| `GatedRepoError: 401` on first deploy | No `HF_TOKEN` provided | Set `HF_TOKEN` env var, recreate `hf-token` Secret | +| `GatedRepoError: 403` after providing token | Token valid but account not on access list | Visit the model card on HF and click "Request Access". NVIDIA-gated models require accepting the NVIDIA Open Model License. | +| Pod stuck in `ContainerCreating` for 5+ min | Image pull (12 GB vendor / 23 GB AWS DLC) | Normal on first deploy. Check `kubectl describe pod` for "Pulling image". | +| Pod `Running` but `/health` returns 404 | vLLM still loading model + compiling CUDA graphs | Wait. First launch takes 3-8 min on 7B-class models. Setting `--enforce-eager` skips CUDA graphs (faster startup, slower inference). | +| `OutOfMemoryError` during CUDA graph capture | `--gpu-memory-utilization` too high or `--max-model-len` too long | Drop `--gpu-memory-utilization` to `0.85`, drop `--max-model-len` to `4096`. | +| Flash-Attention `headdim not multiple of 32` error (Reason2 / Qwen3-VL only) | vLLM internal fork of FA rejects Qwen3-VL ViT head dims | Do NOT set `VLLM_ATTENTION_BACKEND=FLASH_ATTN`. Let vLLM auto-pick. Issue [#27562](https://github.com/vllm-project/vllm/issues/27562) closed Apr 2026; v0.21.0+ is fixed. | +| Reason2 returns `<think>` text inline in `content` (not separate `reasoning_content`) | Missing `--reasoning-parser qwen3` | Add `--reasoning-parser qwen3` to vLLM args. Required for Reason2; not applicable to Reason1. | +| Throughput low / TTFT high | `--enforce-eager` skips CUDA graphs | Remove `--enforce-eager` to enable graph compilation. This adds ~2 min to startup but improves throughput by ~30%.
| +| `400 The decoder prompt contains a(n) video item with X embedding tokens, which exceeds the pre-allocated encoder cache size` | Default encoder cache too small for the input video (16384 for Reason1, ~5000 for Reason2-8B) | On the `kubernetes/` path, raise `--mm-processor-kwargs '{"max_pixels":...,"fps":1.0}'` until cache > video tokens. On `hyperpod-eks/`, use a shorter clip (≤5 s @ 480p) or switch to `kubernetes/`. | +| `RuntimeError: Engine core initialization failed` after model load on `hyperpod-eks/` with Reason1 | `--reasoning-parser qwen3` enabled but Reason1 uses Qwen2.5-VL backbone (parser is Qwen3-only) | Comment out `--reasoning-parser qwen3` and `SM_VLLM_REASONING_PARSER=qwen3` in `hyperpod-eks/endpoint.yaml`. Re-enable only for Reason2. | +| `kubectl get secret hf-token -o jsonpath='{.data.token}' \| base64 -d` returns `REPLACE_WITH_HF_TOKEN` | Applied `hf-token-secret.yaml.example` directly without replacing the literal placeholder (it is not an envsubst template) | Use `kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN` per Quick Start. The `.example` YAML is reference-only. | +| Video request 400 on `hyperpod-eks/` even with short clip | DLC v0.17.0 does not surface `--mm-processor-kwargs` or `--limit-mm-per-prompt` to the Inference Operator | Use the `kubernetes/` path for video workloads. The DLC ships with a fixed encoder-cache budget; expanding it requires a custom DLC image or a newer DLC tag when available. 
| + +## References + +- NVIDIA Cosmos: https://www.nvidia.com/en-us/ai/cosmos/ +- Cosmos Reason1-7B model card: https://huggingface.co/nvidia/Cosmos-Reason1-7B +- Cosmos Reason2-8B model card: https://huggingface.co/nvidia/Cosmos-Reason2-8B +- Cosmos Reason2 repo (NVIDIA): https://github.com/nvidia-cosmos/cosmos-reason2 +- vLLM: https://github.com/vllm-project/vllm +- vLLM Qwen3-VL recipe: https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-VL.html +- AWS vLLM DLC repo: https://github.com/aws/deep-learning-containers +- HyperPod Inference Operator setup blog: https://aws.amazon.com/blogs/architecture/unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod/ +- HyperPod Inference Operator best practices: https://aws.amazon.com/blogs/machine-learning/best-practices-to-run-inference-on-amazon-sagemaker-hyperpod/ diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/env_vars.example b/3.test_cases/pytorch/vllm/cosmos-reason/env_vars.example new file mode 100644 index 000000000..62ebc16d3 --- /dev/null +++ b/3.test_cases/pytorch/vllm/cosmos-reason/env_vars.example @@ -0,0 +1,63 @@ +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +# +# Source this file with: source env_vars +# All variables here are consumed by the kubernetes/ and hyperpod-eks/ manifests +# (rendered with `envsubst`) and by the examples/ scripts. + +# ---- AWS / cluster context ---- +export AWS_REGION="us-west-2" +export AWS_ACCOUNT_ID="123456789012" +export NAMESPACE="default" + +# ---- Model selection ---- +# Default: Cosmos-Reason1-7B (Qwen2.5-VL backbone). Fits on A10G 24 GB / L4 24 GB / L40S 48 GB. 
+# Alternates: +# nvidia/Cosmos-Reason2-2B (Qwen3-VL, 2B params, also fits 24 GB GPUs) +# nvidia/Cosmos-Reason2-8B (Qwen3-VL, 8B params, requires ≥32 GB GPU) +export MODEL_ID="nvidia/Cosmos-Reason1-7B" + +# Cosmos Reason models are gated on Hugging Face — accept terms on the model card first, +# then create a token at https://huggingface.co/settings/tokens with read access. +# DO NOT commit this value. Pass it via your shell or the CI secret store. +export HF_TOKEN="" + +# ---- vLLM container ---- +# kubernetes/ (vanilla EKS) — upstream image +export VLLM_IMAGE_VANILLA="vllm/vllm-openai:v0.21.0" + +# hyperpod-eks/ (Inference Operator) — AWS-managed vLLM DLC +# Tags: vllm:0.17-gpu-py312 (vLLM 0.17.0) | vllm:server-sagemaker-cuda-v1 (vLLM 0.19.1) +export VLLM_IMAGE_AWS_DLC="763104351884.dkr.ecr.${AWS_REGION}.amazonaws.com/vllm:0.17-gpu-py312" + +# Additionally for HyperPod path: set the TLS bucket. You can find this in the HyperPod console within the `Inference` tab called `S3 bucket for TLS certificates` +export TLS_CERT_S3_URI="s3://hyperpod-tls-/certs" + +# ---- Hardware sizing ---- +# Validated combinations (model | GPU | TP | max-model-len): +# Cosmos-Reason1-7B | A10G 24G | TP=1 | 24576 (reduce to 8192 if OOM during CUDA graph capture) +# Cosmos-Reason1-7B | L4 24G | TP=1 | 24576 (reduce to 8192 if OOM during CUDA graph capture) +# Cosmos-Reason1-7B | L40S 48G | TP=1 | 32768 +# Cosmos-Reason2-2B | A10G/L4 | TP=1 | 16384 +# Cosmos-Reason2-8B | L40S 48G | TP=1 | 16384 +# Cosmos-Reason2-8B | g6.12xl | TP=4 | 16384 (4× L4, PCIe-only — slower than NVLink) +# Cosmos-Reason2-8B | H100 80G | TP=1 | 32768 +export INSTANCE_TYPE="g5.8xlarge" +export HYPERPOD_INSTANCE_TYPE="ml.${INSTANCE_TYPE}" +export TENSOR_PARALLEL_SIZE="1" +export MAX_MODEL_LEN="24576" +export GPU_MEMORY_UTILIZATION="0.92" + +# ---- vLLM serving args ---- +# DTYPE: bfloat16 is the only NVIDIA-tested precision for Cosmos Reason +export DTYPE="bfloat16" + +# ---- Reason2 (Qwen3-VL) vLLM args ---- +# If 
deploying Cosmos-Reason2-*, you must manually edit the deployment manifest: +# kubernetes/deployment.yaml: uncomment the --reasoning-parser and --media-io-kwargs lines +# hyperpod-eks/endpoint.yaml: uncomment the --reasoning-parser and SM_VLLM_REASONING_PARSER lines +# See the Troubleshooting section in README.md for details. + +# ---- Endpoint defaults ---- +export ENDPOINT_NAME="cosmos-reason" +export INVOCATION_PORT="8000" diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/examples/.gitignore b/3.test_cases/pytorch/vllm/cosmos-reason/examples/.gitignore new file mode 100644 index 000000000..2980dad9d --- /dev/null +++ b/3.test_cases/pytorch/vllm/cosmos-reason/examples/.gitignore @@ -0,0 +1,8 @@ +# Downloaded sample media (fetched by download_samples.sh) +sample.jpg +sample_meteor.webm +*.webm +*.mp4 + +# JSONL output from auto_label.py +*.jsonl diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/examples/README.md b/3.test_cases/pytorch/vllm/cosmos-reason/examples/README.md new file mode 100644 index 000000000..859de6291 --- /dev/null +++ b/3.test_cases/pytorch/vllm/cosmos-reason/examples/README.md @@ -0,0 +1,58 @@ +# Cosmos Reason — Client Examples + +Reference Python clients exercising three Cosmos Reason use cases against an OpenAI-compatible +vLLM endpoint. + +| Script | Use case | Latency target | +|--------|----------|---------------| +| `image_vqa.py` | Single-image visual Q&A | < 1 s for short reply | +| `video_qa.py` | Short video clip Q&A | 5-15 s | +| `auto_label.py` | SDG critic loop — `<think>` reasoning + structured `<answer>` JSON | 10-30 s | + +## Setup + +```bash +pip install requests urllib3 + +# Download sample media (image + video) +./download_samples.sh + +# If Pod is in-cluster, port-forward first: +kubectl port-forward svc/cosmos-reason 8000:8000 & + +# OR set the operator-managed endpoint URL: +export ENDPOINT="https://cosmos-reason-.elb..amazonaws.com" +``` + +By default all scripts hit `http://localhost:8000`.
Override with `--endpoint` or +`$ENDPOINT`. Use `--insecure` if the endpoint uses a self-signed TLS certificate +(e.g., operator-managed ALB). + +## Examples + +```bash +# Single image +python3 image_vqa.py --image sample.jpg \ + --prompt "What is the safety risk in this scene?" + +# Short video clip +python3 video_qa.py --video sample_meteor.webm \ + --prompt "Describe what is happening in this video." + +# Batch SDG auto-labeling (with retry on transient errors) +python3 auto_label.py --image-dir . --output labels.jsonl --limit 1 + +# With self-signed cert (operator-managed ALB) +python3 image_vqa.py --endpoint https://cosmos-reason.elb.us-west-2.amazonaws.com \ + --image sample.jpg --insecure +``` + +## Notes + +- Cosmos-Reason1 (Qwen2.5-VL) emits `<think>...</think><answer>...</answer>` inline + in the `content` field. The scripts here parse those tags. +- Cosmos-Reason2 (Qwen3-VL) with `--reasoning-parser qwen3` separates `<think>` into + the response's `reasoning_content` field. The scripts handle both formats. +- `MODEL_ID` is read from `$MODEL_ID` env var, defaulting to `nvidia/Cosmos-Reason1-7B`. +- `auto_label.py` supports `--max-retries N` (default 3) for transient HTTP errors + (429, 502, 503, 504) with exponential backoff. diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/examples/auto_label.py b/3.test_cases/pytorch/vllm/cosmos-reason/examples/auto_label.py new file mode 100755 index 000000000..3e93d6f84 --- /dev/null +++ b/3.test_cases/pytorch/vllm/cosmos-reason/examples/auto_label.py @@ -0,0 +1,187 @@ +#!/usr/bin/env python3 +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +""" +auto_label.py — Synthetic Data Generation (SDG) auto-labeling using Cosmos Reason. + +Pattern: AV training-data captioning, as adopted by Uber +(https://blogs.nvidia.com/blog/nemotron-cosmos-reasoning-enterprise-physical-ai/). +Each input image gets a structured JSON label plus the model's chain-of-thought +reasoning trace.
Useful for filtering implausible Cosmos-Predict outputs in an +SDG critic loop, or for bootstrapping training labels. + +Output: one JSON object per line (JSONL). + +Example: + python3 auto_label.py --image-dir ./scenes/ --output labels.jsonl + python3 auto_label.py --image-dir ./scenes/ --schema custom_schema.json +""" + +import argparse +import json +import os +import re +import sys +import time +from pathlib import Path +from typing import Optional + +import requests +import urllib3 +from requests.adapters import HTTPAdapter +from urllib3.util.retry import Retry + +from image_vqa import encode_image, parse_reasoning_response # reuse helpers + + +DEFAULT_SCHEMA = { + "scene": "string — short description", + "objects": "list[string] — primary visible objects", + "hazards": "list[string] — identified safety concerns", + "weather": "string — clear / rain / snow / fog / cloudy / unknown", + "time_of_day": "string — dawn / day / dusk / night / unknown", +} + + +def make_session(max_retries: int) -> requests.Session: + """Build a requests.Session with retry logic for transient HTTP errors.""" + session = requests.Session() + retry = Retry( + total=max_retries, + backoff_factor=1.0, + status_forcelist=[429, 502, 503, 504], + allowed_methods=["POST"], + ) + session.mount("http://", HTTPAdapter(max_retries=retry)) + session.mount("https://", HTTPAdapter(max_retries=retry)) + return session + + +def extract_json_from_answer(answer: str) -> Optional[dict]: + """Try hard to pull a JSON object out of the model's <answer> block.""" + if not answer: + return None + # JSON inside ```json ...
``` fence + fence = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", answer, re.DOTALL) + if fence: + try: + return json.loads(fence.group(1)) + except json.JSONDecodeError: + pass + # Bare JSON object + bare = re.search(r"\{.*\}", answer, re.DOTALL) + if bare: + try: + return json.loads(bare.group(0)) + except json.JSONDecodeError: + pass + return None + + +def label_image(image_path: Path, endpoint: str, model: str, schema: dict, + max_tokens: int, session: requests.Session, verify_tls: bool) -> dict: + image_url = encode_image(str(image_path)) + + system = ( + "You are auto-labeling driving scenes for AV training data. " + "Output your reasoning in <think>...</think>, then output a JSON label " + f"in <answer>...</answer> matching this schema: {json.dumps(schema)}" + ) + + payload = { + "model": model, + "messages": [ + {"role": "system", "content": system}, + {"role": "user", "content": [ + {"type": "image_url", "image_url": {"url": image_url}}, + {"type": "text", "text": "Label this scene."}, + ]}, + ], + "max_tokens": max_tokens, + "temperature": 0.4, + } + + start = time.monotonic() + r = session.post(f"{endpoint}/v1/chat/completions", + headers={"Content-Type": "application/json"}, + json=payload, + verify=verify_tls, + timeout=300) + elapsed_ms = int((time.monotonic() - start) * 1000) + r.raise_for_status() + data = r.json() + + msg = data["choices"][0]["message"] + reasoning, answer = parse_reasoning_response(msg) + label = extract_json_from_answer(answer) + + return { + "image": str(image_path), + "elapsed_ms": elapsed_ms, + "completion_tokens": data["usage"]["completion_tokens"], + "label": label, + "reasoning": reasoning, + "raw_answer": answer if not label else None, + } + + +def main() -> int: + parser = argparse.ArgumentParser() + parser.add_argument("--endpoint", default=os.environ.get("ENDPOINT", "http://localhost:8000")) + parser.add_argument("--model", default=os.environ.get("MODEL_ID", "nvidia/Cosmos-Reason1-7B")) + parser.add_argument("--image-dir", required=True,
help="Directory containing images to label") + parser.add_argument("--output", default="labels.jsonl", help="JSONL output path") + parser.add_argument("--schema", help="Path to a JSON file with the label schema (overrides default)") + parser.add_argument("--max-tokens", type=int, default=800) + parser.add_argument("--limit", type=int, default=0, + help="Process at most N images (0 = unlimited)") + parser.add_argument("--max-retries", type=int, default=3, + help="Max retries per image on transient HTTP errors (429/502/503/504)") + parser.add_argument("--insecure", action="store_true", + help="Disable TLS certificate verification (for self-signed certs)") + args = parser.parse_args() + + verify_tls = not args.insecure + if args.insecure: + urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) + + session = make_session(args.max_retries) + + schema = DEFAULT_SCHEMA + if args.schema: + with open(args.schema) as f: + schema = json.load(f) + + image_dir = Path(args.image_dir) + images = sorted([p for p in image_dir.iterdir() + if p.suffix.lower() in {".jpg", ".jpeg", ".png", ".webp"}]) + if args.limit > 0: + images = images[:args.limit] + + if not images: + print(f"No images found in {image_dir}", file=sys.stderr) + return 1 + + print(f"Labeling {len(images)} images against {args.endpoint} ({args.model})...") + + with open(args.output, "w") as out: + for i, img in enumerate(images, 1): + try: + result = label_image(img, args.endpoint, args.model, schema, + args.max_tokens, session, verify_tls) + out.write(json.dumps(result) + "\n") + out.flush() + ok = "OK" if result["label"] else "PARSE_FAILED" + print(f" [{i}/{len(images)}] {img.name} {ok} ({result['elapsed_ms']} ms)") + except Exception as exc: # noqa: BLE001 + err = {"image": str(img), "error": str(exc)} + out.write(json.dumps(err) + "\n") + out.flush() + print(f" [{i}/{len(images)}] {img.name} ERROR: {exc}", file=sys.stderr) + + print(f"\nWrote {args.output}") + return 0 + + +if __name__ == "__main__": 
+ sys.exit(main()) diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/examples/download_samples.sh b/3.test_cases/pytorch/vllm/cosmos-reason/examples/download_samples.sh new file mode 100755 index 000000000..40fd0f753 --- /dev/null +++ b/3.test_cases/pytorch/vllm/cosmos-reason/examples/download_samples.sh @@ -0,0 +1,28 @@ +#!/usr/bin/env bash +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +# +# Download sample media for the Cosmos Reason example clients. +# +# Image: Unsplash (Unsplash License — free for commercial and non-commercial use) +# Video: Wikimedia Commons (CC BY 3.0) +set -euo pipefail + +cd "$(dirname "$0")" + +echo "Downloading sample.jpg (urban street scene from Unsplash)..." +curl -L -o sample.jpg \ + "https://images.unsplash.com/photo-1449824913935-59a10b8d2000?w=640" + +echo "Downloading sample video from Wikimedia Commons..." +curl -L -o sample_meteor.webm \ + "https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file/2013_Russian_meteor_event_(Magnitogorsk).webm" + +echo "" +echo "Downloaded:" +ls -lh sample.jpg sample_meteor.webm +echo "" +echo "Run examples:" +echo " python3 image_vqa.py --image sample.jpg" +echo " python3 video_qa.py --video sample_meteor.webm" +echo " python3 auto_label.py --image-dir . --limit 1" diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/examples/image_vqa.py b/3.test_cases/pytorch/vllm/cosmos-reason/examples/image_vqa.py new file mode 100755 index 000000000..03732e0b6 --- /dev/null +++ b/3.test_cases/pytorch/vllm/cosmos-reason/examples/image_vqa.py @@ -0,0 +1,119 @@ +#!/usr/bin/env python3 +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +""" +image_vqa.py — Single-image visual question answering against a Cosmos Reason vLLM endpoint. + +Use cases: drive-recorder review, content moderation, scene understanding. 
+ +Example: + python3 image_vqa.py --image sample.jpg \ + --prompt "What is happening in this scene? Reason about the visible cues." +""" + +import argparse +import base64 +import json +import os +import re +import sys +import time +from pathlib import Path +from typing import Optional, Tuple + +import requests +import urllib3 + + +def encode_image(path: str) -> str: + suffix = Path(path).suffix.lstrip(".").lower() or "jpeg" + if suffix == "jpg": + suffix = "jpeg" + with open(path, "rb") as f: + b64 = base64.b64encode(f.read()).decode("ascii") + return f"data:image/{suffix};base64,{b64}" + + +def parse_reasoning_response(message: dict) -> Tuple[Optional[str], str]: + """Return (reasoning_trace, answer) for both Reason1 (inline <think>) and Reason2 + (separate reasoning_content) response shapes.""" + reasoning = message.get("reasoning_content") or message.get("reasoning") + content = message.get("content") or "" + + if reasoning: + return reasoning.strip(), content.strip() + + # Reason1 path — <think>...</think> inline in content + think_match = re.search(r"<think>\s*(.*?)\s*</think>", content, re.DOTALL) + answer_match = re.search(r"<answer>\s*(.*?)\s*</answer>", content, re.DOTALL) + if think_match: + trace = think_match.group(1).strip() + if answer_match: + return trace, answer_match.group(1).strip() + # No <answer> tag — return the rest of content after </think> + rest = content[think_match.end():].strip() + return trace, rest + return None, content.strip() + + +def main() -> int: + parser = argparse.ArgumentParser() + parser.add_argument("--endpoint", default=os.environ.get("ENDPOINT", "http://localhost:8000")) + parser.add_argument("--model", default=os.environ.get("MODEL_ID", "nvidia/Cosmos-Reason1-7B")) + parser.add_argument("--image", required=True, help="Path to local image file") + parser.add_argument("--prompt", default="What is in this image, and what is happening?
Reason about visible cues.") + parser.add_argument("--max-tokens", type=int, default=512) + parser.add_argument("--temperature", type=float, default=0.6) + parser.add_argument("--system-prompt", + default="Answer in <think>your reasoning</think><answer>your answer</answer> format.") + parser.add_argument("--insecure", action="store_true", + help="Disable TLS certificate verification (for self-signed certs)") + args = parser.parse_args() + + verify_tls = not args.insecure + if args.insecure: + urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) + + image_url = encode_image(args.image) + payload = { + "model": args.model, + "messages": [ + {"role": "system", "content": args.system_prompt}, + {"role": "user", "content": [ + {"type": "image_url", "image_url": {"url": image_url}}, + {"type": "text", "text": args.prompt}, + ]}, + ], + "max_tokens": args.max_tokens, + "temperature": args.temperature, + } + + start = time.monotonic() + r = requests.post(f"{args.endpoint}/v1/chat/completions", + headers={"Content-Type": "application/json"}, + json=payload, + verify=verify_tls, + timeout=300) + elapsed_ms = int((time.monotonic() - start) * 1000) + r.raise_for_status() + data = r.json() + + msg = data["choices"][0]["message"] + reasoning, answer = parse_reasoning_response(msg) + + print(f"=== Response ({elapsed_ms} ms, {data['usage']['completion_tokens']} tokens) ===") + print() + if reasoning: + print("--- Reasoning ---") + print(reasoning) + print() + print("--- Answer ---") + print(answer) + print() + print("--- Usage ---") + print(json.dumps(data["usage"], indent=2)) + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/examples/video_qa.py b/3.test_cases/pytorch/vllm/cosmos-reason/examples/video_qa.py new file mode 100755 index 000000000..150490e77 --- /dev/null +++ b/3.test_cases/pytorch/vllm/cosmos-reason/examples/video_qa.py @@ -0,0 +1,102 @@ +#!/usr/bin/env python3 +# Copyright Amazon.com, Inc. or its affiliates. 
All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +""" +video_qa.py — Short video clip Q&A against a Cosmos Reason vLLM endpoint. + +Use case: AV scene understanding (Uber pattern), drive-recorder review, video moderation. + +Cosmos-Reason2 (Qwen3-VL) is video-native via `--media-io-kwargs '{"video":{"num_frames":-1}}'`. +Cosmos-Reason1 (Qwen2.5-VL) uses `--limit-mm-per-prompt '{"image":10,"video":10}'`. + +Example: + python3 video_qa.py --video clip.mp4 \ + --prompt "Describe the trajectory of the vehicle in this clip." +""" + +import argparse +import base64 +import json +import os +import sys +import time +from pathlib import Path + +import requests +import urllib3 + +from image_vqa import parse_reasoning_response # reuse the parser + + +def encode_video(path: str) -> str: + suffix = Path(path).suffix.lstrip(".").lower() or "mp4" + with open(path, "rb") as f: + b64 = base64.b64encode(f.read()).decode("ascii") + return f"data:video/{suffix};base64,{b64}" + + +def main() -> int: + parser = argparse.ArgumentParser() + parser.add_argument("--endpoint", default=os.environ.get("ENDPOINT", "http://localhost:8000")) + parser.add_argument("--model", default=os.environ.get("MODEL_ID", "nvidia/Cosmos-Reason1-7B")) + parser.add_argument("--video", required=True, help="Path to local video file (mp4 / webm)") + parser.add_argument("--prompt", default="Describe what is happening in this video. 
Reason about the temporal cues.") + parser.add_argument("--max-tokens", type=int, default=800) + parser.add_argument("--temperature", type=float, default=0.5) + parser.add_argument("--system-prompt", + default="Answer in <think>your reasoning</think><answer>your answer</answer> format.") + parser.add_argument("--insecure", action="store_true", + help="Disable TLS certificate verification (for self-signed certs)") + args = parser.parse_args() + + verify_tls = not args.insecure + if args.insecure: + urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) + + if not os.path.exists(args.video): + print(f"ERROR: video not found at {args.video}", file=sys.stderr) + return 2 + + video_url = encode_video(args.video) + payload = { + "model": args.model, + "messages": [ + {"role": "system", "content": args.system_prompt}, + {"role": "user", "content": [ + {"type": "video_url", "video_url": {"url": video_url}}, + {"type": "text", "text": args.prompt}, + ]}, + ], + "max_tokens": args.max_tokens, + "temperature": args.temperature, + } + + start = time.monotonic() + r = requests.post(f"{args.endpoint}/v1/chat/completions", + headers={"Content-Type": "application/json"}, + json=payload, + verify=verify_tls, + timeout=300) + elapsed_ms = int((time.monotonic() - start) * 1000) + r.raise_for_status() + data = r.json() + + msg = data["choices"][0]["message"] + reasoning, answer = parse_reasoning_response(msg) + + print(f"=== Response ({elapsed_ms} ms, {data['usage']['completion_tokens']} tokens) ===") + print() + if reasoning: + print("--- Reasoning ---") + print(reasoning) + print() + print("--- Answer ---") + print(answer) + print() + print("--- Usage ---") + print(json.dumps(data["usage"], indent=2)) + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/README.md b/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/README.md new file mode 100644 index 000000000..bddb81d01 --- /dev/null +++ 
b/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/README.md @@ -0,0 +1,167 @@ +# Cosmos Reason on the SageMaker HyperPod Inference Operator + +`InferenceEndpointConfig` CRD reference for serving Cosmos Reason on a SageMaker HyperPod +EKS cluster using the [HyperPod Inference Operator](https://aws.amazon.com/blogs/architecture/unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod/). + +This path uses the **AWS-managed vLLM Deep Learning Container** (`vllm:0.17-gpu-py312`) +with EFA, NCCL, and security patches pre-baked. The DLC tag `0.17-gpu-py312` corresponds +to vLLM 0.17.0; the [`../kubernetes/`](../kubernetes/) path uses upstream +`vllm/vllm-openai:v0.21.0` directly. + +For plain EKS clusters without HyperPod, or HyperPod Clusters without the Inference Operator, use the [`../kubernetes/`](../kubernetes/) path. + +## What's here + +| File | Purpose | +|------|---------| +| `endpoint.yaml` | `InferenceEndpointConfig` CRD spec | +| `hf-token-secret.yaml.example` | Reference for HF token Secret (recommended path is `kubectl create`) | + +## Prerequisites + +- HyperPod EKS cluster with at least one GPU node +- HyperPod Inference Operator installed. Three install paths, each with its own version scheme: + - **EKS Add-on (recommended)** — add-on `amazon-sagemaker-hyperpod-inference`, versioned + as `vX.Y.Z-eksbuild.N` (latest `v1.3.0-eksbuild.1` as of this writing). Install via the + HyperPod console (Add-ons → Inference Operator) or + `aws eks create-addon --addon-name amazon-sagemaker-hyperpod-inference`. Confirm the + current version for your region with + `aws eks describe-addon-versions --addon-name amazon-sagemaker-hyperpod-inference`. + - **`sagemaker-hyperpod-cli`** — CLI `v3.7.0+` with `hyp install`. + - **Helm chart** — `hyperpod-inference-operator` subchart `v2.1.0`, operator image `v3.1`. + Helm install may be deprecated in a future release in favor of the EKS Add-on. 
+- The Inference Operator's prerequisite IRSA roles must be configured at install time + (the operator does NOT need per-endpoint IAM) +- A TLS certificate output S3 bucket for endpoint certificate management + (auto-created at install time as `sagemaker---tls-`) +- `HF_TOKEN` with access to the model — see [parent README](../README.md#prerequisites) + +## Deploy + +```bash +# 1. Source environment +cd .. +cp env_vars.example env_vars +# Edit env_vars — set HF_TOKEN, INSTANCE_TYPE, etc. +source env_vars + + + +# 2. Create the HF token Secret +kubectl create secret generic hf-token \ + --namespace=${NAMESPACE} \ + --from-literal=token=${HF_TOKEN} + +# 3. Render and apply +cd hyperpod-eks/ +envsubst < endpoint.yaml | kubectl apply -f - + +# 4. Watch the operator drive the deployment +kubectl get inferenceendpointconfig cosmos-reason -w + +# Once status reports Ready, the endpoint URL is available: +kubectl get inferenceendpointconfig cosmos-reason \ + -o jsonpath='{.status.endpointUrl}' +``` + +First-launch is **5-10 minutes** — image pull (~23 GB AWS DLC) + HF model download + +vLLM init. Bump `maxDeployTimeInSeconds` to `3600` if the default `1800` proves too short. + +You are ready to test when the SageMaker Endpoint is successfully created. 
You can check this with: + +```bash +aws sagemaker describe-endpoint --endpoint-name cosmos-reason --region $AWS_REGION +``` + +## Test + +There are three ways to reach the deployed model: + +### Option 1: Port-forward (simplest, works from any machine) + +```bash +kubectl port-forward deploy/cosmos-reason 8000:8080 & + +# Health check +curl -s http://localhost:8000/health + +# Run an example +cd ../examples +python3 image_vqa.py --endpoint http://localhost:8000 --image sample.jpg --model "${MODEL_ID}" --system-prompt "" + +# Batch auto-label a directory of images +python3 auto_label.py --endpoint http://localhost:8000 --image-dir ./scenes/ --model "${MODEL_ID}" --output labels.jsonl +``` + +### Option 2: Operator-managed ALB (in-VPC by default) + +The operator provisions an ALB with TLS. By default the ALB is internal (VPC-only), but it +can be configured as internet-facing. If exposing publicly, ensure you have authentication +and access controls in place (e.g., WAF, Cognito, or mutual TLS). + +```bash +ENDPOINT=$(kubectl get inferenceendpointconfig cosmos-reason \ + -o jsonpath='{.status.endpointUrl}') + +# Requires VPC connectivity; -k for self-signed cert +curl -k "${ENDPOINT}/health" + +cd ../examples +python3 image_vqa.py --endpoint "${ENDPOINT}" --image sample.jpg --model "${MODEL_ID}" --system-prompt "" + +# Batch auto-label +python3 auto_label.py --endpoint "${ENDPOINT}" --image-dir ./scenes/ --model "${MODEL_ID}" --output labels.jsonl +``` + +> [!NOTE] +> If `.status.endpointUrl` is empty, the operator's cert-manager integration may not have +> completed. Verify with `kubectl get ingress -A` and `kubectl get pods -n cert-manager`. + +### Option 3: SageMaker Runtime API (works from anywhere, uses IAM auth) + +Invoke via the SageMaker runtime with AWS SigV4 signing — no VPC connectivity required. + +```bash +echo '{"model":"'"${MODEL_ID}"'","messages":[{"role":"user","content":"What is happening in this scene? 
Reason about the visible cues."}],"max_tokens":64}' > /tmp/payload.json + +aws sagemaker-runtime invoke-endpoint \ + --endpoint-name cosmos-reason \ + --region ${AWS_REGION} \ + --content-type application/json \ + --body fileb:///tmp/payload.json \ + /dev/stdout +``` + +For batch auto-labeling via the SageMaker Runtime, you would need to call `invoke-endpoint` +per image with the appropriate payload. The example scripts (`image_vqa.py`, `video_qa.py`, +`auto_label.py`) use plain HTTP requests and do not support SigV4 signing — use Option 1 +or 2 with those scripts. + +## Cleanup + +```bash +envsubst < endpoint.yaml | kubectl delete -f - +kubectl delete secret hf-token -n ${NAMESPACE} +``` + +## Operational notes + +- **First reference of the AWS vLLM DLC in this repo.** AWS launched a standalone vLLM + Deep Learning Container in late 2025 (separate from DJL-LMI). Image lives at + `763104351884.dkr.ecr..amazonaws.com/vllm:`. Tags: + - `vllm:0.17-gpu-py312` — vLLM 0.17.0 + - `vllm:server-sagemaker-cuda-v1` — vLLM 0.19.1 (newer "server" tag with `SM_VLLM_*` + env-var auto-translation to CLI args) +- **`maxDeployTimeInSeconds: 3600`** — default is 1800s (30 min) which is risky for + first deploys. Vendor image pull + model download + CUDA graph compile can hit 8 min. +- **`invocationEndpoint: v1/chat/completions`** — overrides the legacy default of + `invocations`. Required for vLLM's OpenAI-compatible API. +- **`modelInvocationPort.containerPort: 8080`** — matches the AWS DLC default port. + The upstream `vllm/vllm-openai` image uses 8000; the AWS DLC uses 8080. +- **`tokenSecretRef`** under `huggingFaceModel` — the operator passes the secret to the + worker pod. Secret key MUST be `token` (not `HF_TOKEN`). +- **No `JumpStartModel` path available** — Cosmos Reason is not in `SageMakerPublicHub`, + so we use the `InferenceEndpointConfig` BYO container CRD. +- **Autoscaling** — `replicas: 1` here for simplicity. 
The operator has dual-layer + autoscaling (KEDA pod-level + Karpenter node-level) configurable via `autoScaling.*` + fields. See [operator docs](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment.html). diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/endpoint.yaml b/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/endpoint.yaml new file mode 100644 index 000000000..4f64cf0cc --- /dev/null +++ b/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/endpoint.yaml @@ -0,0 +1,82 @@ +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +--- +apiVersion: inference.sagemaker.aws.amazon.com/v1 +kind: InferenceEndpointConfig +metadata: + name: cosmos-reason + namespace: ${NAMESPACE} +spec: + endpointName: cosmos-reason + modelName: ${ENDPOINT_NAME} + instanceType: ${HYPERPOD_INSTANCE_TYPE} + invocationEndpoint: v1/chat/completions + # First deploy can take 5-10 min (23 GB AWS DLC pull + HF download + vLLM init). + # Default maxDeployTimeInSeconds is 1800; bump for headroom. 
+ maxDeployTimeInSeconds: 3600 + modelSourceConfig: + modelSourceType: huggingface + huggingFaceModel: + modelId: ${MODEL_ID} + tokenSecretRef: + name: hf-token + key: token + worker: + image: ${VLLM_IMAGE_AWS_DLC} + modelVolumeMount: + name: model-store + mountPath: /tmp/model + modelInvocationPort: + containerPort: 8080 + name: http + resources: + requests: + cpu: "8" + memory: "64Gi" + nvidia.com/gpu: "${TENSOR_PARALLEL_SIZE}" + limits: + nvidia.com/gpu: "${TENSOR_PARALLEL_SIZE}" + args: + - "--model" + - "${MODEL_ID}" + - "--max-model-len" + - "${MAX_MODEL_LEN}" + - "--tensor-parallel-size" + - "${TENSOR_PARALLEL_SIZE}" + - "--gpu-memory-utilization" + - "${GPU_MEMORY_UTILIZATION}" + - "--dtype" + - "${DTYPE}" + # For Cosmos-Reason2 (Qwen3-VL), uncomment the next two args: + # - "--reasoning-parser" + # - "qwen3" + # For Reason2 video reasoning, uncomment: + # - "--media-io-kwargs" + # - '{"video":{"num_frames":-1}}' + environmentVariables: + - name: OMP_NUM_THREADS + value: "1" + - name: VLLM_USAGE_SOURCE + value: "awsome-distributed-ai-cosmos-reason" + # AWS DLC reads SM_VLLM_* env vars — these override the entrypoint defaults. + # CLI args above may be ignored by the DLC entrypoint; env vars are the + # reliable path. + - name: SM_VLLM_MAX_MODEL_LEN + value: "${MAX_MODEL_LEN}" + - name: SM_VLLM_TENSOR_PARALLEL_SIZE + value: "${TENSOR_PARALLEL_SIZE}" + - name: SM_VLLM_GPU_MEMORY_UTILIZATION + value: "${GPU_MEMORY_UTILIZATION}" + - name: SM_VLLM_DTYPE + value: "${DTYPE}" + # For Cosmos-Reason2 (Qwen3-VL), uncomment: + # - name: SM_VLLM_REASONING_PARSER + # value: "qwen3" + # NOTE: nodeAffinity cannot be used when instanceType is set — the operator + # handles scheduling. Deep-health-check affinity is applied automatically. + # TLS for the operator-managed ALB. Bucket is created at operator install time. + # Override TLS_CERT_S3_URI in env_vars. 
+ tlsConfig: + tlsCertificateOutputS3Uri: ${TLS_CERT_S3_URI} + loadBalancer: + healthCheckPath: /health diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/hf-token-secret.yaml.example b/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/hf-token-secret.yaml.example new file mode 100644 index 000000000..0995538d8 --- /dev/null +++ b/3.test_cases/pytorch/vllm/cosmos-reason/hyperpod-eks/hf-token-secret.yaml.example @@ -0,0 +1,19 @@ +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +# +# Reference manifest. Recommended approach is `kubectl create secret`: +# +# kubectl create secret generic hf-token \ +# --namespace=${NAMESPACE} \ +# --from-literal=token=${HF_TOKEN} +# +# Use this manifest only with sealed-secret / external-secret-operator. +--- +apiVersion: v1 +kind: Secret +metadata: + name: hf-token + namespace: ${NAMESPACE} +type: Opaque +stringData: + token: REPLACE_WITH_HF_TOKEN diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/README.md b/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/README.md new file mode 100644 index 000000000..3bce73e6e --- /dev/null +++ b/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/README.md @@ -0,0 +1,99 @@ +# Cosmos Reason on Vanilla EKS + +Plain Kubernetes `Deployment` + `Service` for any EKS cluster with GPU nodes. +Uses the upstream `vllm/vllm-openai` container image directly. + +If you have HyperPod EKS and want managed scale-to-zero, KV cache, and intelligent +routing, use the [`../hyperpod-eks/`](../hyperpod-eks/) path instead. 
+ +## What's here + +| File | Purpose | +|------|---------| +| `deployment.yaml` | `Deployment` (pod spec) + `Service` (ClusterIP) | +| `hf-token-secret.yaml.example` | Reference manifest — recommended path is `kubectl create secret` rather than apply (so the token never lands in version control) | + +## Prerequisites + +- EKS cluster with at least one GPU node +- NVIDIA device plugin DaemonSet installed (Karpenter usually handles this) +- `kubectl` configured with cluster context +- `envsubst` (provided by `gettext`) +- `HF_TOKEN` with access to the model — see [parent README](../README.md#prerequisites) + +## Deploy + +```bash +# 1. Source environment +cd .. +cp env_vars.example env_vars +# Edit env_vars — set HF_TOKEN at minimum +source env_vars + +# 2. Create the HF token Secret +kubectl create secret generic hf-token \ + --namespace=${NAMESPACE} \ + --from-literal=token=${HF_TOKEN} + +# 3. Render and apply the manifests +cd kubernetes/ +envsubst < deployment.yaml | kubectl apply -f - + +# 4. Wait for the Pod to become Ready +kubectl wait --for=condition=Ready pod \ + -l app=cosmos-reason \ + --namespace=${NAMESPACE} \ + --timeout=10m + +# 5. Verify +kubectl logs -l app=cosmos-reason --tail=20 +``` + +First-launch is **3-8 minutes** (image pull + HF model download + vLLM init). + +## Test + +```bash +# Port-forward to localhost +kubectl port-forward -n ${NAMESPACE} svc/cosmos-reason 8000:8000 & + +# Hit /health +curl -s http://localhost:8000/health + +# List the loaded model +curl -s http://localhost:8000/v1/models | jq + +# Try an example +# This asks the model "What is in this image, and what is happening? Reason about visible cues." +cd ../examples +python3 image_vqa.py --image sample.jpg --model ${MODEL_ID} +``` + +You can also customize the question: +```bash +python3 image_vqa.py --image sample.jpg --prompt "How many vehicles are visible and what types are they?" 
+``` + +Test with auto labeling use case: +```bash +python3 auto_label.py --image-dir ./scenes/ --output labels.jsonl --limit 5 +``` + +## Cleanup + +```bash +envsubst < deployment.yaml | kubectl delete -f - +kubectl delete secret hf-token -n ${NAMESPACE} +``` + +## Notes + +- The `Service` is `ClusterIP`. To expose externally, add an `Ingress` (ALB recommended) + or change to `LoadBalancer`. +- **Autoscaling:** No HPA is included by default. Inference is GPU-bound and CPU-based + scaling is not a useful proxy for queue depth. For production, configure + [KEDA](https://keda.sh/) on the vLLM Prometheus metric `vllm:num_requests_running`, + or pair with Karpenter for node-level scale-out. +- `/dev/shm` is mounted via `emptyDir { medium: Memory }` (per the + [`awsome-distributed-ai`](https://github.com/awslabs/awsome-distributed-ai) review + conventions — never `hostPath: /dev/shm`). diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/deployment.yaml b/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/deployment.yaml new file mode 100644 index 000000000..693730a52 --- /dev/null +++ b/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/deployment.yaml @@ -0,0 +1,147 @@ +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: cosmos-reason + namespace: ${NAMESPACE} + labels: + app: cosmos-reason +spec: + replicas: 1 + selector: + matchLabels: + app: cosmos-reason + strategy: + type: Recreate + template: + metadata: + labels: + app: cosmos-reason + spec: + restartPolicy: Always + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: node.kubernetes.io/instance-type + operator: In + values: + - ${INSTANCE_TYPE} + - ${HYPERPOD_INSTANCE_TYPE} + # Optional: prefer nodes that passed HyperPod deep-health checks. + # Harmless on plain EKS (label simply won't match). 
+ preferredDuringSchedulingIgnoredDuringExecution: + - weight: 100 + preference: + matchExpressions: + - key: sagemaker.amazonaws.com/deep-health-check-status + operator: In + values: + - Passed + tolerations: + # HyperPod nodes carry this taint; harmless on plain EKS. + - key: sagemaker.amazonaws.com/node-health-status + operator: Equal + value: Schedulable + effect: NoSchedule + containers: + - name: vllm + image: ${VLLM_IMAGE_VANILLA} + imagePullPolicy: IfNotPresent + args: + - "--model" + - "${MODEL_ID}" + - "--host" + - "0.0.0.0" + - "--port" + - "${INVOCATION_PORT}" + - "--max-model-len" + - "${MAX_MODEL_LEN}" + - "--tensor-parallel-size" + - "${TENSOR_PARALLEL_SIZE}" + - "--gpu-memory-utilization" + - "${GPU_MEMORY_UTILIZATION}" + - "--dtype" + - "${DTYPE}" + - "--limit-mm-per-prompt" + - '{"video":2,"image":10}' + - "--mm-processor-kwargs" + - '{"max_pixels":20000000,"fps":1.0}' + # For Cosmos-Reason2 (Qwen3-VL), uncomment the following two args: + # - "--reasoning-parser" + # - "qwen3" + # For Reason2 video reasoning, uncomment: + # - "--media-io-kwargs" + # - '{"video":{"num_frames":-1}}' + env: + - name: HF_TOKEN + valueFrom: + secretKeyRef: + name: hf-token + key: token + - name: HUGGING_FACE_HUB_TOKEN + valueFrom: + secretKeyRef: + name: hf-token + key: token + - name: VLLM_USAGE_SOURCE + value: "awsome-distributed-ai-cosmos-reason" + ports: + - containerPort: ${INVOCATION_PORT} + name: http + protocol: TCP + resources: + requests: + nvidia.com/gpu: ${TENSOR_PARALLEL_SIZE} + cpu: "4" + memory: "32Gi" + limits: + nvidia.com/gpu: ${TENSOR_PARALLEL_SIZE} + volumeMounts: + - name: shmem + mountPath: /dev/shm + - name: hf-cache + mountPath: /root/.cache/huggingface + readinessProbe: + httpGet: + path: /health + port: ${INVOCATION_PORT} + initialDelaySeconds: 240 + periodSeconds: 15 + timeoutSeconds: 5 + failureThreshold: 80 + livenessProbe: + httpGet: + path: /health + port: ${INVOCATION_PORT} + initialDelaySeconds: 900 + periodSeconds: 60 + timeoutSeconds: 
10 + volumes: + - name: shmem + emptyDir: + medium: Memory + sizeLimit: 8Gi + - name: hf-cache + emptyDir: + sizeLimit: 50Gi +--- +apiVersion: v1 +kind: Service +metadata: + name: cosmos-reason + namespace: ${NAMESPACE} + labels: + app: cosmos-reason +spec: + type: ClusterIP + selector: + app: cosmos-reason + ports: + - name: http + port: ${INVOCATION_PORT} + targetPort: ${INVOCATION_PORT} + protocol: TCP diff --git a/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/hf-token-secret.yaml.example b/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/hf-token-secret.yaml.example new file mode 100644 index 000000000..cb3427006 --- /dev/null +++ b/3.test_cases/pytorch/vllm/cosmos-reason/kubernetes/hf-token-secret.yaml.example @@ -0,0 +1,23 @@ +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# SPDX-License-Identifier: MIT-0 +# +# This is a REFERENCE manifest. The recommended way to create the secret is: +# +# kubectl create secret generic hf-token \ +# --namespace=${NAMESPACE} \ +# --from-literal=token=${HF_TOKEN} +# +# Using `kubectl create` keeps the token out of any version-controlled file. Apply this +# manifest only if your operational model requires GitOps-managed secrets — and in that +# case, replace the placeholder below with a sealed-secret / external-secret reference. +--- +apiVersion: v1 +kind: Secret +metadata: + name: hf-token + namespace: ${NAMESPACE} +type: Opaque +stringData: + # DO NOT commit this with a real token. Use a sealed-secret / external-secret-operator + # reference, or create the secret imperatively with `kubectl create secret`. + token: REPLACE_WITH_HF_TOKEN
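
Review note on Option 3 of the HyperPod README: it states that `image_vqa.py`, `video_qa.py`, and `auto_label.py` send plain HTTP requests and cannot sign with SigV4, and that batch labeling through SageMaker Runtime needs one `invoke-endpoint` call per image. A minimal sketch of that per-image call is below, assuming `boto3` is installed with working credentials and that the endpoint accepts the same OpenAI-style payload as the CLI example. The helper names `build_image_payload`, `invoke_chat`, and `main` are hypothetical and do not exist in this PR.

```python
import base64
import json
import os
from pathlib import Path

# Hypothetical helper names for illustration; not part of the scripts in this PR.


def build_image_payload(model: str, image_path: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style chat payload with the image inlined as a data URL."""
    suffix = Path(image_path).suffix.lstrip(".").lower() or "jpeg"
    if suffix == "jpg":
        suffix = "jpeg"
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "model": model,
        "messages": [{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/{suffix};base64,{b64}"}},
            {"type": "text", "text": prompt},
        ]}],
        "max_tokens": max_tokens,
    }


def invoke_chat(client, endpoint_name: str, payload: dict) -> dict:
    """Send one payload via SageMaker Runtime; the boto3 client signs the request."""
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload).encode("utf-8"),
    )
    # response["Body"] is a streaming object; read it fully, then decode the JSON.
    return json.loads(response["Body"].read())


def main(image_path: str) -> None:
    """Label a single image; assumes AWS_REGION and MODEL_ID are exported as in env_vars."""
    import boto3  # third-party dependency, imported lazily so the helpers stay importable

    runtime = boto3.client("sagemaker-runtime", region_name=os.environ["AWS_REGION"])
    payload = build_image_payload(os.environ["MODEL_ID"], image_path,
                                  "What is happening in this scene?")
    result = invoke_chat(runtime, "cosmos-reason", payload)
    print(result["choices"][0]["message"]["content"])
```

Calling `main("sample.jpg")` in a loop over a directory would approximate `auto_label.py` over the IAM-authenticated path. Passing the client into `invoke_chat` keeps the signing concern in one place and lets the helper be exercised with a stub. SageMaker Runtime limits request payload size, so large images may need downscaling before base64 encoding.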