diff --git a/README.md b/README.md
index ed3571b3..65933580 100644
--- a/README.md
+++ b/README.md
@@ -25,6 +25,7 @@
- [Quickstart](#quickstart)
- [Generator with Diffusers](#generator-with-diffusers)
- [Generator with vLLM-Omni](#generator-with-vllm-omni)
+ - [Generator with SGLang](#generator-with-sglang)
- [Reasoner with Transformers](#reasoner-with-transformers)
- [Reasoner with vLLM](#reasoner-with-vllm)
- [Troubleshooting](#troubleshooting)
@@ -413,6 +414,169 @@ References:
+#### Generator with SGLang
+
+
+Expand SGLang generator setup, endpoints, and request reference
+
+Use SGLang Diffusion for native Cosmos 3 visual generation behind OpenAI-compatible image and video APIs. Cosmos 3 also includes video-with-sound and action/policy models; this SGLang section focuses on the currently supported text-to-image, text-to-video, and image-to-video generator serving paths.
+
+Supported checkpoints:
+
+| Model | Status | Notes |
+| --- | --- | --- |
+| `nvidia/Cosmos3-Nano` | Supported | Text-to-image, text-to-video, image-to-video |
+| `nvidia/Cosmos3-Super` | Supported | Use multiple GPUs for the 64B checkpoint |
+| `nvidia/Cosmos3-Super-Text2Image` | Supported | Text-to-image specialized checkpoint |
+| `nvidia/Cosmos3-Super-Image2Video` | Supported | Image-to-video specialized checkpoint |
+| `nvidia/Cosmos3-Nano-Policy-DROID` | Supported | Action/policy checkpoint |
+
+Install SGLang from the main branch with diffusion extras:
+
+```shell
+git clone --branch main https://github.com/sgl-project/sglang.git
+cd sglang
+python -m venv .venv
+source .venv/bin/activate
+python -m pip install --upgrade pip
+pip install -e "python[diffusion]"
+pip install "cosmos-guardrail==0.3.1"
+```
+
+> **Version note:** Cosmos 3 support in SGLang Diffusion currently requires the SGLang main branch. Switch to a stable SGLang release once Cosmos 3 support is included there.
+
+Start a Nano server:
+
+```shell
+sglang serve --model-path nvidia/Cosmos3-Nano
+```
+
+For a video-specialized checkpoint, use `Cosmos3-Super-Image2Video` with multiple GPUs:
+
+```shell
+sglang serve \
+ --model-path nvidia/Cosmos3-Super-Image2Video \
+ --num-gpus 4
+```
+
+This is the performance-mode setup. If it runs out of memory, switch to SGLang Diffusion's memory preset:
+
+```shell
+sglang serve \
+ --model-path nvidia/Cosmos3-Super-Image2Video \
+ --num-gpus 4 \
+ --performance-mode memory
+```
+
+Vision endpoints:
+
+| Mode | Endpoint | Notes |
+| --- | --- | --- |
+| Text to image | `POST /v1/images/generations` | Returns base64 by default for Cosmos 3 |
+| Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` |
+| Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` |
+| Video to video | `POST /v1/videos` | Upload the conditioning video with `video_reference` and choose which frames stay as clean conditioning |
+| Video with sound | `POST /v1/videos` | Add `generate_sound=true` to produce a soundtrack alongside the video |
+
+Action modes use Cosmos 3 as a world model: they condition on an embodiment (`domain_name`) and exchange video and action sequences. Policy and inverse dynamics return a predicted action chunk alongside the video; read the action data from the completed job result. Forward dynamics returns only video.
+
+| Mode | `action_mode` | Input | Output |
+| --- | --- | --- | --- |
+| Policy | `policy` | Image + instruction | Video + predicted action chunk |
+| Inverse dynamics | `inverse_dynamics` | Video + instruction | Video + predicted action chunk |
+| Forward dynamics | `forward_dynamics` | Image + action chunk | Video |
+
+Pass embodiment settings through `extra_params`: `action_mode`, `domain_name` (for example `bridge_orig_lerobot`, `av`, or `camera_pose`), `raw_action_dim`, and optionally `action_view_point`. SGLang derives the action chunk length from `num_frames - 1`, so set `num_frames` to `action_chunk_size + 1`.
+For forward dynamics, pass the action trajectory directly in `extra_params["action"]` as a JSON array of shape `[action_chunk_size, raw_action_dim]`. SGLang does not use `action_path` for HTTP requests, so no `--allowed-local-media-path` setup is needed for action files.
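
The `num_frames` rule and the `extra_params` layout described above can be sketched in Python. This is a minimal sketch under stated assumptions, not SGLang's client API: `build_forward_dynamics_fields` is a hypothetical helper, and it uses only the field names listed in this section.

```python
import json


def build_forward_dynamics_fields(prompt, domain_name, actions, seed=0):
    """Build form fields for a forward-dynamics request (hypothetical helper).

    `actions` is a nested list of shape [action_chunk_size, raw_action_dim].
    """
    if not actions or not actions[0] or any(len(row) != len(actions[0]) for row in actions):
        raise ValueError("actions must be a non-empty rectangular 2D list")

    action_chunk_size = len(actions)
    raw_action_dim = len(actions[0])
    extra_params = {
        "action_mode": "forward_dynamics",
        "domain_name": domain_name,
        "raw_action_dim": raw_action_dim,
        "action": actions,
    }
    return {
        "prompt": prompt,
        # The chunk length is derived from num_frames - 1, so add one frame.
        "num_frames": str(action_chunk_size + 1),
        "seed": str(seed),
        "extra_params": json.dumps(extra_params),
    }
```

For a trajectory with 60 steps of 9 values each, the helper sets `num_frames` to `"61"`. The returned fields would be sent as multipart form data to `POST /v1/videos`, together with the conditioning image in `input_reference`.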
+
+Text-to-video example:
+
+```shell
+# Submit an async video generation job and capture its ID.
+job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
+ --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
+ --form-string "negative_prompt=blurry, distorted, low quality" \
+ --form-string "size=1280x720" \
+ --form-string "num_frames=81" \
+ --form-string "fps=24" \
+ --form-string "num_inference_steps=35" \
+ --form-string "guidance_scale=4.0" \
+ --form-string "flow_shift=10.0" \
+ --form-string "seed=42" \
+ --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
+ | jq -r .id)
+
+# Poll until the job completes. Cosmos 3 video generation can take several minutes.
+status=""
+until [ "$status" = "completed" ]; do
+ status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" | jq -r .status)
+ [ "$status" = "failed" ] && exit 1
+ sleep 5
+done
+
+# Download the completed MP4.
+curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
+ -o cosmos3_t2v_output.mp4
+```
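
The shell loop above polls until the job reports `completed` or `failed`, and would keep running if neither status ever arrives. A Python sketch of the same polling pattern with an overall timeout might look like the following. The function names, the 30-minute default timeout, and the injected `sleep`/`clock` parameters are assumptions made for illustration; only the endpoint path and the two terminal status values come from this section.

```python
import json
import time
import urllib.request


def fetch_status(base_url, job_id):
    """Return the `status` field from GET /v1/videos/{id}."""
    with urllib.request.urlopen(f"{base_url}/v1/videos/{job_id}") as resp:
        return json.load(resp)["status"]


def wait_for_video(get_status, timeout_s=1800.0, interval_s=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll `get_status()` until it returns "completed".

    Raises RuntimeError on "failed" and TimeoutError once `timeout_s` elapses.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "completed":
            return status
        if status == "failed":
            raise RuntimeError("video job failed")
        if clock() >= deadline:
            raise TimeoutError(f"video job still {status!r} after {timeout_s} s")
        sleep(interval_s)
```

A typical call would be `wait_for_video(lambda: fetch_status("http://localhost:30000", job_id))`, followed by a download of `/v1/videos/{job_id}/content`.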
+
+Text-to-image example:
+
+```shell
+curl -sS -X POST http://localhost:30000/v1/images/generations \
+ -H "Content-Type: application/json" \
+ -d '{
+ "prompt": "A warehouse robot folds a blue cloth on a clean workbench.",
+ "size": "1280x720",
+ "n": 1,
+ "num_inference_steps": 35,
+ "guidance_scale": 6.0,
+ "flow_shift": 10.0,
+ "seed": 0,
+ "extra_args": {
+ "use_resolution_template": false,
+ "guardrails": true
+ }
+ }'
+```
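
Because the image endpoint returns base64 by default, the response has to be decoded before it can be viewed. The sketch below assumes the response body follows the OpenAI Images layout (`data[0].b64_json`); that layout and the helper name `decode_first_image` are assumptions, so check them against the actual server response.

```python
import base64


def decode_first_image(response):
    """Return the raw bytes of the first image in an images response.

    Assumes an OpenAI-style body: {"data": [{"b64_json": "<base64>"}]}.
    """
    b64_payload = response["data"][0]["b64_json"]
    return base64.b64decode(b64_payload)
```

With the parsed JSON body in `response`, `pathlib.Path("cosmos3_t2i_output.png").write_bytes(decode_first_image(response))` would save the image to disk. The output filename and the PNG extension are illustrative.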
+
+Video-to-video-with-sound example:
+
+```shell
+job_id=$(curl -sS --fail-with-body -X POST "http://localhost:30000/v1/videos" \
+ -H "Accept: application/json" \
+ --form-string 'prompt=A small warehouse robot moves a blue box across a clean floor.' \
+ --form-string 'negative_prompt=blurry, distorted, low quality' \
+ --form-string 'size=1280x720' \
+ --form-string 'num_frames=61' \
+ --form-string 'fps=24' \
+ --form-string 'num_inference_steps=30' \
+ --form-string 'guidance_scale=4.0' \
+ --form-string 'flow_shift=10.0' \
+ --form-string 'seed=1234' \
+ --form-string 'generate_sound=true' \
+ --form-string 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true}' \
+ -F 'video_reference=@/path/to/video.mp4;type=video/mp4' \
+ | jq -r .id)
+
+# Poll until the job completes. Cosmos 3 video generation can take several minutes.
+status=""
+until [ "$status" = "completed" ]; do
+ status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" | jq -r .status)
+ [ "$status" = "failed" ] && exit 1
+ sleep 5
+done
+
+# Download the completed MP4.
+curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
+ -o cosmos3_v2vs_output.mp4
+```
+
+SGLang accepts Cosmos 3 request options including `max_sequence_length`, `flow_shift`, `extra_params.guardrails`, `extra_params.use_resolution_template`, and `extra_params.use_duration_template`. Guardrails are enabled by default when `cosmos-guardrail` is installed; set `SGLANG_DISABLE_COSMOS3_GUARDRAILS=1` before starting the server to skip loading the guardrail models.
+
+For complete serving instructions and request examples, see the [Cosmos3 SGLang cookbook](https://docs.sglang.io/cookbook/diffusion/Cosmos/Cosmos3).
+
+
+
#### Reasoner with Transformers
Coming soon!
@@ -497,10 +661,13 @@ We are building examples that show Cosmos 3 capabilities end to end, including w
| Generator (audiovisual) with Diffusers | Generator | Text-to-image, plus text-to-video and image-to-video each with or without synchronized sound, via `Cosmos3OmniPipeline`. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) | [](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) |
| Generator (audiovisual) with Cosmos Framework | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) | [](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) |
| Generator (audiovisual) with vLLM-Omni | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) | [](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) |
+| Generator (audiovisual) with SGLang | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, against an OpenAI-compatible SGLang server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_sglang.ipynb) | [](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_sglang.ipynb) |
| Forward dynamics with Cosmos Framework | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) | [](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) |
| Forward dynamics with vLLM-Omni | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) | [](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) |
+| Forward dynamics with SGLang | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, against an OpenAI-compatible SGLang server. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_sglang.ipynb) | [](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_sglang.ipynb) |
| Inverse dynamics with Cosmos Framework | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) | [](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) |
| Inverse dynamics with vLLM-Omni | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb) | [](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb) |
+| Inverse dynamics with SGLang | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, against an OpenAI-compatible SGLang server. | [Notebook](cookbooks/cosmos3/generator/action/run_id_with_sglang.ipynb) | [](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_sglang.ipynb) |
| Reasoner with Cosmos Framework | Reasoner | Text and image reasoning: detailed captioning, robot task planning, 2D grounding, describe-anything, and action-trajectory prompts, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb) | [](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb) |
| Reasoner with vLLM | Reasoner | Image and video reasoning: captioning, temporal localization, embodied reasoning, common-sense reasoning, 2D grounding, describe-anything, action CoT, driving scenes, physical-plausibility, and situation understanding, against an OpenAI-compatible vLLM server (Cosmos3-Super on 4 GPUs by default; switch to Nano per the cookbook README). | [Notebook](cookbooks/cosmos3/reasoner/run_with_vllm.ipynb) | [](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_vllm.ipynb) |
diff --git a/cookbooks/cosmos3/generator/action/run_fd_with_sglang.ipynb b/cookbooks/cosmos3/generator/action/run_fd_with_sglang.ipynb
new file mode 100644
index 00000000..3c8ca66e
--- /dev/null
+++ b/cookbooks/cosmos3/generator/action/run_fd_with_sglang.ipynb
@@ -0,0 +1,1317 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "license-header",
+ "metadata": {},
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fdvl-title",
+ "metadata": {},
+ "source": [
+ "# Cosmos3 Nano Action: Forward Dynamics with SGLang\n",
+ "\n",
+ "This notebook runs Cosmos3 Nano **action forward-dynamics** inference through the SGLang OpenAI-compatible video API:\n",
+ "\n",
+ "```text\n",
+ "POST /v1/videos\n",
+ "```\n",
+ "\n",
+ "Forward dynamics predicts future visual observations from an initial image and an action trajectory. This notebook contains separate AV and robotics sections that each build their own input spec, run inference, and visualize generated videos.\n",
+ "\n",
+ "Start the SGLang server:\n",
+ "\n",
+ "```bash\n",
+ "docker rm -f cosmos3-sglang-notebook 2>/dev/null || true\n",
+ "\n",
+ "docker run -d --name cosmos3-sglang-notebook \\\n",
+ " --runtime nvidia --gpus '\"device=0\"' \\\n",
+ " -e CUDA_DEVICE_ORDER=PCI_BUS_ID \\\n",
+ " -v \"$HOME/.cache/huggingface:/root/.cache/huggingface\" \\\n",
+ " -v \"$PWD:/workspace\" \\\n",
+ " -p 30000:30000 --ipc=host \\\n",
+ " lmsysorg/sglang:dev \\\n",
+ " sglang serve \\\n",
+ " --model-path nvidia/Cosmos3-Nano\n",
+ "\n",
+ "# Wait until this returns model metadata before running the inference cell.\n",
+ "curl http://localhost:30000/v1/models\n",
+ "```\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fdvl-vars-md",
+ "metadata": {},
+ "source": [
+ "## Configure Notebook Variables\n",
+ "\n",
+ "Run this cell after the SGLang server is available. It resolves local input/output paths and stores generated outputs under `outputs/cosmos3_action_sglang/` by default.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fdvl-vars-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pathlib import Path\n",
+ "import os\n",
+ "\n",
+ "\n",
+ "def find_repo_root(start: Path) -> Path:\n",
+ " for path in [start, *start.parents]:\n",
+ " if (path / \"README.md\").exists() and (path / \"cookbooks\").exists():\n",
+ " return path\n",
+ "\n",
+ " return start\n",
+ "\n",
+ "\n",
+ "COSMOS_ROOT = find_repo_root(Path.cwd().resolve())\n",
+ "COSMOS3_REPO = Path(os.environ.get(\"COSMOS3_REPO\", COSMOS_ROOT / \"packages\" / \"cosmos3\")).resolve()\n",
+ "COSMOS3_OUTPUT_ROOT = Path(\n",
+ " os.environ.get(\"COSMOS3_SGLANG_OUTPUT_ROOT\", COSMOS_ROOT / \"outputs\" / \"cosmos3_action_sglang\")\n",
+ ").resolve()\n",
+ "COSMOS3_INPUT_DIR = COSMOS3_OUTPUT_ROOT / \"inputs\"\n",
+ "SGLANG_BASE_URL = os.environ.get(\"COSMOS3_SGLANG_BASE_URL\", \"http://localhost:30000\").rstrip(\"/\")\n",
+ "\n",
+ "\n",
+ "def resolve_input(rel_path: str) -> str:\n",
+ " path = (COSMOS_ROOT / rel_path).resolve()\n",
+ " assert path.exists(), f\"missing input: {path}\"\n",
+ " return str(path)\n",
+ "\n",
+ "\n",
+ "COSMOS3_OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)\n",
+ "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
+ "\n",
+ "print(\"COSMOS_ROOT:\", COSMOS_ROOT)\n",
+ "print(\"COSMOS3_REPO:\", COSMOS3_REPO)\n",
+ "print(\"COSMOS3_INPUT_DIR:\", COSMOS3_INPUT_DIR)\n",
+ "print(\"COSMOS3_OUTPUT_ROOT:\", COSMOS3_OUTPUT_ROOT)\n",
+ "print(\"COSMOS3_SGLANG_BASE_URL:\", SGLANG_BASE_URL)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fdvl-av-md",
+ "metadata": {},
+ "source": [
+ "## AV\n",
+ "\n",
+ "In this example, we provide a set of ego poses for an autonomous vehicle and a start image to generate driving videos with Cosmos3-Nano.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fdvl-av-spec-md",
+ "metadata": {},
+ "source": [
+ "### Create the AV Forward-Dynamics Input Spec\n",
+ "\n",
+ "AV forward-dynamics inference is driven by a JSONL spec, one line per run. Each line shares the same start frame (`vision_path`) but uses a different ego trajectory (`action_path`), so we get one generated video per trajectory.\n",
+ "\n",
+ "The action input is a JSON file, which can be generated from camera poses (camera-to-world transforms, OpenCV convention, units in meters) via `pose_abs_to_rel`:\n",
+ "\n",
+ "```python\n",
+ "import json\n",
+ "import sys\n",
+ "\n",
+ "import numpy as np\n",
+ "\n",
+ "# COSMOS3_REPO comes from the variables cell.\n",
+ "if str(COSMOS3_REPO) not in sys.path:\n",
+ "    sys.path.insert(0, str(COSMOS3_REPO))\n",
+ "from cosmos_framework.data.vfm.action.pose_utils import pose_abs_to_rel\n",
+ "\n",
+ "poses_abs = np.array([...])  # [T, 4, 4], camera-to-world transforms, OpenCV convention, meters\n",
+ "poses_rel = pose_abs_to_rel(\n",
+ "    poses_abs,\n",
+ "    rotation_format=\"rot6d\",\n",
+ "    pose_convention=\"backward_framewise\",\n",
+ "    translation_scale=1.35,\n",
+ ")  # [T-1, 9]: translation (3) + rot6d (6), framewise relative transforms\n",
+ "\n",
+ "with open(\"custom_traj.json\", \"w\") as f:\n",
+ "    # Convert to nested lists first: NumPy arrays are not JSON-serializable.\n",
+ "    json.dump(np.asarray(poses_rel).tolist(), f)\n",
+ "```\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fdvl-av-spec-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# `resolve_input` and the COSMOS3_* paths come from the variables cell.\n",
+ "import json\n",
+ "\n",
+ "# Local AV inputs, relative to the cosmos repo root.\n",
+ "av_input_image = \"cookbooks/cosmos3/generator/action/assets/images/av_0.jpg\"\n",
+ "av_input_actions = {\n",
+ " \"av_forward\": \"cookbooks/cosmos3/generator/action/assets/actions/av_traj_forward.json\",\n",
+ " \"av_left\": \"cookbooks/cosmos3/generator/action/assets/actions/av_traj_left.json\",\n",
+ " \"av_right\": \"cookbooks/cosmos3/generator/action/assets/actions/av_traj_right.json\",\n",
+ "}\n",
+ "\n",
+ "av_vision_path = resolve_input(av_input_image)\n",
+ "av_records = [\n",
+ " {\n",
+ " \"action_chunk_size\": 60,\n",
+ " \"action_path\": resolve_input(action_rel),\n",
+ " \"domain_name\": \"av\",\n",
+ " \"fps\": 10,\n",
+ " \"image_size\": 480,\n",
+ " \"view_point\": \"ego_view\",\n",
+ " \"model_mode\": \"forward_dynamics\",\n",
+ " \"name\": name,\n",
+ " \"prompt\": \"You are an autonomous vehicle planning system.\",\n",
+ " \"seed\": 0,\n",
+ " \"vision_path\": av_vision_path,\n",
+ " }\n",
+ " for name, action_rel in av_input_actions.items()\n",
+ "]\n",
+ "\n",
+ "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
+ "av_fd_input_path = COSMOS3_INPUT_DIR / \"action_forward_dynamics_av_custom.jsonl\"\n",
+ "av_fd_input_path.write_text(\"\".join(json.dumps(r) + \"\\n\" for r in av_records))\n",
+ "av_fd_output_dir = COSMOS3_OUTPUT_ROOT / \"action_forward_dynamics_av_custom\"\n",
+ "\n",
+ "os.environ[\"COSMOS3_AV_FD_INPUT\"] = str(av_fd_input_path)\n",
+ "os.environ[\"COSMOS3_AV_FD_OUTPUT\"] = str(av_fd_output_dir)\n",
+ "\n",
+ "print(\"wrote AV spec:\", av_fd_input_path)\n",
+ "print(\"AV runs:\", list(av_input_actions))\n",
+ "print(av_fd_input_path.read_text())\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fdvl-av-traj-md",
+ "metadata": {},
+ "source": [
+ "### Visualize AV Input Trajectories\n",
+ "\n",
+ "Before generating any video, plot each input ego trajectory as a 3D camera path with frustums and a top-down bird's-eye view.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fdvl-av-traj-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import sys\n",
+ "import json\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "from matplotlib.collections import LineCollection\n",
+ "from mpl_toolkits.mplot3d.art3d import Line3DCollection\n",
+ "import os\n",
+ "\n",
+ "# The notebook kernel may differ from the framework venv, so put the repo on the\n",
+ "# path before importing `cosmos_framework`. When COSMOS3_FRAMEWORK_PATH is unset,\n",
+ "# fall back to COSMOS3_REPO from the variables cell instead of inserting \"None\".\n",
+ "COSMOS3_FRAMEWORK_PATH = os.environ.get(\"COSMOS3_FRAMEWORK_PATH\", str(COSMOS3_REPO))\n",
+ "if COSMOS3_FRAMEWORK_PATH not in sys.path:\n",
+ "    sys.path.insert(0, COSMOS3_FRAMEWORK_PATH)\n",
+ "from cosmos_framework.data.vfm.action.pose_utils import pose_rel_to_abs\n",
+ "\n",
+ "# frustum: apex + image-rectangle corners (camera +Z forward), and their edges\n",
+ "_FRUSTUM = np.array([[0, 0, 0], [-1, -1, 1], [1, -1, 1], [1, 1, 1], [-1, 1, 1]], float)\n",
+ "_EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4), (4, 1)]\n",
+ "\n",
+ "\n",
+ "def visualize_pose(poses_abs, *, n_frustums=20, scale_frac=0.03, aspect=16 / 9,\n",
+ " fov_deg=60.0, vertical_exaggeration=1.0, cmap=\"turbo\",\n",
+ " title=None, save_path=None, show=True):\n",
+ " \"\"\"3D camera trajectory (with frustums) + a top-down bird's-eye view.\"\"\"\n",
+ " poses_abs = np.asarray(poses_abs)\n",
+ " pos = poses_abs[:, :3, 3]\n",
+ " fwd = poses_abs[:, :3, 2]\n",
+ " T = len(pos)\n",
+ " colors = plt.get_cmap(cmap)(np.arange(T) / max(T - 1, 1))\n",
+ " scale = max(np.ptp(pos, axis=0).max() * scale_frac, 1e-3)\n",
+ " step = max(1, T // max(n_frustums, 1))\n",
+ " xzy = [0, 2, 1]\n",
+ "\n",
+ " fig = plt.figure(figsize=(14, 6))\n",
+ "\n",
+ " ax = fig.add_subplot(1, 2, 1, projection=\"3d\")\n",
+ " path = pos[:, xzy]\n",
+ " ax.plot(*path.T, color=\"0.6\", lw=1.0, alpha=0.7)\n",
+ " lines, lcolors, allpts = [], [], [path]\n",
+ " for i in range(0, T, step):\n",
+ " cw = ((_FRUSTUM * [aspect, 1, 1] * scale * np.tan(np.radians(fov_deg) / 2))\n",
+ " @ poses_abs[i, :3, :3].T + poses_abs[i, :3, 3])[:, xzy]\n",
+ " allpts.append(cw)\n",
+ " lines += [[cw[a], cw[b]] for a, b in _EDGES]\n",
+ " lcolors += [colors[i]] * len(_EDGES)\n",
+ " ax.add_collection3d(Line3DCollection(lines, colors=lcolors, linewidths=1.2))\n",
+ " ax.scatter(*path[0], color=\"lime\", s=80, edgecolor=\"k\", label=\"first frame\", zorder=5)\n",
+ " ax.scatter(*path[-1], color=\"red\", s=80, edgecolor=\"k\", label=\"last frame\", zorder=5)\n",
+ " rng = np.clip(np.ptp(np.concatenate(allpts), axis=0), 1e-9, None)\n",
+ " ax.set_box_aspect((rng[0], rng[1], rng[2] * vertical_exaggeration))\n",
+ " ax.set_xlabel(\"X (m)\", labelpad=12)\n",
+ " ax.set_ylabel(\"Z forward (m)\", labelpad=12)\n",
+ " ax.set_zlabel(\"Y up (m)\", labelpad=10)\n",
+ " ax.set_zticks([])\n",
+ " ax.set_title(title or f\"Camera trajectory + frustums ({T} frames)\")\n",
+ " ax.legend(loc=\"upper left\")\n",
+ " ax.view_init(elev=22, azim=-70)\n",
+ "\n",
+ " ax2 = fig.add_subplot(1, 2, 2)\n",
+ " seg = np.stack([pos[:-1, [0, 2]], pos[1:, [0, 2]]], axis=1)\n",
+ " lc = LineCollection(seg, cmap=cmap, norm=plt.Normalize(0, T - 1), linewidth=2.5)\n",
+ " lc.set_array(np.arange(T - 1))\n",
+ " ax2.add_collection(lc)\n",
+ " ax2.quiver(pos[::step, 0], pos[::step, 2], fwd[::step, 0], fwd[::step, 2],\n",
+ " color=colors[::step], angles=\"xy\", width=0.005, scale=22, zorder=3)\n",
+ " ax2.scatter(*pos[0, [0, 2]], color=\"lime\", s=80, edgecolor=\"k\", label=\"first frame\", zorder=5)\n",
+ " ax2.scatter(*pos[-1, [0, 2]], color=\"red\", s=80, edgecolor=\"k\", label=\"last frame\", zorder=5)\n",
+ " ax2.set_xlabel(\"X (m)\")\n",
+ " ax2.set_ylabel(\"Z forward (m)\")\n",
+ " ax2.set_title(\"Top-down (bird's-eye view)\")\n",
+ " ax2.set_aspect(\"equal\", adjustable=\"datalim\")\n",
+ " ax2.autoscale_view()\n",
+ " ax2.legend()\n",
+ " fig.colorbar(lc, ax=ax2, label=\"frame index\")\n",
+ "\n",
+ " plt.tight_layout(w_pad=6)\n",
+ " if save_path:\n",
+ " fig.savefig(save_path, dpi=120, bbox_inches=\"tight\")\n",
+ " print(\"saved\", save_path)\n",
+ " if show:\n",
+ " plt.show()\n",
+ "\n",
+ "\n",
+ "for record in av_records:\n",
+ " name = record[\"name\"]\n",
+ " with open(record[\"action_path\"]) as f:\n",
+ " poses_rel = np.array(json.load(f))\n",
+ "\n",
+ " # AV action convention: rot6d rotation, backward_framewise, translation_scale = 1.35.\n",
+ " poses_abs = pose_rel_to_abs(\n",
+ " poses_rel,\n",
+ " rotation_format=\"rot6d\",\n",
+ " pose_convention=\"backward_framewise\",\n",
+ " translation_scale=1.35,\n",
+ " )\n",
+ " print(name, poses_rel.shape, poses_abs.shape)\n",
+ " visualize_pose(poses_abs, title=f\"{name}: camera trajectory + frustums ({len(poses_abs)} frames)\", show=True)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fdvl-av-run-md",
+ "metadata": {},
+ "source": [
+ "### Run AV Forward-Dynamics Inference\n",
+ "\n",
+ "Runs `Cosmos3-Nano` on every line of the AV spec through SGLang. Each run writes its video to:\n",
+ "\n",
+ "```text\n",
+ "