nvidia-cosmos · Coco-Ben · Aug 16, 2025
diff --git a/README.md b/README.md
@@ -4,8 +4,9 @@
 
 ### [Paper](https://arxiv.org/abs/2503.15558) | [Website](https://research.nvidia.com/labs/dir/cosmos-reason1/) | [HuggingFace](https://huggingface.co/collections/nvidia/cosmos-reason1-67c9e926206426008f1da1b7)
 
-NVIDIA Cosmos Reason – an open, customizable, 7B-parameter reasoning vision language model (VLM) for physical AI and robotics - enables robots and vision AI agents to reason like humans, using prior knowledge, physics understanding and common sense to understand and act in the real world. This model understands space, time, and fundamental physics, and can serve as a planning model to reason what steps an embodied agent might take next.
-Cosmos Reason excels at navigating the long tail of diverse scenarios of the physical world with spatial-temporal understanding. Cosmos Reason is post-trained with physical common sense and embodied reasoning data with supervised fine-tuning and reinforcement learning. It uses chain-of-thought reasoning capabilities to understand world dynamics without human annotations.
+NVIDIA Cosmos-Reason1 is an [open](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license), customizable, 7B-parameter reasoning vision language model (VLM) for physical AI and robotics. It enables robots and vision AI agents to reason like humans, using prior knowledge, physics understanding, and common sense to understand and act in the real world. This model understands space, time, and fundamental physics, and can serve as a planning model to reason about what steps an embodied agent might take next.
+
+Cosmos-Reason1 excels at navigating the long tail of diverse physical world scenarios with spatial-temporal understanding. The Cosmos-Reason1 model is post-trained on physical common sense and embodied reasoning data using supervised fine-tuning and reinforcement learning. It uses chain-of-thought reasoning capabilities to understand world dynamics without human annotations.
 
 ## News
 
@@ -18,46 +19,49 @@ Cosmos Reason excels at navigating the long tail of diverse scenarios of the phy
 
 * [Cosmos-Reason1-7B](https://huggingface.co/nvidia/Cosmos-Reason1-7B)
 
+## Minimum Hardware Requirements
+
+* Inference: 1 GPU with 24GB of memory
+
+* Post-training: 4 GPUs with 80GB of memory
+
 ## Setup
 
-Install system dependencies:
+1. Install system dependencies:
 
-* [pkgx](https://github.com/pkgxdev/pkgx?tab=readme-ov-file#quickstart)
+   * [pkgx](https://github.com/pkgxdev/pkgx?tab=readme-ov-file#quickstart)
 
-  ```shell
-  brew install pkgx || curl https://pkgx.sh | sh
-  ```
+     ```shell
+     brew install pkgx || curl https://pkgx.sh | sh
+     ```
 
-* [uv](https://docs.astral.sh/uv/getting-started/installation/)
+   * [uv](https://docs.astral.sh/uv/getting-started/installation/)
 
-  ```shell
-  curl -LsSf https://astral.sh/uv/install.sh | sh
-  source $HOME/.local/bin/env
-  ```
+     ```shell
+     curl -LsSf https://astral.sh/uv/install.sh | sh
+     source $HOME/.local/bin/env
+     ```
 
-* [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli)
+   * [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli)
 
-  ```shell
-  uv tool install -U "huggingface_hub[cli]"
-  hf auth login
-  ```
+     ```shell
+     uv tool install -U "huggingface_hub[cli]"
+     hf auth login
+     ```
 
-Clone the repository:
+2. Clone the repository:
 
-```shell
-git clone https://github.com/nvidia-cosmos/cosmos-reason1.git
-cd cosmos-reason1
-```
+   ```shell
+   git clone https://github.com/nvidia-cosmos/cosmos-reason1.git
+   cd cosmos-reason1
+   ```
 
 ## Inference
 
-Minimum Requirements:
-
-* 1 GPU with 24GB memory
 
 Cosmos-Reason1 is included in [`transformers>=4.51.3`](https://huggingface.co/docs/transformers/en/index).
 
-We provide example inference scripts:
+Cosmos-Reason1 provides the following example inference scripts:
 
 * [Minimal example](scripts/inference_sample.py)
 
@@ -67,29 +71,29 @@ We provide example inference scripts:
 
 * [Full example](scripts/inference.py)
 
-  Caption the video:
+  Caption the sample video:
 
   ```shell
   ./scripts/inference.py --prompt prompts/caption.yaml --videos assets/sample.mp4 -v
   ```
 
-  Ask a question about the video with reasoning:
+  Ask a question about the sample video with reasoning:
 
   ```shell
   ./scripts/inference.py --prompt prompts/question.yaml --question 'What are the potential safety hazards?' --reasoning --videos assets/sample.mp4 -v
   ```
 
-  Temporally caption the video and save the input frames to `outputs/temporal_caption_text` for debugging:
+  Temporally caption the sample video and save the input frames to `outputs/temporal_caption_text` for debugging:
 
   ```shell
   ./scripts/inference.py --prompt prompts/temporal_caption_text.yaml --videos assets/sample.mp4 --timestamp -v -o outputs/temporal_caption_text
   ```
 
-  Configure inference by editing:
+You can configure inference by editing the prompt and configuration files:
 
-  * [Prompts](prompts/README.md)
-  * [Sampling Parameters](configs/sampling_params.yaml)
-  * [Vision Processor Config](configs/vision_config.yaml)
+* [Prompts](prompts/README.md)
+* [Sampling Parameters](configs/sampling_params.yaml)
+* [Vision Processor Config](configs/vision_config.yaml)
 
 ## Tutorials
 
@@ -101,7 +105,7 @@ We provide example inference scripts:
 
 ## Post-Training
 
-The [nvidia-cosmos/cosmos-rl](https://github.com/nvidia-cosmos/cosmos-rl) repository is an async post-training framework specialized for Supervised Fine-Tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). It prioritizes performance, scalability, and fault tolerance.
+Cosmos-Reason1 uses the [nvidia-cosmos/cosmos-rl](https://github.com/nvidia-cosmos/cosmos-rl) repository for post-training. cosmos-rl is an async post-training framework specialized for Supervised Fine-Tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). It prioritizes performance, scalability, and fault tolerance.
 
 To support a custom dataset format, use the [minimal Hugging Face example](examples/post_training_hf/README.md) as a template.
 

diff --git a/examples/benchmark/README.md b/examples/benchmark/README.md
@@ -8,61 +8,57 @@ This guide provides instructions for evaluating models on the [Cosmos-Reason1 Be
 
 ## Setup
 
-Prerequisites:
+1. Perform the [Setup](../../README.md#setup) steps outlined in the main README.
 
-- [Setup](../../README.md#setup)
-
-Change directory:
+2. Change to the `benchmark` directory:
 
 ```shell
 cd examples/benchmark
 ```
 
-## Prepare Dataset
-
-Request access:
-
-- [AgiBotWorld-Beta](https://huggingface.co/datasets/agibot-world/AgiBotWorld-Beta/tree/main)
-
-Download annotations and sample video clips:
-
-```bash
-# Download
-hf download --repo-type dataset nvidia/Cosmos-Reason1-Benchmark --local-dir data/benchmark
-# Unpack
-for file in data/benchmark/**/*.tar.gz; do tar -xzf "$file" -C "$(dirname "$file")"; done
-```
-
-> **Note:**
-> This downloads:
->
-> - Annotations for:
->   - `AV` # For autonomous vehicles' general description, driving difficulty, and notice
->   - [RoboVQA](https://robovqa.github.io/) # Videos, instructions, and question-answer pairs of agents (robots, humans, humans-with-grasping-tools) executing a task.
->   - [AgiBot-World](https://github.com/OpenDriveLab/AgiBot-World) # A wide range of real-life tasks for robot manipulation
->   - [BridgeData V2](https://rail-berkeley.github.io/bridgedata/) # A wide array of robotic manipulation behaviors
->   - [HoloAssist Dataset](https://holoassist.github.io/) # Crucial first-person perspectives that provide natural and immersive understanding of human actions
-> - Video clips for:
->   - `AV`
->   - `RoboVQA`
-
-[Optional] Downloading the full dataset will take a very long time and requires multiple terabytes of disk space:
-
-```bash
-./tools/eval/process_raw_data.py --data_dir data --task benchmark
-```
-
-> **Note:**
-> This downloads:
->
-> - Video clips for:
->   - AgiBot-World
->   - BridgeData V2
->   - HoloAssist
+## Prepare the Dataset
+
+1. Request access to the [AgiBotWorld-Beta](https://huggingface.co/datasets/agibot-world/AgiBotWorld-Beta/tree/main) dataset.
+
+2. Download annotations and sample video clips:
+
+   ```bash
+   # Download
+   hf download --repo-type dataset nvidia/Cosmos-Reason1-Benchmark --local-dir data/benchmark
+   # Unpack
+   for file in data/benchmark/**/*.tar.gz; do tar -xzf "$file" -C "$(dirname "$file")"; done
+   ```
+
+   > **Note:**
+   > The following will be downloaded:
+   >
+   > - Annotations:
+   >   - `AV` # For autonomous vehicles' general description, driving difficulty, and notice
+   >   - [RoboVQA](https://robovqa.github.io/) # Videos, instructions, and question-answer pairs of agents (robots, humans, humans-with-grasping-tools) executing a task.
+   >   - [AgiBot-World](https://github.com/OpenDriveLab/AgiBot-World) # A wide range of real-life tasks for robot manipulation
+   >   - [BridgeData V2](https://rail-berkeley.github.io/bridgedata/) # A wide array of robotic manipulation behaviors
+   >   - [HoloAssist Dataset](https://holoassist.github.io/) # Crucial first-person perspectives that provide natural and immersive understanding of human actions
+   > - Video clips:
+   >   - `AV`
+   >   - `RoboVQA`
+
+3. [Optional] Download the full dataset. This takes a long time and requires multiple terabytes of disk space:
+
+   ```bash
+   ./tools/eval/process_raw_data.py --data_dir data --task benchmark
+   ```
+
+   > **Note:**
+   > The following will be downloaded:
+   >
+   > - Video clips:
+   >   - AgiBot-World
+   >   - BridgeData V2
+   >   - HoloAssist
 
 ## Run Evaluation
 
-Configure evaluation settings by editing [`configs/evaluate.yaml`](configs/evaluate.yaml).
+Configure evaluation settings by editing the [`configs/evaluate.yaml`](configs/evaluate.yaml) file.
 
 Evaluate the model on the dataset:
 
@@ -72,7 +68,7 @@ Evaluate the model on the dataset:
 
 ### Calculate Accuracy
 
-Calculate accuracy of the results:
+Use the following script to calculate the accuracy of the results:
 
 ```bash
 ./tools/eval/calculate_accuracy.py --result_dir outputs/benchmark
@@ -84,4 +80,4 @@ The script compares model predictions against ground-truth answers:
 
 For open-ended questions, a prediction is considered correct if it exactly matches the ground truth (case-insensitive string match). For multiple-choice questions, the selected option is compared against the correct choice.
 
-> **Note:** These scoring rules follow common practices in VLM QA literature, but users are encouraged to adapt or extend them for specific use cases (e.g., partial credit, VQA-style soft accuracy).
+> **Note:** These scoring rules follow common practices in VLM QA literature, but users are encouraged to adapt or extend them for specific use cases (e.g. partial credit, VQA-style soft accuracy).
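
Review note: the scoring rule described in the README text above (case-insensitive exact match for open-ended answers, option comparison for multiple-choice answers) can be sketched as follows. This is a minimal illustration, not the implementation in `tools/eval/calculate_accuracy.py`. The function names `normalize`, `is_correct`, and `accuracy` are hypothetical, and stripping surrounding whitespace before comparing is an assumption that the README does not state.

```python
from typing import Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase the text and strip surrounding whitespace (stripping is an assumption)."""
    return text.strip().lower()


def is_correct(prediction: str, ground_truth: str) -> bool:
    """Case-insensitive exact match.

    For multiple-choice questions, pass the selected option and the
    correct option (for example "A" and "a").
    """
    return normalize(prediction) == normalize(ground_truth)


def accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prediction, ground_truth) pairs that match; 0.0 for empty input."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_correct(p, g) for p, g in pairs) / len(pairs)
```

Under this rule a prediction such as `"yes, it is"` does not match the ground truth `"yes"`, which is why the note above suggests adapting the rules when partial credit is wanted.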
diff --git a/examples/post_training/README.md b/examples/post_training/README.md
@@ -6,61 +6,62 @@ This guide provides instructions for post-training Cosmos-Reason1 on the [SFT](h
 
 ## Setup
 
-### Install
+### Installation
 
-Prerequisites:
+1. Perform the [Setup](../../README.md#setup) steps outlined in the main README.
 
-- [Setup](../../README.md#setup)
+2. Install system dependencies:
 
-Install system dependencies:
+   - [redis](https://redis.io/docs/latest/operate/oss_and_stack/install/archive/install-redis/)
 
-- [redis](https://redis.io/docs/latest/operate/oss_and_stack/install/archive/install-redis/)
+     ```shell
+     pkgm install redis-server
+     # or
+     conda install -c conda-forge redis-server
+     ```
 
-  ```shell
-  pkgm install redis-server
-  # or
-  conda install -c conda-forge redis-server
-  ```
-
-Install the package:
+3. Install the package:
 
 ```shell
 cd examples/post_training
 just install
 source .venv/bin/activate
 ```
 
-### Monitor
+### Monitoring
 
-[Optional] We recommend that you to use [wandb](https://wandb.ai/) for training monitoring.
+We recommend using [wandb](https://wandb.ai/) to monitor training.
 
 1. Acquire your [WANDB_API_KEY](https://wandb.ai/authorize).
-1. Login:
+
+2. Log in to wandb:
 
   ```bash
   uv tool install -U wandb
   wandb login
   ```
 
-When you run training, you will see the `wandb` link in the logging:
+When you run training, the `wandb` link appears in the log output:
 
 ```bash
 wandb: 🚀 View run at https://wandb.ai/${WANDB_USER_NAME}/${config.logging.project_name}/runs/20250515101157
 ```
 
 ## Training
 
-> **_NOTE:_** Following the below training steps will trigger downloading around 200GB of model and dataset files from Hugging Face, please make sure your `~/.cache` directory (or set `HF_HOME` and `COSMOS_CACHE` environment variables to a directory that) has enough storage space.
+> **_NOTE:_** The training steps below download around 200GB of model and dataset files from Hugging Face. Ensure that your `~/.cache` directory has enough storage space, or set the `HF_HOME` and `COSMOS_CACHE` environment variables to a directory that does.
 
 ### Supervised Fine-Tuning (SFT)
 
-The SFT training can improve the model's capability on certain tasks with a similar distribution of the training dataset. E.g., training with `robovqa` dataset can improve the model's performance on the robotics-focused visual question answering scenarios.
+SFT training can improve model capability on tasks that have a similar distribution to that of the training dataset. For example, training with the `robovqa` dataset can improve performance on robotics-focused visual question answering scenarios.
 
-Minimum Requirements:
+#### Minimum Requirements
 
 - 4 GPUs with 80GB of memory
 
-Configure settings by editing [configs/sft.toml](configs/sft.toml). Variants:
+#### Configuration
+
+Configure settings by editing [configs/sft.toml](configs/sft.toml). Variants include the following:
 
 - 8 GPU
 
@@ -69,7 +70,9 @@ Configure settings by editing [configs/sft.toml](configs/sft.toml). Variants:
   dp_shard_size = 8
   ```
 
-Run training:
+#### Training
+
+Run training as follows:
 
 ```shell
 cosmos-rl --config configs/sft.toml ./tools/dataset/cosmos_sft.py
@@ -83,13 +86,15 @@ After training finishes, the final output checkpoint can be found in the log:
 
 ### Reinforcement Learning (RL)
 
-The RL training can improve the model's reasoning capability on certain tasks with the reasoning training dataset.
+RL training on the reasoning training dataset can improve the model's reasoning capability on certain tasks.
 
-Minimum Requirements:
+#### Minimum Requirements
 
 - 4 GPUs with 80GB of memory
 
-Configure settings by editing [configs/rl.toml](configs/rl.toml). Variants:
+#### Configuration
+
+Configure settings by editing [configs/rl.toml](configs/rl.toml). Variants include the following:
 
 - 8 GPU
 
@@ -101,7 +106,9 @@ Configure settings by editing [configs/rl.toml](configs/rl.toml). Variants:
   dp_shard_size = 4
   ```
 
-Run training:
+#### Training
+
+Run training as follows:
 
 ```shell
 cosmos-rl --config configs/rl.toml tools/dataset/cosmos_grpo.py