72 changes: 38 additions & 34 deletions README.md
@@ -4,8 +4,9 @@

### [Paper](https://arxiv.org/abs/2503.15558) | [Website](https://research.nvidia.com/labs/dir/cosmos-reason1/) | [HuggingFace](https://huggingface.co/collections/nvidia/cosmos-reason1-67c9e926206426008f1da1b7)

NVIDIA Cosmos Reason – an open, customizable, 7B-parameter reasoning vision language model (VLM) for physical AI and robotics - enables robots and vision AI agents to reason like humans, using prior knowledge, physics understanding and common sense to understand and act in the real world. This model understands space, time, and fundamental physics, and can serve as a planning model to reason what steps an embodied agent might take next.
Cosmos Reason excels at navigating the long tail of diverse scenarios of the physical world with spatial-temporal understanding. Cosmos Reason is post-trained with physical common sense and embodied reasoning data with supervised fine-tuning and reinforcement learning. It uses chain-of-thought reasoning capabilities to understand world dynamics without human annotations.
NVIDIA Cosmos-Reason1 is an [open](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license), customizable, 7B-parameter reasoning vision language model (VLM) for physical AI and robotics. It enables robots and vision AI agents to reason like humans, using prior knowledge, physics understanding, and common sense to understand and act in the real world. This model understands space, time, and fundamental physics, and can serve as a planning model to reason what steps an embodied agent might take next.

Cosmos-Reason1 excels at navigating the long tail of diverse physical world scenarios with spatial-temporal understanding. The Cosmos-Reason1 model is post-trained on physical common sense and embodied reasoning data using supervised fine-tuning and reinforcement learning. It uses chain-of-thought reasoning capabilities to understand world dynamics without human annotations.

## News

@@ -18,46 +19,49 @@ Cosmos Reason excels at navigating the long tail of diverse scenarios of the phy

* [Cosmos-Reason1-7B](https://huggingface.co/nvidia/Cosmos-Reason1-7B)

## Minimum Hardware Requirements

* Inference: 1 GPU with 24GB memory

* Post-training: 4 GPUs with 80GB of memory

## Setup

Install system dependencies:
1. Install system dependencies:

* [pkgx](https://github.com/pkgxdev/pkgx?tab=readme-ov-file#quickstart)
* [pkgx](https://github.com/pkgxdev/pkgx?tab=readme-ov-file#quickstart)

```shell
brew install pkgx || curl https://pkgx.sh | sh
```
```shell
brew install pkgx || curl https://pkgx.sh | sh
```

* [uv](https://docs.astral.sh/uv/getting-started/installation/)
* [uv](https://docs.astral.sh/uv/getting-started/installation/)

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```
```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```

* [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli)
* [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli)

```shell
uv tool install -U "huggingface_hub[cli]"
hf auth login
```
```shell
uv tool install -U "huggingface_hub[cli]"
hf auth login
```

Clone the repository:
2. Clone the repository:

```shell
git clone https://github.com/nvidia-cosmos/cosmos-reason1.git
cd cosmos-reason1
```
```shell
git clone https://github.com/nvidia-cosmos/cosmos-reason1.git
cd cosmos-reason1
```

## Inference

Minimum Requirements:

* 1 GPU with 24GB memory

Cosmos-Reason1 is included in [`transformers>=4.51.3`](https://huggingface.co/docs/transformers/en/index).
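
Before running the scripts, it can help to confirm that the installed `transformers` release meets the `4.51.3` minimum stated above. The following is a minimal sketch, not part of this repository: the helper names (`parse_release`, `meets_minimum`, `installed_transformers_version`) are hypothetical, and the comparison looks only at the leading numeric components, so pre-release suffixes such as `rc1` are ignored.

```python
from importlib import metadata
from typing import Optional, Tuple

MINIMUM_TRANSFORMERS = "4.51.3"  # minimum version stated in this README


def parse_release(version: str) -> Tuple[int, ...]:
    """Return the leading numeric components, e.g. '4.52.0.dev0' -> (4, 52, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MINIMUM_TRANSFORMERS) -> bool:
    """Compare numeric release tuples; a simplification of full version ordering."""
    return parse_release(installed) >= parse_release(minimum)


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_transformers_version()
    if found is None:
        print("transformers is not installed")
    else:
        print(found, "ok" if meets_minimum(found) else "too old")
```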

We provide example inference scripts:
Cosmos-Reason1 provides the following example inference scripts:

* [Minimal example](scripts/inference_sample.py)

@@ -67,29 +71,29 @@ We provide example inference scripts:

* [Full example](scripts/inference.py)

Caption the video:
Caption the sample video:

```shell
./scripts/inference.py --prompt prompts/caption.yaml --videos assets/sample.mp4 -v
```

Ask a question about the video with reasoning:
Ask a question about the sample video with reasoning:

```shell
./scripts/inference.py --prompt prompts/question.yaml --question 'What are the potential safety hazards?' --reasoning --videos assets/sample.mp4 -v
```

Temporally caption the video and save the input frames to `outputs/temporal_caption_text` for debugging:
Temporally caption the sample video and save the input frames to `outputs/temporal_caption_text` for debugging:

```shell
./scripts/inference.py --prompt prompts/temporal_caption_text.yaml --videos assets/sample.mp4 --timestamp -v -o outputs/temporal_caption_text
```

Configure inference by editing:
You can configure inference by editing the prompt and configuration files:

* [Prompts](prompts/README.md)
* [Sampling Parameters](configs/sampling_params.yaml)
* [Vision Processor Config](configs/vision_config.yaml)
* [Prompts](prompts/README.md)
* [Sampling Parameters](configs/sampling_params.yaml)
* [Vision Processor Config](configs/vision_config.yaml)

## Tutorials

@@ -101,7 +105,7 @@ We provide example inference scripts:

## Post-Training

The [nvidia-cosmos/cosmos-rl](https://github.com/nvidia-cosmos/cosmos-rl) repository is an async post-training framework specialized for Supervised Fine-Tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). It prioritizes performance, scalability, and fault tolerance.
Cosmos-Reason1 uses the [nvidia-cosmos/cosmos-rl](https://github.com/nvidia-cosmos/cosmos-rl) repository for post-training. It is an async post-training framework specialized for Supervised Fine-Tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). It prioritizes performance, scalability, and fault tolerance.

To support a custom dataset format, use the [minimal Hugging Face example](examples/post_training_hf/README.md) as a template.

92 changes: 44 additions & 48 deletions examples/benchmark/README.md
@@ -8,61 +8,57 @@ This guide provides instructions for evaluating models on the [Cosmos-Reason1 Be

## Setup

Prerequisites:
1. Perform the [Setup](../../README.md#setup) steps outlined in the main README.

- [Setup](../../README.md#setup)

Change directory:
2. Change to the `benchmark` directory:

```shell
cd examples/benchmark
```

## Prepare Dataset

Request access:

- [AgiBotWorld-Beta](https://huggingface.co/datasets/agibot-world/AgiBotWorld-Beta/tree/main)

Download annotations and sample video clips:

```bash
# Download
hf download --repo-type dataset nvidia/Cosmos-Reason1-Benchmark --local-dir data/benchmark
# Unpack
for file in data/benchmark/**/*.tar.gz; do tar -xzf "$file" -C "$(dirname "$file")"; done
```

> **Note:**
> This downloads:
>
> - Annotations for:
> - `AV` # For autonomous vehicles' general description, driving difficulty, and notice
> - [RoboVQA](https://robovqa.github.io/) # Videos, instructions, and question-answer pairs of agents (robots, humans, humans-with-grasping-tools) executing a task.
> - [AgiBot-World](https://github.com/OpenDriveLab/AgiBot-World) # A wide range of real-life tasks for robot manipulation
> - [BridgeData V2](https://rail-berkeley.github.io/bridgedata/) # A wide array of robotic manipulation behaviors
> - [HoloAssist Dataset](https://holoassist.github.io/) # Crucial first-person perspectives that provide natural and immersive understanding of human actions
> - Video clips for:
> - `AV`
> - `RoboVQA`

[Optional] Downloading the full dataset will take a very long time and requires multiple terabytes of disk space:

```bash
./tools/eval/process_raw_data.py --data_dir data --task benchmark
```

> **Note:**
> This downloads:
>
> - Video clips for:
> - AgiBot-World
> - BridgeData V2
> - HoloAssist
## Prepare the Dataset

1. Request access to the [AgiBotWorld-Beta](https://huggingface.co/datasets/agibot-world/AgiBotWorld-Beta/tree/main) dataset.

2. Download annotations and sample video clips:

```bash
# Download
hf download --repo-type dataset nvidia/Cosmos-Reason1-Benchmark --local-dir data/benchmark
# Unpack
for file in data/benchmark/**/*.tar.gz; do tar -xzf "$file" -C "$(dirname "$file")"; done
```

> **Note:**
> The following will be downloaded:
>
> - Annotations:
> - `AV` # For autonomous vehicles' general description, driving difficulty, and notice
> - [RoboVQA](https://robovqa.github.io/) # Videos, instructions, and question-answer pairs of agents (robots, humans, humans-with-grasping-tools) executing a task.
> - [AgiBot-World](https://github.com/OpenDriveLab/AgiBot-World) # A wide range of real-life tasks for robot manipulation
> - [BridgeData V2](https://rail-berkeley.github.io/bridgedata/) # A wide array of robotic manipulation behaviors
> - [HoloAssist Dataset](https://holoassist.github.io/) # Crucial first-person perspectives that provide natural and immersive understanding of human actions
> - Video clips:
> - `AV`
> - `RoboVQA`

3. [Optional] Download the full dataset. This will take a long time and requires multiple terabytes of disk space:

```bash
./tools/eval/process_raw_data.py --data_dir data --task benchmark
```

> **Note:**
> The following will be downloaded:
>
> - Video clips:
> - AgiBot-World
> - BridgeData V2
> - HoloAssist
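
The unpack step in the list above can also be done from Python, which may be convenient where the shell's handling of `**` varies (in bash, `**` only recurses into subdirectories when `globstar` is enabled). This is a sketch under assumptions rather than a script shipped with the repository: `unpack_archives` is a hypothetical helper, it searches `data/benchmark` recursively at any depth, and it assumes the downloaded archives are trusted.

```python
import tarfile
from pathlib import Path


def unpack_archives(root: str) -> list:
    """Extract every *.tar.gz under root into the directory that contains it."""
    extracted = []
    for archive in sorted(Path(root).rglob("*.tar.gz")):
        # Archives are assumed to come from a trusted source.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=archive.parent)
        extracted.append(archive)
    return extracted


if __name__ == "__main__":
    for path in unpack_archives("data/benchmark"):
        print("unpacked", path)
```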

## Run Evaluation

Configure evaluation settings by editing [`configs/evaluate.yaml`](configs/evaluate.yaml).
Configure evaluation settings by editing the [`configs/evaluate.yaml`](configs/evaluate.yaml) file.

Evaluate the model on the dataset:

@@ -72,7 +68,7 @@ Evaluate the model on the dataset:

### Calculate Accuracy

Calculate accuracy of the results:
Use the following script to calculate accuracy of the results:

```bash
./tools/eval/calculate_accuracy.py --result_dir outputs/benchmark
@@ -84,4 +80,4 @@ The script compares model predictions against ground-truth answers:

For open-ended questions, a prediction is considered correct if it exactly matches the ground truth (case-insensitive string match). For multiple-choice questions, the selected option is compared against the correct choice.

> **Note:** These scoring rules follow common practices in VLM QA literature, but users are encouraged to adapt or extend them for specific use cases (e.g., partial credit, VQA-style soft accuracy).
> **Note:** These scoring rules follow common practices in VLM QA literature, but users are encouraged to adapt or extend them for specific use cases (e.g. partial credit, VQA-style soft accuracy).
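
The scoring rule described above can be illustrated with a short sketch. This is not the code in `tools/eval/calculate_accuracy.py`; the function names are hypothetical, and trimming surrounding whitespace before comparing is an added assumption. For multiple-choice questions, the same comparison is applied to the selected option label.

```python
from typing import Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and trim the text (trimming is an assumption of this sketch)."""
    return text.strip().lower()


def is_correct(prediction: str, ground_truth: str) -> bool:
    """Case-insensitive exact match between prediction and ground truth."""
    return normalize(prediction) == normalize(ground_truth)


def accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prediction, ground_truth) pairs that match; 0.0 if empty."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_correct(p, g) for p, g in pairs) / len(pairs)


if __name__ == "__main__":
    results = [("A", "a"), ("cup", "Cup"), ("B", "C")]
    print(f"accuracy: {accuracy(results):.3f}")
```

Partial credit or VQA-style soft accuracy would replace `is_correct` with a function that returns a score between 0 and 1.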
57 changes: 32 additions & 25 deletions examples/post_training/README.md
@@ -6,61 +6,62 @@ This guide provides instructions for post-training Cosmos-Reason1 on the [SFT](h

## Setup

### Install
### Installation

Prerequisites:
1. Perform the [Setup](../../README.md#setup) steps outlined in the main README.

- [Setup](../../README.md#setup)
2. Install system dependencies:

Install system dependencies:
- [redis](https://redis.io/docs/latest/operate/oss_and_stack/install/archive/install-redis/)

- [redis](https://redis.io/docs/latest/operate/oss_and_stack/install/archive/install-redis/)
```shell
pkgm install redis-server
# or
conda install -c conda-forge redis-server
```

```shell
pkgm install redis-server
# or
conda install -c conda-forge redis-server
```

Install the package:
3. Install the package:

```shell
cd examples/post_training
just install
source .venv/bin/activate
```

### Monitor
### Monitoring

[Optional] We recommend that you to use [wandb](https://wandb.ai/) for training monitoring.
We recommend using [wandb](https://wandb.ai/) to monitor training.

1. Acquire your [WANDB_API_KEY](https://wandb.ai/authorize).
1. Login:

2. Log in to wandb:

```bash
uv tool install -U wandb
wandb login
```

When you run training, you will see the `wandb` link in the logging:
Now, when you run training, the log output will include the `wandb` link:

```bash
wandb: 🚀 View run at https://wandb.ai/${WANDB_USER_NAME}/${config.logging.project_name}/runs/20250515101157
```

## Training

> **_NOTE:_** Following the below training steps will trigger downloading around 200GB of model and dataset files from Hugging Face, please make sure your `~/.cache` directory (or set `HF_HOME` and `COSMOS_CACHE` environment variables to a directory that) has enough storage space.
> **_NOTE:_** The training steps below download around 200GB of model and dataset files from Hugging Face. Ensure that your `~/.cache` directory has enough storage space, or set the `HF_HOME` and `COSMOS_CACHE` environment variables to a directory with enough space.
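
A quick pre-flight check of free disk space can catch this before a long download starts. The sketch below is illustrative only: the helper names are hypothetical, the 200GB figure is taken from the note above, and it assumes the downloads land in the directories named by `HF_HOME` and `COSMOS_CACHE` when those are set, falling back to `~/.cache` otherwise.

```python
import os
import shutil
from pathlib import Path

REQUIRED_GB = 200  # approximate download size from the note above


def candidate_dirs(env) -> list:
    """Directories to check: HF_HOME and COSMOS_CACHE if set, else ~/.cache."""
    names = ("HF_HOME", "COSMOS_CACHE")
    dirs = [Path(env[name]) for name in names if env.get(name)]
    return dirs or [Path.home() / ".cache"]


def nearest_existing(path: Path) -> Path:
    """Walk up to the closest ancestor that already exists on disk."""
    path = path.expanduser().resolve()
    while not path.exists():
        path = path.parent
    return path


def free_gb(path) -> float:
    """Free space, in decimal gigabytes, on the filesystem that holds path."""
    return shutil.disk_usage(nearest_existing(Path(path))).free / 1e9


if __name__ == "__main__":
    for directory in candidate_dirs(os.environ):
        available = free_gb(directory)
        status = "ok" if available >= REQUIRED_GB else "LOW"
        print(f"{directory}: {available:.0f} GB free [{status}]")
```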

### Supervised Fine-Tuning (SFT)

The SFT training can improve the model's capability on certain tasks with a similar distribution of the training dataset. E.g., training with `robovqa` dataset can improve the model's performance on the robotics-focused visual question answering scenarios.
SFT training can improve model capability on tasks whose distribution is similar to that of the training dataset. For example, training with the `robovqa` dataset can improve performance on robotics-focused visual question answering scenarios.

Minimum Requirements:
#### Minimum Requirements

- 4 GPUs with 80GB of memory

Configure settings by editing [configs/sft.toml](configs/sft.toml). Variants:
#### Configuration

Configure settings by editing [configs/sft.toml](configs/sft.toml). Variants include the following:

- 8 GPU

@@ -69,7 +70,9 @@ Configure settings by editing [configs/sft.toml](configs/sft.toml). Variants:
dp_shard_size = 8
```

Run training:
#### Training

Run training as follows:

```shell
cosmos-rl --config configs/sft.toml ./tools/dataset/cosmos_sft.py
@@ -83,13 +86,15 @@ After training finishes, the final output checkpoint can be found in the log:

### Reinforcement Learning (RL)

The RL training can improve the model's reasoning capability on certain tasks with the reasoning training dataset.
RL training with the reasoning training dataset can improve the model's reasoning capability on certain tasks.

Minimum Requirements:
#### Minimum Requirements

- 4 GPUs with 80GB of memory

Configure settings by editing [configs/rl.toml](configs/rl.toml). Variants:
#### Configuration

Configure settings by editing [configs/rl.toml](configs/rl.toml). Variants include the following:

- 8 GPU

@@ -101,7 +106,9 @@ Configure settings by editing [configs/rl.toml](configs/rl.toml). Variants:
dp_shard_size = 4
```

Run training:
#### Training

Run training as follows:

```shell
cosmos-rl --config configs/rl.toml tools/dataset/cosmos_grpo.py