Add repository fallback for experimental checkpoint downloads #198

Open
recrack wants to merge 3 commits into nvidia-cosmos:main from recrack:feature/checkpoint-repository-fallback

@recrack recrack commented Mar 10, 2026

Fixes: #197

Problem

Users may lack permission to access nvidia/Cosmos-Experimental but have access to nvidia-cosmos-ea/Cosmos-Experimental. The current implementation fails immediately without trying the alternative repository.

Summary

Add automatic repository fallback for experimental checkpoints. When a download from nvidia/Cosmos-Experimental fails, automatically retry with nvidia-cosmos-ea/Cosmos-Experimental.

Changes

cosmos_transfer2/_src/imaginaire/utils/checkpoint_db.py

  • Added repository list logic in CheckpointFileHf._download()
  • Implemented try-except loop to retry with fallback repository on failure
  • Added informative RuntimeError when all repositories fail
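
The diff itself is not shown in this conversation, so the following is only a minimal sketch of the try-except fallback loop described in the bullets above, not the code in CheckpointFileHf._download(). The function names, the injected `download` callable, and the error-message wording are all hypothetical; only the two repository names come from the PR description. It assumes, per the commit message, that either experimental repository may be the starting point.

```python
from typing import Callable, List

# Repository names taken from the PR description.
PRIMARY_REPO = "nvidia/Cosmos-Experimental"
FALLBACK_REPO = "nvidia-cosmos-ea/Cosmos-Experimental"


def candidate_repositories(repo_id: str) -> List[str]:
    """Return the repositories to try, in order (hypothetical helper)."""
    if repo_id == PRIMARY_REPO:
        return [PRIMARY_REPO, FALLBACK_REPO]
    if repo_id == FALLBACK_REPO:
        return [FALLBACK_REPO, PRIMARY_REPO]
    # Non-experimental repositories get no fallback.
    return [repo_id]


def download_with_fallback(
    repo_id: str,
    filename: str,
    download: Callable[[str, str], str],
) -> str:
    """Try each candidate repository and return the first successful path.

    `download` is an injected callable (repo_id, filename) -> local path,
    standing in for whatever Hugging Face download call the real code uses.
    """
    errors = []
    for repo in candidate_repositories(repo_id):
        try:
            return download(repo, filename)
        except Exception as exc:  # deliberately broad in this sketch
            errors.append(f"{repo}: {exc}")
    # Every candidate failed: report all attempts in one error.
    raise RuntimeError(
        f"Failed to download {filename!r} from all repositories: "
        + "; ".join(errors)
    )
```

Collecting each failure before raising keeps the final RuntimeError informative: the user sees which repositories were attempted and why each one was rejected.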

cosmos_transfer2/_src/imaginaire/utils/checkpoint_db_test.py

  • test_experimental_repository_fallback: Verifies successful fallback from primary to alternative repository
  • test_non_experimental_repository_no_fallback: Ensures non-experimental repositories don't use fallback behavior
  • test_both_repositories_fail: Confirms RuntimeError is raised when all repositories fail
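
As a rough illustration of how the three tests above could be structured, here is a sketch using unittest.mock. The `fetch` function is a hypothetical stand-in for the code under test, and "nvidia/Some-Other-Repo" is an invented repository name. The real tests in checkpoint_db_test.py are not shown in this conversation and may patch different targets, such as os.path.exists, which one of the commits mentions. The sketch also assumes that a failing non-experimental download surfaces as a RuntimeError.

```python
from unittest import mock

PRIMARY = "nvidia/Cosmos-Experimental"
FALLBACK = "nvidia-cosmos-ea/Cosmos-Experimental"


def fetch(repo_id, filename, download):
    """Hypothetical stand-in for the code under test."""
    repos = [PRIMARY, FALLBACK] if repo_id == PRIMARY else [repo_id]
    errors = []
    for repo in repos:
        try:
            return download(repo_id=repo, filename=filename)
        except Exception as exc:
            errors.append(f"{repo}: {exc}")
    raise RuntimeError("all repositories failed: " + "; ".join(errors))


def test_experimental_repository_fallback():
    # First call raises, second call returns a path.
    download = mock.Mock(side_effect=[PermissionError("403"), "/cache/model.pt"])
    assert fetch(PRIMARY, "model.pt", download) == "/cache/model.pt"
    tried = [call.kwargs["repo_id"] for call in download.call_args_list]
    assert tried == [PRIMARY, FALLBACK]


def test_non_experimental_repository_no_fallback():
    download = mock.Mock(side_effect=PermissionError("403"))
    try:
        fetch("nvidia/Some-Other-Repo", "model.pt", download)
    except RuntimeError:
        pass
    else:
        raise AssertionError("expected RuntimeError")
    # Only the requested repository was attempted.
    assert download.call_count == 1


def test_both_repositories_fail():
    download = mock.Mock(side_effect=PermissionError("403"))
    try:
        fetch(PRIMARY, "model.pt", download)
    except RuntimeError as exc:
        assert PRIMARY in str(exc) and FALLBACK in str(exc)
    else:
        raise AssertionError("expected RuntimeError")
    assert download.call_count == 2
```

Mocking the download call keeps the tests offline and fast, which is consistent with the sub-second runtime reported below.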

Test Results: All 3 tests pass successfully:

cosmos_transfer2/_src/imaginaire/utils/checkpoint_db_test.py::test_experimental_repository_fallback PASSED [ 33%]
cosmos_transfer2/_src/imaginaire/utils/checkpoint_db_test.py::test_non_experimental_repository_no_fallback PASSED [ 66%]
cosmos_transfer2/_src/imaginaire/utils/checkpoint_db_test.py::test_both_repositories_fail PASSED [100%]

3 passed, 7 warnings in 0.83s

recrack commented Mar 10, 2026

Author

I'll run lint checks according to the contributing guidelines and submit a fix commit.

recrack added 3 commits March 12, 2026 10:10
Add fallback logic to CheckpointFileHf._download() to support both
nvidia/Cosmos-Experimental and nvidia-cosmos-ea/Cosmos-Experimental
repositories. When downloading experimental checkpoints, try the original
repository first, then fallback to the alternative repository if the first
attempt fails. This ensures checkpoint downloads work even when only one
of the repositories is accessible.

Signed-off-by: Youngmin Yoo <recrack@gmail.com>
…fallback

- Fix test_non_experimental_repository_no_fallback by adding os.path.exists mock
- Ensures consistency with test_experimental_repository_fallback
- All three repository fallback tests now pass successfully

Signed-off-by: Youngmin Yoo <recrack@gmail.com>
- Fix 3 ruff errors (auto-fixed)
- Format 2 files with ruff format
- Pass all pre-commit checks
@recrack recrack force-pushed the feature/checkpoint-repository-fallback branch from 186ac06 to b9a5a1a Compare March 12, 2026 01:10

recrack commented Mar 12, 2026

Author

Updated. Test results:


$ just lint
uv tool install "pre-commit>=4.3.0"
`pre-commit>=4.3.0` is already installed
pre-commit install -c .pre-commit-config-base.yaml
pre-commit installed at .git/hooks/pre-commit
pre-commit run -a  || pre-commit run -a
addlicense...............................................................Passed
markdown-toc-creator.....................................................Passed
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check for broken symlinks............................(no files to check)Skipped
Generate uv lock files for projects......................................Passed
Generate uv lock files for scripts.......................................Passed
ruff fix.................................................................Passed
ruff format..............................................................Passed
link check relative......................................................Passed
$ just test
...
tests/assets_test.py::test_inference_assets[car_example/multicontrol]
tests/assets_test.py::test_inference_assets[image_example]
tests/assets_test.py::test_inference_assets[robot_example/edge]
tests/assets_test.py::test_inference_assets[robot_example/vis0]
tests/assets_test.py::test_inference_assets[robot_example/depth]
tests/assets_test.py::test_inference_assets[car_example/edge]
tests/assets_test.py::test_inference_assets[car_example/seg]
tests/assets_test.py::test_inference_assets[multiview_example]
tests/assets_test.py::test_inference_assets[robot_example/vis1]
tests/assets_test.py::test_inference_assets[robot_multiview_control-agibot]
[gw3] [  8%] PASSED tests/assets_test.py::test_inference_assets[image_example]
[gw0] [ 16%] PASSED tests/assets_test.py::test_inference_assets[car_example/edge]
[gw6] [ 25%] PASSED tests/assets_test.py::test_inference_assets[robot_example/vis1]
[gw1] [ 33%] PASSED tests/assets_test.py::test_inference_assets[car_example/seg]
[gw5] [ 41%] PASSED tests/assets_test.py::test_inference_assets[robot_example/edge]
[gw2] [ 50%] PASSED tests/assets_test.py::test_inference_assets[car_example/multicontrol]
[gw4] [ 58%] PASSED tests/assets_test.py::test_inference_assets[robot_example/depth]
[gw9] [ 66%] PASSED tests/assets_test.py::test_inference_assets[multiview_example]
tests/assets_test.py::test_inference_assets[robot_example/multicontrol]
[gw10] [ 75%] PASSED tests/assets_test.py::test_inference_assets[robot_example/multicontrol]
[gw13] [ 83%] PASSED tests/assets_test.py::test_inference_assets[robot_multiview_control-agibot]
tests/assets_test.py::test_inference_assets[robot_example/seg]
[gw8] [ 91%] PASSED tests/assets_test.py::test_inference_assets[robot_example/seg]
[gw7] [100%] PASSED tests/assets_test.py::test_inference_assets[robot_example/vis0]

============================================================================================= slowest 5 durations ==============================================================================================
0.00s call     tests/assets_test.py::test_inference_assets[multiview_example]
0.00s call     tests/assets_test.py::test_inference_assets[robot_multiview_control-agibot]
0.00s call     tests/assets_test.py::test_inference_assets[car_example/seg]
0.00s call     tests/assets_test.py::test_inference_assets[robot_example/multicontrol]
0.00s call     tests/assets_test.py::test_inference_assets[car_example/multicontrol]
============================================================================================== 12 passed in 4.90s ==============================================================================================

Development

Successfully merging this pull request may close these issues.

[BUG] Experimental checkpoint download fails when primary repository is unavailable

1 participant