Synchronize distributed remote downloads #65

Open
fallintoplace wants to merge 1 commit into NVIDIA:main from fallintoplace:fix/distributed-download-sync

@fallintoplace

Summary

  • synchronize remote download_file() calls across distributed ranks before returning the resolved path
  • propagate rank-0 download failures to every rank instead of letting the other ranks proceed with a stale or missing path
  • validate that the final downloaded file exists after synchronization
  • add a hermetic regression test for the success, failure, and missing-file cases

Why

download_file() only fetched on rank 0, but the other ranks returned the target path immediately. In multi-GPU inference, that lets code on the other ranks race ahead and try to open the file before rank 0 finishes wiring up the symlink and metadata.

I used broadcast_object_list() instead of a plain barrier so that the same rendezvous also carries any rank-0 download error to the other ranks. That avoids turning an early rank-0 failure into a later hang or a confusing per-rank file error.
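
For reviewers who want the pattern in one place, here is a minimal sketch of the idea described above. It is not the code in this PR: sync_download, fetch, and torch_broadcast are hypothetical names, and the broadcast step is injectable so the logic can be exercised without a process group. The only real API assumed is torch.distributed.broadcast_object_list(object_list, src=0).

```python
import os
from typing import Callable, List, Optional


def torch_broadcast(payload: List[Optional[str]]) -> None:
    """Broadcast rank 0's payload in place; a no-op outside a distributed run."""
    import torch.distributed as dist  # lazy import: only needed in real runs

    if dist.is_available() and dist.is_initialized():
        # Every rank blocks here until rank 0 arrives, so this doubles as a barrier.
        dist.broadcast_object_list(payload, src=0)


def sync_download(
    path: str,
    fetch: Callable[[str], None],
    rank: int,
    broadcast: Callable[[List[Optional[str]]], None] = torch_broadcast,
) -> str:
    """Hypothetical helper: fetch on rank 0, then share the outcome with all ranks."""
    payload: List[Optional[str]] = [None]  # None means "rank 0 succeeded"
    if rank == 0:
        try:
            fetch(path)
        except Exception as exc:
            # Send the error as a string so it is always picklable.
            payload[0] = f"{type(exc).__name__}: {exc}"

    # Rank 0 reaches the rendezvous even after a failure, so no rank hangs.
    broadcast(payload)

    if payload[0] is not None:
        raise RuntimeError(f"rank 0 failed to download {path}: {payload[0]}")
    if not os.path.exists(path):
        raise FileNotFoundError(f"download reported success but {path} is missing")
    return path
```

Catching the exception on rank 0 before the broadcast is the key step in this sketch: if rank 0 raised immediately, the other ranks would wait at the collective forever. Passing a stub for broadcast is one way to keep a regression test hermetic, although the test in this PR may be structured differently.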

Validation

  • uvx ruff check /Users/hoangvu/Code/OSS/cosmos-framework/cosmos_framework/inference/common/args.py /Users/hoangvu/Code/OSS/cosmos-framework/cosmos_framework/inference/common/download_file_sync_test.py
  • uvx ruff format --check /Users/hoangvu/Code/OSS/cosmos-framework/cosmos_framework/inference/common/args.py /Users/hoangvu/Code/OSS/cosmos-framework/cosmos_framework/inference/common/download_file_sync_test.py
  • PYTHONPATH=/Users/hoangvu/Code/OSS/cosmos-framework uvx pytest --noconftest /Users/hoangvu/Code/OSS/cosmos-framework/cosmos_framework/inference/common/download_file_sync_test.py -o addopts=
