fix: add ci test download cache with refined retry mechanism #26
Merged
GitHub-raw input assets intermittently return transient gateway errors (502/503/504). obstore retries internally but only within a short (~15s) window, so a 504 burst fails the whole test/inference run.

Wrap each download in an outer retry with exponential backoff + jitter (6 attempts, 4s base, 60s cap by default; all env-overridable via COSMOS_DOWNLOAD_*). Permanent errors (400/401/403/404, NotFound/Permission) fail fast instead of burning the budget.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
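The outer retry described in this commit could be sketched roughly as follows. This is a minimal illustration, not the code in `inference/common/args.py`: the specific env var names (`COSMOS_DOWNLOAD_MAX_ATTEMPTS`, `COSMOS_DOWNLOAD_BASE_DELAY`, `COSMOS_DOWNLOAD_MAX_DELAY`), the `DownloadError` type, the `fetch` callable, and the choice of "full jitter" are all assumptions; the commit only states the defaults (6 attempts, 4s base, 60s cap) and the `COSMOS_DOWNLOAD_*` prefix.

```python
import os
import random
import time

# Hypothetical variable names; the commit only says COSMOS_DOWNLOAD_* overrides exist.
MAX_ATTEMPTS = int(os.environ.get("COSMOS_DOWNLOAD_MAX_ATTEMPTS", "6"))
BASE_DELAY_S = float(os.environ.get("COSMOS_DOWNLOAD_BASE_DELAY", "4"))
MAX_DELAY_S = float(os.environ.get("COSMOS_DOWNLOAD_MAX_DELAY", "60"))

# Status codes the commit treats as permanent (no point retrying).
PERMANENT_STATUS = {400, 401, 403, 404}


class DownloadError(Exception):
    """Stand-in error carrying an HTTP status (not obstore's real exception type)."""

    def __init__(self, status):
        super().__init__(f"download failed with HTTP {status}")
        self.status = status


def is_permanent(exc):
    """True for errors that retrying cannot fix."""
    if isinstance(exc, (FileNotFoundError, PermissionError)):
        return True
    return getattr(exc, "status", None) in PERMANENT_STATUS


def backoff_delay(attempt, base=BASE_DELAY_S, cap=MAX_DELAY_S, rng=random.random):
    """Delay before the retry that follows failed attempt `attempt` (1-based).

    Exponential growth capped at `cap`, scaled by a random factor in [0, 1)
    ("full jitter"; the jitter style is an assumption).
    """
    return rng() * min(cap, base * 2 ** (attempt - 1))


def download_with_retry(fetch, url, attempts=MAX_ATTEMPTS, sleep=time.sleep):
    """Call fetch(url), retrying transient failures with backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:
            # Fail fast on permanent errors, and re-raise once the budget is spent.
            if is_permanent(exc) or attempt == attempts:
                raise
            sleep(backoff_delay(attempt))
```

Passing `sleep` and `rng` as parameters keeps the sketch testable without real waiting; a production version would more likely hard-wire them.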
Add an opt-in persistent download cache: when COSMOS_DOWNLOAD_CACHE_DIR is set, input-asset URLs are cached there keyed by URL and reused across runs; when unset, behavior is unchanged (throwaway temp dir).

The asset URLs are pinned to a content commit (.../raw/<sha>/), so the URL fully identifies the bytes. A cache hit skips the network entirely, which means the flaky GitHub-raw 504s are hit at most once per asset (the backoff retry is the cold-cache fallback). Concurrent workers download to a temp file and atomically os.replace it into the cache. Also replaces the leaking TemporaryDirectory(delete=False) with mkdtemp.

Enable it in the unittest CI job via $RUNNER_WORKSPACE/cosmos_input_cache, which is outside the repo tree, so the cleanup step doesn't wipe it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
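A URL-keyed cache with an atomic publish step, as this commit describes, might look like the sketch below. The function names, the SHA-256 keying scheme, and the `fetch_to(url, dest)` callable are hypothetical; only `COSMOS_DOWNLOAD_CACHE_DIR`, the temp-file-then-`os.replace` pattern, and the `mkdtemp` fallback come from the commit message.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def cache_path(cache_dir, url):
    """Map a URL to a stable file name (hashing the URL is an assumed keying scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    suffix = Path(url.split("?", 1)[0]).suffix  # keep e.g. ".mp4" for readability
    return Path(cache_dir) / f"{digest}{suffix}"


def cached_download(url, fetch_to, cache_dir=None):
    """Return a local path for `url`, downloading only on a cache miss.

    `fetch_to(url, dest_path)` is a hypothetical callable that writes the
    downloaded bytes to dest_path.
    """
    cache_dir = cache_dir or os.environ.get("COSMOS_DOWNLOAD_CACHE_DIR")
    if not cache_dir:
        # Cache disabled: download into a throwaway temp dir.
        dest = Path(tempfile.mkdtemp(prefix="cosmos_dl_")) / "asset"
        fetch_to(url, dest)
        return dest

    target = cache_path(cache_dir, url)
    if target.exists():
        return target  # cache hit: no network access at all

    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory so os.replace stays on one
    # filesystem and readers never observe a half-written cache entry.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".partial-")
    os.close(fd)
    try:
        fetch_to(url, Path(tmp_name))
        os.replace(tmp_name, target)
    finally:
        # Only present if the download or the replace failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
    return target
```

If two workers miss at the same time, both download and both call `os.replace`; the last one wins, which is harmless here because a commit-pinned URL always yields the same bytes.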
action/forward_dynamics in nano_inference_smoke_test download the same GitHub-raw input assets as the unittest job, so point inference-smoke at the same persistent $RUNNER_WORKSPACE/cosmos_input_cache. Both jobs (and runs across PRs on the same self-hosted runner) now share one warm cache.

training-smoke (text2image) and the regression jobs (HF datasets) don't fetch these inputs, so they're left unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Trim the comments added for the download retry/cache to one-liners; no behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
test_download_file covers the download/symlink/meta mechanics, which assume each download is independent (a forced re-download lands at a fresh path). The opt-in cache (COSMOS_DOWNLOAD_CACHE_DIR, set in the CI job) deduplicates by URL and broke that assumption. Disable the cache within the test via monkeypatch so the test is deterministic regardless of the ambient CI env.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
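For illustration, disabling the cache inside a test with pytest's built-in `monkeypatch` fixture could look like this. `resolve_cache_dir` and the test name are hypothetical stand-ins; the real test is `test_download_file`, and how it reads the variable is not shown in this PR page.

```python
import os


def resolve_cache_dir():
    """Hypothetical helper: return the cache dir, or None when caching is off."""
    return os.environ.get("COSMOS_DOWNLOAD_CACHE_DIR") or None


def test_download_ignores_ambient_cache(monkeypatch):
    # `monkeypatch` is pytest's built-in fixture; the env change is undone
    # automatically when the test finishes.
    monkeypatch.delenv("COSMOS_DOWNLOAD_CACHE_DIR", raising=False)
    # With the variable removed, every download takes the uncached path,
    # so a forced re-download lands at a fresh location again.
    assert resolve_cache_dir() is None
```

`raising=False` matters here: without it, `delenv` raises `KeyError` on machines where the variable was never set, such as a developer laptop.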
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
foreverlms approved these changes on Jun 8, 2026
Dinghow approved these changes on Jun 9, 2026
rahul-steiger-nv pushed a commit to rahul-steiger-nv/cosmos-framework that referenced this pull request on Jun 15, 2026
### Summary

CI tests download input assets (e.g. action/video inputs) over the network, and these intermittently fail with transient gateway errors (502/503/504), flaking the run. This PR makes those downloads robust and avoids re-fetching the same assets every run.

### Changes

- **Backoff retry** (`inference/common/args.py`): wrap each input download in an outer retry with exponential backoff + jitter (6 attempts, env-overridable via `COSMOS_DOWNLOAD_*`). Permanent errors (400/401/403/404) fail fast.
- **Opt-in download cache**: when `COSMOS_DOWNLOAD_CACHE_DIR` is set, downloads are cached by URL and reused across runs; unset → unchanged behavior. Concurrent writers use an atomic move.
- **CI wiring** (`gpu-tests.yml`): the `unittest` and `inference-smoke` jobs point at a shared persistent cache dir (`$RUNNER_WORKSPACE/cosmos_input_cache`, outside the repo tree so cleanup keeps it), reused across runs and PRs on the same runner.

### Impact

- Production/local behavior unchanged: cache is off unless the env var is set; retry is transparent on success and only adds resilience on failure.
- Only new persisted artifact is the cache dir; replaces previously-leaked `/tmp` temp dirs in those jobs.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>