fix: add ci test download cache with refined retry mechanism #26

Merged: lfengad merged 7 commits into main from colocated-tests-ci on Jun 9, 2026
Conversation

lfengad (Collaborator) commented Jun 8, 2026

Summary

CI tests download input assets (e.g. action/video inputs) over the network, and these downloads intermittently fail with transient gateway errors
(502/503/504), which makes the run flaky. This PR makes the downloads robust and avoids re-fetching the same assets on every run.

Changes

  • Backoff retry (inference/common/args.py): wrap each input download in an outer retry with exponential backoff + jitter (6 attempts, env-overridable via
    COSMOS_DOWNLOAD_*). Permanent errors (400/401/403/404) fail fast.
  • Opt-in download cache: when COSMOS_DOWNLOAD_CACHE_DIR is set, downloads are cached by URL and reused across runs; unset → unchanged behavior.
    Concurrent writers use an atomic move.
  • CI wiring (gpu-tests.yml): the unittest and inference-smoke jobs point at a shared persistent cache dir ($RUNNER_WORKSPACE/cosmos_input_cache,
    outside the repo tree so cleanup keeps it), reused across runs and PRs on the same runner.

Impact

  • Production/local behavior unchanged: cache is off unless the env var is set; retry is transparent on success and only adds resilience on failure.
  • Only new persisted artifact is the cache dir; replaces previously-leaked /tmp temp dirs in those jobs.

lfengad and others added 6 commits June 8, 2026 04:01
GitHub-raw input assets intermittently return transient gateway errors
(502/503/504). obstore retries internally but only within a short (~15s)
window, so a 504 burst fails the whole test/inference run. Wrap each download
in an outer retry with exponential backoff + jitter (6 attempts, 4s base, 60s
cap by default; all env-overridable via COSMOS_DOWNLOAD_*). Permanent errors
(400/401/403/404, NotFound/Permission) fail fast instead of burning the budget.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
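
The outer retry described in this commit could look roughly like the sketch below. It is not the code from `inference/common/args.py`: the `DownloadError` type, the `retry_download` and `backoff_delay` helpers, and the three concrete environment variable names are hypothetical (the PR only says the knobs live under `COSMOS_DOWNLOAD_*`), and "full jitter" is an assumed jitter strategy. Only the defaults (6 attempts, 4s base, 60s cap) and the fail-fast status codes come from the commit message.

```python
import os
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical variable names; the PR only states that the settings are
# overridable through COSMOS_DOWNLOAD_* environment variables.
MAX_ATTEMPTS = int(os.environ.get("COSMOS_DOWNLOAD_MAX_ATTEMPTS", "6"))
BASE_DELAY_S = float(os.environ.get("COSMOS_DOWNLOAD_BASE_DELAY_S", "4"))
MAX_DELAY_S = float(os.environ.get("COSMOS_DOWNLOAD_MAX_DELAY_S", "60"))

# Status codes that retrying cannot fix, so they fail fast.
PERMANENT_STATUS = frozenset({400, 401, 403, 404})


class DownloadError(Exception):
    """Hypothetical error type that carries the HTTP status of a failed download."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delay(attempt: int, base: float = BASE_DELAY_S, cap: float = MAX_DELAY_S) -> float:
    """Full-jitter delay for a 0-based attempt: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def retry_download(
    fetch: Callable[[], T],
    *,
    max_attempts: int = MAX_ATTEMPTS,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds, a permanent error occurs, or attempts run out."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except DownloadError as exc:
            last_attempt = attempt == max_attempts - 1
            if exc.status in PERMANENT_STATUS or last_attempt:
                raise  # permanent error or budget exhausted
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

With the defaults and this jitter choice, the five waits between six attempts are capped at 4, 8, 16, 32, and 60 seconds, so the outer loop can wait up to about 120 seconds in total, compared with the roughly 15-second window of the internal obstore retries.
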
Add an opt-in persistent download cache: when COSMOS_DOWNLOAD_CACHE_DIR is set,
input-asset URLs are cached there keyed by URL and reused across runs; when
unset, behavior is unchanged (throwaway temp dir). The asset URLs are pinned to
a content commit (.../raw/<sha>/), so the URL fully identifies the bytes -- a
cache hit skips the network entirely, so the flaky GitHub-raw 504s are hit at
most once per asset (the backoff retry is the cold-cache fallback). Concurrent
workers download to a temp file and atomically os.replace it into the cache.
Also replaces the leaking TemporaryDirectory(delete=False) with mkdtemp.

Enable it in the unittest CI job via $RUNNER_WORKSPACE/cosmos_input_cache --
outside the repo tree, so the cleanup step doesn't wipe it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
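
As an illustration of the cache described in this commit, the sketch below keys files by URL and publishes them with `os.replace`. It is a minimal sketch under stated assumptions, not the PR's implementation: `cached_download`, `cache_path_for`, the `fetch(url, dest)` callback, and the SHA-256 file naming are all hypothetical. Only the `COSMOS_DOWNLOAD_CACHE_DIR` variable, the temp-file-then-`os.replace` pattern, and the `mkdtemp` fallback come from the commit message.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional


def cache_path_for(url: str, cache_dir: Path) -> Path:
    """Map a URL to a stable cache file name (SHA-256 keying is an assumption)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    suffix = Path(url.split("?", 1)[0]).suffix  # keep the extension, e.g. ".mp4"
    return cache_dir / f"{digest}{suffix}"


def cached_download(
    url: str,
    fetch: Callable[[str, Path], None],
    cache_dir: Optional[str] = None,
) -> Path:
    """Return a local path for url, reusing a cached copy when the cache is enabled."""
    if cache_dir is None:
        cache_dir = os.environ.get("COSMOS_DOWNLOAD_CACHE_DIR")
    if not cache_dir:
        # Cache off: download into a throwaway temp dir, as before.
        dest = Path(tempfile.mkdtemp()) / (Path(url).name or "download")
        fetch(url, dest)
        return dest

    root = Path(cache_dir)
    root.mkdir(parents=True, exist_ok=True)
    final = cache_path_for(url, root)
    if final.exists():
        return final  # cache hit: no network access

    # Download to a temp file in the same directory, then publish it atomically
    # so concurrent workers never observe a partially written file.
    fd, tmp_name = tempfile.mkstemp(dir=root, suffix=".part")
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        fetch(url, tmp)
        os.replace(tmp, final)
    finally:
        tmp.unlink(missing_ok=True)  # no-op after a successful replace
    return final
```

Keying by URL alone is safe here only because, as the commit message notes, the asset URLs are pinned to a content commit. If two workers race on a cold cache, both download and the last `os.replace` wins; the bytes are identical, so either result is valid.
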
action/forward_dynamics in nano_inference_smoke_test download the same
GitHub-raw input assets as the unittest job, so point inference-smoke at the
same persistent $RUNNER_WORKSPACE/cosmos_input_cache. Both jobs (and runs across
PRs on the same self-hosted runner) now share one warm cache. training-smoke
(text2image) and the regression jobs (HF datasets) don't fetch these inputs, so
they're left unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Trim the comments added for the download retry/cache to one-liners; no behavior
change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
test_download_file covers the download/symlink/meta mechanics, which assume each
download is independent (force-re-download lands at a fresh path). The opt-in
cache (COSMOS_DOWNLOAD_CACHE_DIR, set in the CI job) dedups by URL and broke that
assumption. Disable the cache within the test via monkeypatch so it's
deterministic regardless of the ambient CI env.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
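
One way to disable the cache inside a test, sketched with pytest's `monkeypatch` fixture. The helper `download_cache_dir` and the test name are hypothetical stand-ins; the actual `test_download_file` may patch a different target. The sketch only assumes that the cache is switched on by the `COSMOS_DOWNLOAD_CACHE_DIR` environment variable.

```python
import os
from typing import Optional


def download_cache_dir() -> Optional[str]:
    """Hypothetical helper: the configured cache dir, or None when caching is off."""
    return os.environ.get("COSMOS_DOWNLOAD_CACHE_DIR") or None


def test_download_ignores_ambient_cache(monkeypatch):
    # Remove the variable even if the CI job exported it; pytest restores
    # the original environment when the test finishes.
    monkeypatch.delenv("COSMOS_DOWNLOAD_CACHE_DIR", raising=False)
    assert download_cache_dir() is None
    # ... the download/symlink/meta assertions would follow here ...
```

Passing `raising=False` keeps the test from failing on machines where the variable was never set, so it behaves the same locally and in CI.
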
@lfengad lfengad enabled auto-merge (squash) June 8, 2026 12:53
@lfengad lfengad merged commit 3addb71 into main Jun 9, 2026
7 checks passed
@lfengad lfengad deleted the colocated-tests-ci branch June 9, 2026 06:54
rahul-steiger-nv pushed a commit to rahul-steiger-nv/cosmos-framework that referenced this pull request Jun 15, 2026