Set up HF_HUB_CACHE for OSSCI Machines #926

renxida · 2025-02-06T18:38:24Z

Our integration tests currently download weights for many Hugging Face models using hf_datasets.py.

Caching currently does not work on:

the new OSSCI cluster, labeled linux-mi300-gpu-1
the azure-cpubuilder-linux-scale runners

This causes model weight re-downloads of 20+ minutes on every CI run for shortfin llm.

To address this, we need a directory that:

is writable
has enough space to hold a significant portion of the models in /home/xidaren2/shark-ai/sharktank/sharktank/utils/hf_datasets.py (my estimate is on the order of hundreds of GB)
supports symlinking to the CI working dir
(optionally) is periodically cleared or downsized to evict old models
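
The first three requirements could be checked on a runner with a small script before the directory is adopted. This is only a sketch: the function name `check_cache_dir` and the size threshold are hypothetical, and the symlink probe only creates a link inside the candidate directory itself. It does not prove that links between the cache and the CI working dir will work, which also depends on how that dir is mounted.

```python
import os
import shutil
import tempfile


def check_cache_dir(path, min_free_bytes):
    """Return a list of problems found with a candidate cache directory.

    Hypothetical helper, not part of shark-ai. An empty list means the
    basic checks passed.
    """
    if not os.path.isdir(path):
        return [f"{path} is not an existing directory"]

    problems = []
    # Writing into a directory needs both write and search permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable")

    # Free space on the filesystem that holds the directory.
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(f"only {free} bytes free, need {min_free_bytes}")

    # Probe symlink support by creating one in a scratch subdirectory,
    # which is removed again when the with-block exits.
    try:
        with tempfile.TemporaryDirectory(dir=path) as scratch:
            target = os.path.join(scratch, "target")
            open(target, "w").close()
            os.symlink(target, os.path.join(scratch, "link"))
    except OSError as exc:
        problems.append(f"symlink probe failed: {exc}")

    return problems
```

For example, `check_cache_dir("/some/cache/dir", 300 * 1024**3)` would report whether a (placeholder) directory has at least 300 GiB free; the actual threshold depends on which models the tests pull.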

This directory should then be set as the HF_HUB_CACHE environment variable for the CI tasks that use it.
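
As an illustration of what setting the variable means for a test process, here is a minimal sketch that builds an environment with HF_HUB_CACHE pointing at the shared directory. The helper name `env_with_hf_cache` is hypothetical. In the workflows themselves the variable would more likely be set in the job configuration; the point here is only that it has to be in the environment before the test process starts, since huggingface_hub reads it when the library is imported.

```python
import os


def env_with_hf_cache(cache_dir, base_env=None):
    """Return a copy of the environment with HF_HUB_CACHE set.

    Hypothetical helper. `base_env` defaults to the current process
    environment; the input mapping is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Use an absolute path so the setting does not depend on the
    # working directory of whichever CI step reads it.
    env["HF_HUB_CACHE"] = os.path.abspath(cache_dir)
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching the tests, so every download made through huggingface_hub in that process lands in the shared cache.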

This would speed up the sharktank and shortfin integration tests substantially. In my previous experiments, caching the Hugging Face hub downloads let a 40-minute test complete in 6 minutes.


renxida · 2025-02-07T17:18:31Z

List of CI files that depend on this:

sharktank data-dependent tests: https://github.com/nod-ai/shark-ai/blob/main/.github/workflows/ci-sharktank.yml
shortfin llm integration tests: https://github.com/nod-ai/shark-ai/blob/main/.github/workflows/pkgci_shark_ai.yml

renxida assigned renxida and Eliasj42 and unassigned renxida Feb 7, 2025

renxida mentioned this issue Feb 7, 2025

[CI][sharktank] Move Sharktank Data-Dependent Tests to OSSCI Cluster #932

Draft
