Set up HF_HUB_CACHE for OSSCI Machines #926

Open
renxida opened this issue Feb 6, 2025 · 1 comment

renxida commented Feb 6, 2025

Our integration tests currently download weights for many Hugging Face models using hf_datasets.py.

Currently, caching is not working on:

  • the new ossci cluster, labeled as linux-mi300-gpu-1
  • the azure-cpubuilder-linux-scale

This causes model weight re-downloads of 20+ minutes on every CI run for shortfin llm.

To address this, we need a directory that:

  • is writeable
  • has enough space to hold a significant portion of the models in /home/xidaren2/shark-ai/sharktank/sharktank/utils/hf_datasets.py (my estimate is on the order of hundreds of GBs)
  • supports symlinking to the CI working dir
  • (optionally) is periodically cleared or downsized to evict old models
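The first three requirements could be checked on a runner with something like the sketch below. It is only an illustration: check_cache_dir and its parameters are hypothetical names, not part of shark-ai or huggingface_hub, and the symlink check only confirms that symlinks can be created on the candidate filesystem, not that the CI working dir specifically can link into it.

```python
import os
import shutil
import tempfile


def check_cache_dir(path, min_free_bytes):
    """Return pass/fail results for a candidate cache directory (hypothetical helper)."""
    results = {"writable": False, "enough_space": False, "symlinks": False}

    # Writable: try to create a scratch file inside the directory.
    # TemporaryFile removes the file again when the block exits.
    try:
        with tempfile.TemporaryFile(dir=path):
            results["writable"] = True
    except OSError:
        # Missing directory or no write permission: the other checks are moot.
        return results

    # Enough space: compare free bytes on the filesystem holding the directory.
    results["enough_space"] = shutil.disk_usage(path).free >= min_free_bytes

    # Symlinks: create a file and a link to it in a throwaway subdirectory.
    with tempfile.TemporaryDirectory(dir=path) as scratch:
        target = os.path.join(scratch, "target")
        link = os.path.join(scratch, "link")
        open(target, "w").close()
        try:
            os.symlink(target, link)
            results["symlinks"] = os.path.islink(link)
        except (OSError, NotImplementedError):
            results["symlinks"] = False

    return results
```

Running it with min_free_bytes set to a few hundred GB (for example 300 * 1024**3, which is my guess and not a measured number) on each runner would show which requirement is failing there.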

We then need to set the HF_HUB_CACHE environment variable to this directory for the CI tasks that use it.
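As a rough sketch of the environment variable part, a CI helper script could build the environment for the test process like this. env_with_hub_cache is a hypothetical name and the cache path in the usage note is a placeholder, not the real location on the OSSCI machines. As far as I know, huggingface_hub reads HF_HUB_CACHE when it is imported, so the variable needs to be in place before the test process starts.

```python
import os


def env_with_hub_cache(cache_dir, base_env=None):
    """Return a copy of the environment with HF_HUB_CACHE pointing at cache_dir.

    Hypothetical helper: base_env defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Use an absolute path so the setting does not depend on the CI working dir.
    env["HF_HUB_CACHE"] = os.path.abspath(cache_dir)
    return env
```

The result can be passed as the env argument of subprocess.run when launching the integration tests, e.g. env_with_hub_cache("/path/to/shared/hf_hub_cache"). Setting the variable directly in the CI job definition would have the same effect.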

This would speed up the sharktank and shortfin integration tests considerably. In my previous experiments, caching the Hugging Face hub downloads let a 40-minute test complete in 6 minutes.

renxida assigned renxida and Eliasj42 and unassigned renxida Feb 7, 2025

renxida commented Feb 7, 2025

List of CI files that depend on this:
