@hanouticelina hanouticelina commented Oct 13, 2025

Resolves #3432.

The goal of this PR is to completely revamp the hf cache CLI, taking inspiration mainly from the Docker CLI.

The following DX is taken from the related issue mentioned above:


hf cache ls

List all cached repos

Roughly equivalent to the current hf cache scan

> hf cache ls

ID                          SIZE     LAST_ACCESSED LAST_MODIFIED REFS         
--------------------------- -------- ------------- ------------- ------------ 
dataset/nyu-mll/glue          157.4M 2 hours ago   2 hours ago   main, script 
model/LiquidAI/LFM2-VL-1.6B     3.2G 2 hours ago   2 hours ago   main         
model/microsoft/UserLM-8b      32.1G 2 hours ago   2 hours ago   main     

List all revisions

Equivalent of current hf cache scan --verbose

> hf cache ls --revisions

ID                          REVISION                                 SIZE     LAST_MODIFIED REFS   
--------------------------- ---------------------------------------- -------- ------------- ------ 
dataset/nyu-mll/glue        06983ca6c77a72be60a85e65634e385b5cbe995c    91.9K 2 hours ago   script 
dataset/nyu-mll/glue        bcdcba79d07bc864c1c254ccfcedcce55bcc9a8c   157.3M 2 hours ago   main   
model/LiquidAI/LFM2-VL-1.6B 057c4ae6a3cd3d09ac3e8e2a2a46dae6d9baf4c4     3.2G 2 hours ago   main   
model/LiquidAI/LFM2-VL-1.6B c9a05f855fe52b0e7aab40f835296693cdca20f0     3.2G 2 hours ago          
model/microsoft/UserLM-8b   be8f2069189bdf443e554c24e488ff3ff6952691    32.1G 2 hours ago   main 

Filter by total size

> hf cache ls --filter "size>3G"

ID                          SIZE     LAST_ACCESSED LAST_MODIFIED REFS 
--------------------------- -------- ------------- ------------- ---- 
model/LiquidAI/LFM2-VL-1.6B     3.2G 2 hours ago   2 hours ago   main 
model/microsoft/UserLM-8b      32.1G 2 hours ago   2 hours ago   main 

The filters are case insensitive.
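
As a rough illustration of how a value such as "3G" could be turned into bytes, here is a minimal sketch. The function name `parse_size_sketch` is hypothetical and is not the helper shipped in this PR. It assumes decimal units, since the listings above display 3174603148 bytes as 3.2G, and it lowercases the input so that "3G" and "3g" behave the same.

```python
import re

# Decimal multipliers (assumption): the listings show 157432781 bytes as "157.4M".
_SIZE_UNITS = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_size_sketch(value: str) -> int:
    """Convert a string such as '3G', '1.5m' or '1000000' into a number of bytes."""
    # Lowercasing first makes the parser case insensitive.
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kmgt]?)b?\s*", value.lower())
    if match is None:
        raise ValueError(f"Invalid size: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _SIZE_UNITS[unit])


if __name__ == "__main__":
    print(parse_size_sketch("3G"))
    print(parse_size_sketch("1000000"))
```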

Filter by last modified or accessed

Docker can handle many time representations ("7d", "2024-05-01", a timestamp, ISO date format, timezones, etc.). In practice, it's already good if we can handle relative durations with the units s, m, h, d, w, mo and y (e.g. "10s", "7d", "18mo"), plus timestamps.

hf cache ls --filter "modified>7d"
hf cache ls --filter "accessed>1y"
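
A minimal sketch of how such relative durations could be parsed and compared, under stated assumptions: the names `parse_duration_sketch` and `older_than` are hypothetical, a month is treated as 30 days and a year as 365 days, and "accessed>1y" is read as "last accessed more than one year ago". Timestamp inputs are left out.

```python
import re

# Seconds per unit. 30-day months and 365-day years are assumptions of this sketch.
_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "w": 7 * 86400,
    "mo": 30 * 86400,
    "y": 365 * 86400,
}


def parse_duration_sketch(value: str) -> int:
    """Convert a string such as '7d', '18mo' or '1y' into a number of seconds."""
    # "mo" is listed before the single-letter units so that "18mo" is not read as minutes.
    match = re.fullmatch(r"\s*(\d+)\s*(mo|[smhdwy])\s*", value.lower())
    if match is None:
        raise ValueError(f"Invalid duration: {value!r}")
    number, unit = match.groups()
    return int(number) * _SECONDS[unit]


def older_than(timestamp: float, threshold: str, now: float) -> bool:
    """Return True if `timestamp` lies more than `threshold` before `now`."""
    return now - timestamp > parse_duration_sketch(threshold)
```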

Filter by repo type

hf cache ls --filter "type=dataset"

Combine filters

e.g. "give me all models of at least 1MB that have not been accessed for a year"

hf cache ls --filter "type=model" --filter "size>1000000" --filter "accessed>1y"

Filters are processed as logical AND. Let's not support "OR".
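
The AND semantics can be sketched as follows. This is an illustrative assumption, not the code from this PR: `compile_filter` and `apply_filters` are hypothetical names, only the `type` and `size` keys are handled, sizes are plain byte counts, and the operator set (=, >, <) is assumed. The sample records reuse the ids and sizes from the JSON output shown in this description.

```python
import re
from typing import Any, Callable, Dict, List

Repo = Dict[str, Any]
Predicate = Callable[[Repo], bool]


def compile_filter(expr: str) -> Predicate:
    """Turn one 'key<op>value' expression into a predicate over a repo record."""
    match = re.fullmatch(r"\s*([a-z_]+)\s*(=|>|<)\s*(\S+)\s*", expr.lower())
    if match is None:
        raise ValueError(f"Invalid filter expression: {expr!r}")
    key, op, value = match.groups()
    if key == "type":
        if op != "=":
            raise ValueError(f"Only '=' is supported for 'type' filters. Got '{op}'.")
        return lambda repo: repo["repo_type"] == value
    if key == "size":
        threshold = int(value)  # plain byte counts only in this sketch
        if op == ">":
            return lambda repo: repo["size_on_disk"] > threshold
        if op == "<":
            return lambda repo: repo["size_on_disk"] < threshold
        raise ValueError(f"Only '>' and '<' are supported for 'size' filters. Got '{op}'.")
    raise ValueError(f"Unknown filter key: {key!r}")


def apply_filters(repos: List[Repo], exprs: List[str]) -> List[Repo]:
    """Keep the repos that satisfy every expression (logical AND)."""
    predicates = [compile_filter(expr) for expr in exprs]
    return [repo for repo in repos if all(pred(repo) for pred in predicates)]


REPOS: List[Repo] = [
    {"id": "dataset/nyu-mll/glue", "repo_type": "dataset", "size_on_disk": 157432781},
    {"id": "model/LiquidAI/LFM2-VL-1.6B", "repo_type": "model", "size_on_disk": 3174603148},
    {"id": "model/microsoft/UserLM-8b", "repo_type": "model", "size_on_disk": 32138410181},
]

if __name__ == "__main__":
    for repo in apply_filters(REPOS, ["type=model", "size>4000000000"]):
        print(repo["id"])
```

Because each expression compiles to an independent predicate, adding a new key only needs a new branch in `compile_filter`; the AND combination itself stays a single `all(...)`.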

Quiet mode: print only ids

hf cache ls --filter "accessed>1y" -q
hf cache ls --filter "accessed>1y" --quiet

Custom format

The default output format is a table, but one may want CSV or JSON instead. Docker handles custom templates, but we don't need that much flexibility.

> hf cache ls --format json
[
  {
    "id": "dataset/nyu-mll/glue",
    "repo_id": "nyu-mll/glue",
    "repo_type": "dataset",
    "size_on_disk": 157432781,
    "size_on_disk_str": "157.4M",
    "last_accessed": 1760362335.1002831,
    "last_accessed_str": "2 hours ago",
    "last_modified": 1760362335.2425795,
    "last_modified_str": "2 hours ago",
    "refs": [
      "main",
      "script"
    ]
  },
  {
    "id": "model/LiquidAI/LFM2-VL-1.6B",
    "repo_id": "LiquidAI/LFM2-VL-1.6B",
    "repo_type": "model",
    "size_on_disk": 3174603148,
    "size_on_disk_str": "3.2G",
    "last_accessed": 1760360737.6838264,
    "last_accessed_str": "2 hours ago",
    "last_modified": 1760360737.822951,
    "last_modified_str": "2 hours ago",
    "refs": [
      "main"
    ]
  },
  {
    "id": "model/microsoft/UserLM-8b",
    "repo_id": "microsoft/UserLM-8b",
    "repo_type": "model",
    "size_on_disk": 32138410181,
    "size_on_disk_str": "32.1G",
    "last_accessed": 1760360127.547946,
    "last_accessed_str": "2 hours ago",
    "last_modified": 1760360395.6360602,
    "last_modified_str": "2 hours ago",
    "refs": [
      "main"
    ]
  }
]

> hf cache ls --format csv
id,repo_type,size_bytes,size,last_accessed,last_accessed_str,last_modified,last_modified_str,refs
dataset/nyu-mll/glue,dataset,157432781,157.4M,1760362335.1002831,2 hours ago,1760362335.2425795,2 hours ago,main; script
model/LiquidAI/LFM2-VL-1.6B,model,3174603148,3.2G,1760360737.6838264,2 hours ago,1760360737.822951,2 hours ago,main
model/microsoft/UserLM-8b,model,32138410181,32.1G,1760360127.547946,2 hours ago,1760360395.6360602,2 hours ago,main
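
A minimal sketch of how the same rows could be serialized to either format using only the standard library. The `render` function is hypothetical. It assumes each row is a dict with a list-valued "refs" field, and it joins refs with "; " in CSV as in the sample output above. Unlike that sample, it reuses the same column names for both formats.

```python
import csv
import io
import json
from typing import Any, Dict, List


def render(rows: List[Dict[str, Any]], fieldnames: List[str], fmt: str) -> str:
    """Serialize rows as JSON or CSV. `fieldnames` fixes the CSV column order."""
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        for row in rows:
            # CSV cells are flat, so the list of refs is joined into one string.
            writer.writerow({**row, "refs": "; ".join(row["refs"])})
        return buffer.getvalue()
    raise ValueError(f"Unsupported format: {fmt!r}")


if __name__ == "__main__":
    rows = [
        {"id": "dataset/nyu-mll/glue", "size_on_disk": 157432781, "refs": ["main", "script"]},
        {"id": "model/microsoft/UserLM-8b", "size_on_disk": 32138410181, "refs": ["main"]},
    ]
    print(render(rows, ["id", "size_on_disk", "refs"], "csv"))
```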

hf cache rm

Delete specific revision(s)

hf cache rm 9ab9e76e2b09f9f29ea2d56aa5bd139e4445c59e
hf cache rm 9ab9e76e2b09f9f29ea2d56aa5bd139e4445c59e 1bb3f918c345c9d351dd5434c6fda5153506f8c5

Delete specific repo(s)

hf cache rm model/meta-llama/Llama-2-70b-hf
hf cache rm model/meta-llama/Llama-2-70b-hf dataset/facebook/wiki_dpr

Delete repos based on a query

As with Docker, we rely on quiet mode to feed the result of a query into the command,
e.g. "delete all repos not accessed in the last year":

hf cache rm $(hf cache ls --filter "accessed>1y" -q)

or on unix:

hf cache ls --filter "accessed>1y" -q | xargs hf cache rm

Confirmation step / dry-run

It would be good to have a confirmation step by default.

hf cache rm ... -y

Alternatively (or in addition), we could have a dry-run mode:

hf cache rm ... --dry-run
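
How the two flags could interact is sketched below. This is only an illustration under assumptions: `remove` is a hypothetical function, the dry-run and success messages are made up, and the prompt-reading function is injected so the flow can be exercised without a terminal.

```python
from typing import Callable, List


def remove(
    targets: List[str],
    delete: Callable[[str], None],
    *,
    yes: bool = False,
    dry_run: bool = False,
    ask: Callable[[str], str] = input,
) -> str:
    """Delete `targets` after an optional confirmation and return a status message."""
    if dry_run:
        # Nothing is deleted and no question is asked in dry-run mode.
        return f"Dry run: would delete {len(targets)} item(s)."
    if not yes:
        answer = ask("Proceed with deletion? [y/N]: ")
        # Anything other than an explicit yes keeps the default answer (No).
        if answer.strip().lower() not in {"y", "yes"}:
            return "Aborted!"
    for target in targets:
        delete(target)
    return f"Deleted {len(targets)} item(s)."


if __name__ == "__main__":
    deleted: List[str] = []
    print(remove(["model/meta-llama/Llama-2-70b-hf"], deleted.append, dry_run=True))
```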

hf cache prune

Delete all detached revisions

When downloading the same repo over time, the user might get several revisions in cache. Revisions can be linked to git refs (e.g. main, refs/pr/2, etc.) or "detached". Pruning the cache will delete all revisions not explicitly bound to a reference.

In practice, if a user has always downloaded from main, all revisions will be deleted except the last one.

hf cache prune
About to delete 18 unreferenced revisions (2.4 GB total)
Proceed? [y/N]:
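
The selection rule described above ("no ref points at this revision") can be sketched with a small stand-in data model. The `Revision` dataclass and `detached` function are hypothetical and only mirror the columns of the hf cache ls --revisions listing. They are not the library's classes.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Revision:
    """Hypothetical stand-in for one cached revision."""

    repo_id: str
    commit_hash: str
    refs: FrozenSet[str]


def detached(revisions: List[Revision]) -> List[Revision]:
    """Return the revisions that no ref (main, refs/pr/2, ...) points at."""
    return [rev for rev in revisions if not rev.refs]


# Rows taken from the `hf cache ls --revisions` listing in this description.
REVISIONS = [
    Revision("model/LiquidAI/LFM2-VL-1.6B", "057c4ae6a3cd3d09ac3e8e2a2a46dae6d9baf4c4", frozenset({"main"})),
    Revision("model/LiquidAI/LFM2-VL-1.6B", "c9a05f855fe52b0e7aab40f835296693cdca20f0", frozenset()),
    Revision("model/microsoft/UserLM-8b", "be8f2069189bdf443e554c24e488ff3ff6952691", frozenset({"main"})),
]

if __name__ == "__main__":
    for rev in detached(REVISIONS):
        print(rev.repo_id, rev.commit_hash)
```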

Confirmation step / dry-run

Same as for hf cache rm.

hf cache prune -y
hf cache prune --dry-run

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Wauplin commented Oct 14, 2025

Encountered a weird bug while testing the CLI:

(.venv) ➜  huggingface_hub git:(revamp-hf-cache) ✗ hf cache ls --filter "accessed>18mo" -f "size>1g" -q                     
model/google/gemma-7b-it
model/pyp1/VoiceCraft_giga330M
(.venv) ➜  huggingface_hub git:(revamp-hf-cache) ✗ hf cache ls --filter "accessed>18mo" -f "size>1g" -q | xargs hf cache rm 
About to delete 2 repo(s) totalling 6.3G.
  - model/google/gemma-7b-it (entire repo)
  - model/pyp1/VoiceCraft_giga330M (entire repo)
Proceed with deletion? [y/N]: Aborted!

I suspect xargs passes a trailing Enter to the prompt's input, which aborts the deletion.
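
The suspected behaviour can be reproduced without xargs. The sketch below uses a hypothetical `confirm` helper and assumes that the command started by xargs sees either an empty standard input or a bare newline. In both cases a [y/N] prompt falls back to its default answer, which is No.

```python
import io
from typing import TextIO


def confirm(stream: TextIO) -> bool:
    """Read one answer from `stream` and return True only for an explicit yes."""
    # readline() returns "" at end of input and "\n" for a bare Enter.
    line = stream.readline()
    return line.strip().lower() in {"y", "yes"}


if __name__ == "__main__":
    print(confirm(io.StringIO("")))     # exhausted or empty stdin
    print(confirm(io.StringIO("\n")))   # a bare Enter
    print(confirm(io.StringIO("y\n")))  # an explicit yes
```

If that is indeed the cause, passing -y on the piped command, as done elsewhere in this thread, would skip the prompt entirely.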

@Wauplin Wauplin left a comment

Thanks for working on this! To be fair it was quite a long one to review ^^ Next time I think it'd be great to split it in smaller ones and leave the optional stuff for later (like the --format parameter or implementing all the filters at once). It would make things easier to review/iterate on. Anyway, now that it's here, let's keep everything 😄

@hanouticelina hanouticelina marked this pull request as ready for review October 16, 2025 08:53
@hanouticelina hanouticelina requested a review from Wauplin October 16, 2025 08:54

@Wauplin Wauplin left a comment

Very nice job! 🔥 I've left some comments but only minor ones so feel free to merge once addressed!


I gave it a try, which was not unpleasant 😄

hf cache rm $(hf cache ls --filter "accessed>1y" -q) -y

Cache deletion done. Saved 137.4G.
Deleted 576 repo(s) and 1343 revision(s); freed 137.4G.

Comment on lines +156 to +158
def format_cache_repo_id(repo: CachedRepoInfo) -> str:
    """Return the canonical `type/id` string used across cache CLI outputs."""
    return f"{repo.repo_type}/{repo.repo_id}"

(nit) could be a property of CachedRepoInfo

expected = value_raw.lower()

if op != "=":
    raise ValueError("Only '=' is supported for 'type' filters.")

Suggested change
-    raise ValueError("Only '=' is supported for 'type' filters.")
+    raise ValueError(f"Only '=' is supported for 'type' filters. Got '{op}'.")

Comment on lines +267 to +268
if include_revisions:
    table_rows: List[List[str]] = []

Suggested change
- if include_revisions:
-     table_rows: List[List[str]] = []
+ table_rows: List[List[str]]
+ if include_revisions:

Is the table_rows: List[List[str]] = [] line there only to work around a type-hint issue? If so, would this suggestion work? Otherwise it's fine to keep it as is ^^

Comment on lines +305 to +306
print()
summary = f"Found {repo_count} repo(s) for a total of {revision_count} revision(s) and {_format_size(total_size)} on disk."

Suggested change
- print()
- summary = f"Found {repo_count} repo(s) for a total of {revision_count} revision(s) and {_format_size(total_size)} on disk."
+ summary = f"\nFound {repo_count} repo(s) for a total of {revision_count} revision(s) and {_format_size(total_size)} on disk."

(nit)

#
# Once you've manually reviewed this file, please confirm deletion in the terminal. This file will be automatically removed once done.
# ------------
@cache_cli.command(help="Remove detached revisions from the cache.")

Suggested change
- @cache_cli.command(help="Remove detached revisions from the cache.")
+ @cache_cli.command()

(same)

        ),
    ] = False,
) -> None:
    try:

Suggested change
-     try:
+     """Remove detached revisions from the cache."""
+     try:

thanks for fixing all the tests!

Can you add a ./tests/test_utils_parsing.py module with a few tests for parse_size and parse_duration? (you can take tests from https://gist.github.com/Wauplin/a7a385db01ddbf0067325ec02bc35ce0)

msg=f"Wrong formatting for {size} == '{expected}'",
)

def test_format_timesince(self) -> None:

and move this test to ./tests/test_utils_parsing.py?
