Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve: Improve progress bar #431

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

hh-space-invader
Copy link
Contributor

@hh-space-invader hh-space-invader commented Dec 30, 2024

Problem:

  • If the user tried to download the model, then stopped the process, then tried to download again, it does not show progress bar.

Suggestion:

  • Create a metadata file after the download is complete, and load before download.

All Submissions:

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  • Does your submission pass the existing tests?
  • Have you added tests for your feature?
  • Have you installed pre-commit with pip3 install pre-commit and set up hooks with pre-commit install?

New models submission:

  • Have you added an explanation of why it's important to include this model?
  • Have you added tests for the new model? Were canonical values for tests computed via the original model?
  • Have you added the code snippet for how canonical values were computed?
  • Have you successfully ran tests with your changes locally?

fastembed/common/model_management.py Outdated Show resolved Hide resolved
Comment on lines +163 to +164
if not os.path.exists(os.path.join(result, model_file)):
raise FileNotFoundError("Couldn't download model from huggingface")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if any of the other required files is missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know, the blobs are downloaded first, then its extracted to .onnx and the required files. So if the .onnx exists, then the blobs extracted everything correctly. U can check that by deleting the snapshot folder and try to run the model. The snapshot folder will be extracted again and the .onnx (and other files) will exist.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this check is redundant, let's rely on hf here
also we have models without onnx files like Qdrant/bm25 and then it just checks existence of a directory

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we can also remove model_file variable then

fastembed/common/model_management.py Show resolved Hide resolved
fastembed/common/model_management.py Outdated Show resolved Hide resolved

if is_cached:
disable_progress_bars()
if snapshot_dir.exists() and metadata_file.exists():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if version on hf is different from the one we have locally, then we will hide the progress bar and then we will silently download the updated files
I think we could make a corresponding call to HfApi and check revision and commit hash

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can u explain it further ? Cuz as far as Andrey's said, he don't want to make a call to HFapi as it requires network. Do u mean that we can call it only while downloading on the first time and add revision to metadata ? And only call it when there;s network ?

Copy link
Contributor Author

@hh-space-invader hh-space-invader Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a _get_file_hash function to compute hash for each file and then later on checked with _verify_files_from_metadata if the version changed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's safe to make this call if local_files_only != True

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc in snapshot_download they just pull the whole repo info and compare revision's commit hahs

@hh-space-invader hh-space-invader requested a review from joein January 8, 2025 06:09
Comment on lines +163 to +164
if not os.path.exists(os.path.join(result, model_file)):
raise FileNotFoundError("Couldn't download model from huggingface")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this check is redundant, let's rely on hf here
also we have models without onnx files like Qdrant/bm25 and then it just checks existence of a directory

Comment on lines +163 to +164
if not os.path.exists(os.path.join(result, model_file)):
raise FileNotFoundError("Couldn't download model from huggingface")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we can also remove model_file variable then

Comment on lines +146 to +152
try: # there's network at least first time
url = hf_hub_url(hf_source_repo, file_path.name)
hf_metadata = get_hf_file_metadata(url)
metadata[str(file_path.relative_to(model_dir))] = {
"size": hf_metadata.size,
"commit_hash": hf_metadata.commit_hash,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we retrieve information about the whole repository instead of retrieving files one by one?


if is_cached:
disable_progress_bars()
if snapshot_dir.exists() and metadata_file.exists():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc in snapshot_download they just pull the whole repo info and compare revision's commit hahs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants