Enable local loading of tokenizer for dino.txt #531
Conversation
On clusters, internet access is sometimes not possible and the get_tokenizer function raises an error. This commit adds an option to fix the issue by allowing a local path containing the vocabulary to be provided.
response = requests.get(url)
response.raise_for_status()
content = response.content
@cijose Do you remember why requests was pulled as dependency? Couldn't we simply use builtin Python modules instead?
with urllib.request.urlopen(url) as f:
content = f.read()
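A minimal sketch of the suggested stdlib-only swap, assuming the caller only needs the raw response bytes (the helper name is hypothetical, not from the repo):

```python
import urllib.request

def download_bytes(url: str) -> bytes:
    # Hypothetical helper: fetch raw bytes using only the standard
    # library, avoiding the extra `requests` dependency. urlopen
    # raises urllib.error.HTTPError on 4xx/5xx, so no explicit
    # raise_for_status() call is needed.
    with urllib.request.urlopen(url) as f:
        return f.read()
```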
response = requests.get(url)
response.raise_for_status()
file_buf = BytesIO(response.content)
if not local_path:
Use urllib.parse.urlparse(bpe_path).scheme to distinguish between a URL with a scheme and something that looks like an actual local path?
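A sketch of that scheme check (the helper name and example paths are illustrative only):

```python
from urllib.parse import urlparse

def is_url(path_or_url: str) -> bool:
    # A scheme of "http"/"https" marks a remote URL. Checking for
    # these schemes explicitly (rather than any non-empty scheme)
    # also keeps Windows drive letters like "C:\..." — which parse
    # as a one-letter scheme — classified as local paths.
    return urlparse(path_or_url).scheme in ("http", "https")
```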
- def get_tokenizer():
+ def get_tokenizer(local_path=None):
Minor: def get_tokenizer(bpe_path_or_url: Optional[str] = None)
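Putting the review suggestions together, one hypothetical shape for the function. The default URL, the gzip-compressed vocabulary format, and the returned value are assumptions for illustration, not the repo's actual names or behavior:

```python
import gzip
import urllib.request
from typing import Optional
from urllib.parse import urlparse

# Placeholder default; the real repository pins its own vocabulary URL.
DEFAULT_BPE_URL = "https://example.com/bpe_vocab.txt.gz"

def get_tokenizer(bpe_path_or_url: Optional[str] = None):
    # Resolve the vocabulary source: default URL, explicit URL,
    # or a local file path for air-gapped cluster nodes.
    target = bpe_path_or_url or DEFAULT_BPE_URL
    if urlparse(target).scheme in ("http", "https"):
        with urllib.request.urlopen(target) as f:
            content = f.read()
    else:
        # Local path: no network access required.
        with open(target, "rb") as f:
            content = f.read()
    # Assumed gzip-compressed, newline-separated vocabulary.
    vocab = gzip.decompress(content).decode("utf-8").split("\n")
    return vocab  # stand-in for constructing the real tokenizer object
```

With this shape, cluster users pass get_tokenizer("/path/to/bpe_vocab.txt.gz") and no request ever leaves the node.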
Which error do you get?
It's just returning a 403 error: the GPU nodes on the cluster are not connected to the internet (for security reasons), so they can't request an online URL.