@alexcbb alexcbb commented Jun 17, 2025

On clusters, internet access is sometimes unavailable, and the get_tokenizer function then raises an error. This commit adds an option that fixes the issue by allowing a local path containing the vocabulary to be provided.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 17, 2025
Comment on lines +73 to +75
response = requests.get(url)
response.raise_for_status()
content = response.content
Contributor

@cijose Do you remember why requests was pulled in as a dependency? Couldn't we simply use built-in Python modules instead?

    with urllib.request.urlopen(url) as f:
        content = f.read()
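
A slightly fuller sketch of the standard-library variant suggested above, for illustration only. The helper name `fetch_bytes` and the `timeout` default are assumptions, not part of the PR. Note that `urllib.request.urlopen` already raises `urllib.error.HTTPError` for 4xx/5xx responses, so no separate `raise_for_status()` call is needed.

```python
import urllib.error
import urllib.request


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download `url` and return the raw body, using only the standard library.

    `fetch_bytes` is a hypothetical helper name used for this sketch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as f:
            return f.read()
    except urllib.error.HTTPError as e:
        # urlopen raises HTTPError for HTTP error statuses such as 403,
        # which plays the role of requests' raise_for_status().
        raise RuntimeError(f"Download of {url} failed with HTTP status {e.code}") from e
```

With this, `content = fetch_bytes(url)` would stand in for the three `requests` lines under review.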

response = requests.get(url)
response.raise_for_status()
file_buf = BytesIO(response.content)
if not local_path:
Contributor

Use urllib.parse.urlparse(bpe_path).scheme to distinguish between a URL with a scheme and something that looks like an actual local path?
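
A minimal sketch of what this suggestion could look like, not the PR's actual code. The names `is_remote`, `read_bpe_bytes` and `DEFAULT_BPE_URL` are hypothetical, and the URL is a placeholder. Only `http` and `https` are treated as remote here, because `urlparse` reports a one-letter scheme for Windows paths such as `C:\data\vocab.txt`, so testing for "any scheme" would misclassify them.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse
from urllib.request import urlopen

# Placeholder for illustration only; the real vocabulary URL is not shown here.
DEFAULT_BPE_URL = "https://example.com/bpe_vocab.txt.gz"


def is_remote(bpe_path_or_url: str) -> bool:
    """Return True only for http(s) URLs; everything else is treated as a local path."""
    return urlparse(bpe_path_or_url).scheme in ("http", "https")


def read_bpe_bytes(bpe_path_or_url: Optional[str] = None) -> bytes:
    """Read the vocabulary bytes from a local path, or download them from a URL."""
    source = bpe_path_or_url or DEFAULT_BPE_URL
    if is_remote(source):
        # Network access is needed only in this branch.
        with urlopen(source) as f:
            return f.read()
    # Offline-friendly branch: no network request is made.
    return Path(source).expanduser().read_bytes()
```

On a node without internet access, a call such as `read_bpe_bytes("/path/to/vocab")` would then never touch the network.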



-def get_tokenizer():
+def get_tokenizer(local_path=None):
Contributor

Minor: def get_tokenizer(bpe_path_or_url: Optional[str] = None)

@patricklabatut
Contributor

the get_tokenizer function rise an error

Which error do you get?

@alexcbb
Author

alexcbb commented Jun 17, 2025

the get_tokenizer function rise an error

Which error do you get?

It just returns a 403 error: the GPU nodes on the cluster are not connected to the internet (for security reasons), so they cannot request an online URL.

