Conversation

@mnoukhov (Contributor) commented Nov 12, 2025

Copies the tokenizer, tokenizer_config, and config down to weka in order to do dataset preprocessing.


Note

When --model_name_or_path is a gs:// path, download tokenizer/config files to a local cache for preprocessing, and allow download_from_gs_bucket to accept multiple sources.

  • Open Instruct training (mason.py):
    • Detect --model_name_or_path with gs:// and download tokenizer.json, tokenizer_config.json, and config.json to a local cache ({auto_output_dir_path}/{whoami}/tokenizer_<hash>/), then substitute this path for dataset preprocessing (see the sketch after this list).
  • Utilities (open_instruct/utils.py):
    • Update download_from_gs_bucket to accept a single path or list of paths and create the destination directory before copy.
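
For concreteness, here is a minimal sketch of the mason.py flow described above. The function name, the choice of MD5 for the cache hash, and the exact control flow are assumptions pieced together from this summary and the review snippets further down; treat it as an illustration, not the PR's actual code.

import hashlib

from open_instruct.utils import download_from_gs_bucket

def cache_gs_tokenizer_files(filtered_command, auto_output_dir_path, whoami):
    # Hypothetical name. Locate the value that follows --model_name_or_path.
    try:
        model_name_idx = filtered_command.index("--model_name_or_path") + 1
        model_name_or_path = filtered_command[model_name_idx]
        if model_name_or_path.startswith("gs://"):
            # Cache under {auto_output_dir_path}/{whoami}/tokenizer_<hash>/;
            # the PR does not say which hash is used, so MD5 is assumed here.
            path_hash = hashlib.md5(model_name_or_path.encode()).hexdigest()[:8]
            local_cache_folder = f"{auto_output_dir_path}/{whoami}/tokenizer_{path_hash}/"
            download_from_gs_bucket(
                [
                    f"{model_name_or_path}/tokenizer.json",
                    f"{model_name_or_path}/tokenizer_config.json",
                    f"{model_name_or_path}/config.json",
                ],
                local_cache_folder,
            )
            # Substitute the local copy so dataset preprocessing never touches gs://.
            filtered_command[model_name_idx] = local_cache_folder
    except ValueError:
        # .index() raises ValueError when --model_name_or_path is not passed at all.
        pass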

Written by Cursor Bugbot for commit 98e7976.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @mnoukhov, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds support for dataset preprocessing of models stored exclusively on Google Cloud Storage. When a model path points to a GS bucket, the required tokenizer and configuration files are fetched and cached locally, integrating GS-hosted models into the existing data preparation workflow. This change broadens the system's compatibility with different model storage locations.

Highlights

  • Google Cloud Storage (GS) Model Support: The system now correctly handles model_name_or_path arguments that specify models located directly in Google Cloud Storage (GS) buckets.
  • Automated Tokenizer and Config Caching: For GS-hosted models, the necessary tokenizer.json, tokenizer_config.json, and config.json files are automatically downloaded and cached locally, ensuring dataset preprocessing can proceed without issues.
  • Improved GS Download Utility: The download_from_gs_bucket utility function has been enhanced to support downloading multiple files simultaneously and to automatically create the destination directory if it doesn't exist (see the sketch after this list).
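
As a sketch of that generalized utility: the gsutil invocation below is simplified (the real flags in open_instruct/utils.py may differ), and the live_subprocess_output stand-in only approximates the repo's helper of the same name.

import os
import subprocess

def live_subprocess_output(cmd):
    # Stand-in for the repo's helper, which streams the command's output live.
    subprocess.run(cmd, check=True)

def download_from_gs_bucket(src_paths, dest_path):
    # Accept a single path or a list of paths.
    if not isinstance(src_paths, list):
        src_paths = [src_paths]
    # Create the destination directory before the copy.
    os.makedirs(dest_path, exist_ok=True)
    cmd = ["gsutil", "cp"]
    cmd.extend(src_paths)
    cmd.append(dest_path)
    live_subprocess_output(cmd)
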
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot (Contributor) left a comment


Code Review

The pull request updates the data preprocessing logic to handle models stored in Google Cloud Storage by downloading necessary tokenizer files locally. The changes in open_instruct/utils.py to generalize download_from_gs_bucket are well-implemented. However, the new logic in mason.py has a critical bug that could lead to an IndexError if the --model_name_or_path argument is provided without a value or as the last argument. I've also included a suggestion to improve code maintainability.

Comment on lines +497 to +498
except ValueError:
pass
Contributor


critical

The try...except block only catches ValueError, but an IndexError can occur on line 480 if --model_name_or_path is the last argument in filtered_command. This would cause the script to crash. It's safer to catch both exceptions to handle this edge case gracefully.

Suggested change
except ValueError:
pass
except (ValueError, IndexError):
pass
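
To make the failure mode concrete, a reduced example (the command list is hypothetical):

# --model_name_or_path is the final token, so no value follows it.
filtered_command = ["python", "train.py", "--model_name_or_path"]

try:
    model_name_idx = filtered_command.index("--model_name_or_path") + 1
    model_name_or_path = filtered_command[model_name_idx]  # raises IndexError
except ValueError:
    pass  # only the "flag absent" case is caught; the IndexError escapes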

Contributor Author


If model_name_or_path is the last argument then we would want to crash, as that's a weird input.

Comment on lines +487 to +494
download_from_gs_bucket(
[
f"{model_name_or_path}/tokenizer.json",
f"{model_name_or_path}/tokenizer_config.json",
f"{model_name_or_path}/config.json",
],
local_cache_folder,
)
Contributor


medium

The list of files to download is hardcoded inside the download_from_gs_bucket call. To improve readability and maintainability, consider extracting this list into a named variable before the call. This makes it clearer what files are being downloaded. For example:

tokenizer_files = [
    "tokenizer.json",
    "tokenizer_config.json",
    "config.json",
]
download_from_gs_bucket(
    [f"{model_name_or_path}/{f}" for f in tokenizer_files],
    local_cache_folder,
)

Contributor Author


ehh, I think mine is fine as is

src_path,
dest_path,
]
if not isinstance(src_paths, list):
Collaborator


Let's change this to only take a list and force the callers to pass one in?
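
A sketch of the list-only signature being proposed; the type hints, simplified gsutil flags, and example paths are illustrative assumptions.

import os
import subprocess

def download_from_gs_bucket(src_paths: list[str], dest_path: str) -> None:
    # Callers must always pass a list, even for a single file.
    os.makedirs(dest_path, exist_ok=True)
    subprocess.run(["gsutil", "cp", *src_paths, dest_path], check=True)

# A caller with one file wraps it in a list:
download_from_gs_bucket(["gs://my-bucket/model/config.json"], "/tmp/tokenizer_cache/")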

]
if not isinstance(src_paths, list):
src_paths = [src_paths]
cmd.extend(src_paths)
Collaborator


Can you add some tests here? Let's mock live_subprocess_output so we capture the cmd it's called with and verify it against some known correct values. Should be a one-prompt change with Codex.
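
A sketch of such a test, assuming pytest-style tests and that both names are importable from open_instruct.utils; the exact cmd contents to assert depend on the real gsutil flags, so only the source and destination positions are checked here.

from unittest import mock

from open_instruct.utils import download_from_gs_bucket

def test_download_from_gs_bucket_cmd():
    with mock.patch("open_instruct.utils.live_subprocess_output") as mock_output:
        download_from_gs_bucket(["gs://bucket/model/config.json"], "/tmp/cache/")
    cmd = mock_output.call_args[0][0]
    # The source appears in the command, and the destination is the final argument.
    assert "gs://bucket/model/config.json" in cmd
    assert cmd[-1] == "/tmp/cache/"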


filtered_command[model_name_idx] = local_cache_folder
except ValueError:
pass
Collaborator


Can you rewrite this to make these changes:

  1. Add a comment saying when we get a ValueError.
  2. Change this to be a function, and return early when a condition is not true; then we have less nesting, which will make it easier to follow (see the sketch below).
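
A sketch of that refactor with guard clauses; the function name is hypothetical, and the file list is carried over from the snippets above. Note that a trailing --model_name_or_path still crashes with IndexError, which matches the author's stated preference.

from open_instruct.utils import download_from_gs_bucket

def maybe_swap_gs_model_path(filtered_command, local_cache_folder):
    if "--model_name_or_path" not in filtered_command:
        # Without this guard, .index() raises ValueError: the flag was never passed.
        return
    model_name_idx = filtered_command.index("--model_name_or_path") + 1
    model_name_or_path = filtered_command[model_name_idx]  # IndexError if the flag is last
    if not model_name_or_path.startswith("gs://"):
        return  # local or HF-hub paths need no download
    files = ["tokenizer.json", "tokenizer_config.json", "config.json"]
    download_from_gs_bucket([f"{model_name_or_path}/{f}" for f in files], local_cache_folder)
    filtered_command[model_name_idx] = local_cache_folder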

