Conversation

@mnoukhov (Contributor) commented Nov 12, 2025

Copies the tokenizer, tokenizer_config, and config down to weka in order to do dataset preprocessing.


Note

When --model_name_or_path is a gs:// path, download tokenizer/config files to a local cache for preprocessing, and allow download_from_gs_bucket to accept multiple sources.

  • Open Instruct training (mason.py):
    • Detect --model_name_or_path with gs:// and download tokenizer.json, tokenizer_config.json, and config.json to a local cache ({auto_output_dir_path}/{whoami}/tokenizer_<hash>/), then substitute this path for dataset preprocessing (see the sketch after this list).
  • Utilities (open_instruct/utils.py):
    • Update download_from_gs_bucket to accept a single path or list of paths and create the destination directory before copy.
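
For concreteness, here is a minimal sketch of the mason.py flow described above. The function name, the choice of MD5 for the cache hash, and the exact control flow are assumptions pieced together from this summary and the review snippets further down; treat it as an illustration, not the PR's actual code.

import hashlib

from open_instruct.utils import download_from_gs_bucket

def cache_gs_tokenizer_files(filtered_command, auto_output_dir_path, whoami):
    # Hypothetical name. Locate the value that follows --model_name_or_path.
    try:
        model_name_idx = filtered_command.index("--model_name_or_path") + 1
        model_name_or_path = filtered_command[model_name_idx]
        if model_name_or_path.startswith("gs://"):
            # Cache under {auto_output_dir_path}/{whoami}/tokenizer_<hash>/;
            # the PR does not say which hash is used, so MD5 is assumed here.
            path_hash = hashlib.md5(model_name_or_path.encode()).hexdigest()[:8]
            local_cache_folder = f"{auto_output_dir_path}/{whoami}/tokenizer_{path_hash}/"
            download_from_gs_bucket(
                [
                    f"{model_name_or_path}/tokenizer.json",
                    f"{model_name_or_path}/tokenizer_config.json",
                    f"{model_name_or_path}/config.json",
                ],
                local_cache_folder,
            )
            # Substitute the local copy so dataset preprocessing never touches gs://.
            filtered_command[model_name_idx] = local_cache_folder
    except ValueError:
        # .index() raises ValueError when --model_name_or_path is not passed at all.
        pass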

Written by Cursor Bugbot for commit 98e7976.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @mnoukhov, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds support for dataset preprocessing of models stored exclusively on Google Cloud Storage. When a model path points to a GS bucket, the required tokenizer and configuration files are fetched and cached locally, integrating GS-hosted models into the existing data preparation workflow. This change broadens the system's compatibility with different model storage locations.

Highlights

  • Google Cloud Storage (GS) Model Support: The system now correctly handles model_name_or_path arguments that specify models located directly in Google Cloud Storage (GS) buckets.
  • Automated Tokenizer and Config Caching: For GS-hosted models, the necessary tokenizer.json, tokenizer_config.json, and config.json files are automatically downloaded and cached locally, ensuring dataset preprocessing can proceed without issues.
  • Improved GS Download Utility: The download_from_gs_bucket utility function has been enhanced to support downloading multiple files simultaneously and to automatically create the destination directory if it doesn't exist (see the sketch after this list).
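
As a sketch of that generalized utility: the gsutil invocation below is simplified (the real flags in open_instruct/utils.py may differ), and the live_subprocess_output stand-in only approximates the repo's helper of the same name.

import os
import subprocess

def live_subprocess_output(cmd):
    # Stand-in for the repo's helper, which streams the command's output live.
    subprocess.run(cmd, check=True)

def download_from_gs_bucket(src_paths, dest_path):
    # Accept a single path or a list of paths.
    if not isinstance(src_paths, list):
        src_paths = [src_paths]
    # Create the destination directory before the copy.
    os.makedirs(dest_path, exist_ok=True)
    cmd = ["gsutil", "cp"]
    cmd.extend(src_paths)
    cmd.append(dest_path)
    live_subprocess_output(cmd)
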
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot (Contributor) left a comment


Code Review

The pull request updates the data preprocessing logic to handle models stored in Google Cloud Storage by downloading necessary tokenizer files locally. The changes in open_instruct/utils.py to generalize download_from_gs_bucket are well-implemented. However, the new logic in mason.py has a critical bug that could lead to an IndexError if the --model_name_or_path argument is provided without a value or as the last argument. I've also included a suggestion to improve code maintainability.

Comment on lines +497 to +498
except ValueError:
pass
Contributor


critical

The try...except block only catches ValueError, but an IndexError can occur on line 480 if --model_name_or_path is the last argument in filtered_command. This would cause the script to crash. It's safer to catch both exceptions to handle this edge case gracefully.

Suggested change
except ValueError:
pass
except (ValueError, IndexError):
pass
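
To make the failure mode concrete, a reduced example (the command list is hypothetical):

# --model_name_or_path is the final token, so no value follows it.
filtered_command = ["python", "train.py", "--model_name_or_path"]

try:
    model_name_idx = filtered_command.index("--model_name_or_path") + 1
    model_name_or_path = filtered_command[model_name_idx]  # raises IndexError
except ValueError:
    pass  # only the "flag absent" case is caught; the IndexError escapes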

Contributor Author


If model_name_or_path is the last argument then we would want to crash, as that's a weird input.

Comment on lines +487 to +494
download_from_gs_bucket(
[
f"{model_name_or_path}/tokenizer.json",
f"{model_name_or_path}/tokenizer_config.json",
f"{model_name_or_path}/config.json",
],
local_cache_folder,
)
Contributor


medium

The list of files to download is hardcoded inside the download_from_gs_bucket call. To improve readability and maintainability, consider extracting this list into a named variable before the call. This makes it clearer what files are being downloaded. For example:

tokenizer_files = [
    "tokenizer.json",
    "tokenizer_config.json",
    "config.json",
]
download_from_gs_bucket(
    [f"{model_name_or_path}/{f}" for f in tokenizer_files],
    local_cache_folder,
)

Contributor Author


ehh, I think mine is fine as is

src_path,
dest_path,
]
if not isinstance(src_paths, list):
Collaborator


Let's change this to only take a list and force the callers to pass one in?
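
A sketch of the list-only signature being proposed; the type hints, simplified gsutil flags, and example paths are illustrative assumptions.

import os
import subprocess

def download_from_gs_bucket(src_paths: list[str], dest_path: str) -> None:
    # Callers must always pass a list, even for a single file.
    os.makedirs(dest_path, exist_ok=True)
    subprocess.run(["gsutil", "cp", *src_paths, dest_path], check=True)

# A caller with one file wraps it in a list:
download_from_gs_bucket(["gs://my-bucket/model/config.json"], "/tmp/tokenizer_cache/")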

]
if not isinstance(src_paths, list):
src_paths = [src_paths]
cmd.extend(src_paths)
Collaborator


Can you add some tests here? Let's mock live_subprocess_output so we capture the cmd it's called with and verify it against some known correct values. Should be a one-prompt change with Codex.
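
A sketch of such a test, assuming pytest-style tests and that both names are importable from open_instruct.utils; the exact cmd contents to assert depend on the real gsutil flags, so only the source and destination positions are checked here.

from unittest import mock

from open_instruct.utils import download_from_gs_bucket

def test_download_from_gs_bucket_cmd():
    with mock.patch("open_instruct.utils.live_subprocess_output") as mock_output:
        download_from_gs_bucket(["gs://bucket/model/config.json"], "/tmp/cache/")
    cmd = mock_output.call_args[0][0]
    # The source appears in the command, and the destination is the final argument.
    assert "gs://bucket/model/config.json" in cmd
    assert cmd[-1] == "/tmp/cache/"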


filtered_command[model_name_idx] = local_cache_folder
except ValueError:
pass
Collaborator


Can you rewrite this to make these changes:

  1. Add a comment saying when we get a ValueError.
  2. Change this to be a function, and return early when a condition is not true; then we have less nesting, which will make it easier to follow (see the sketch below).
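
A sketch of that refactor with guard clauses; the function name is hypothetical, and the file list is carried over from the snippets above. Note that a trailing --model_name_or_path still crashes with IndexError, which matches the author's stated preference.

from open_instruct.utils import download_from_gs_bucket

def maybe_swap_gs_model_path(filtered_command, local_cache_folder):
    if "--model_name_or_path" not in filtered_command:
        # Without this guard, .index() raises ValueError: the flag was never passed.
        return
    model_name_idx = filtered_command.index("--model_name_or_path") + 1
    model_name_or_path = filtered_command[model_name_idx]  # IndexError if the flag is last
    if not model_name_or_path.startswith("gs://"):
        return  # local or HF-hub paths need no download
    files = ["tokenizer.json", "tokenizer_config.json", "config.json"]
    download_from_gs_bucket([f"{model_name_or_path}/{f}" for f in files], local_cache_folder)
    filtered_command[model_name_idx] = local_cache_folder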

