pass in model_name_or_path that is only augusta and it works
#1183
base: main
Conversation
Summary of Changes (Gemini Code Assist): Hello @mnoukhov, I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces functionality to support dataset preprocessing for models stored exclusively on Google Cloud Storage. It ensures that when a model path points to a GS bucket, the required tokenizer and configuration files are fetched and cached locally, thereby integrating GS-hosted models seamlessly into the existing data preparation workflow. This change broadens the system's compatibility with different model storage locations.
Code Review
The pull request updates the data preprocessing logic to handle models stored in Google Cloud Storage by downloading necessary tokenizer files locally. The changes in open_instruct/utils.py to generalize download_from_gs_bucket are well-implemented. However, the new logic in mason.py has a critical bug that could lead to an IndexError if the --model_name_or_path argument is provided without a value or as the last argument. I've also included a suggestion to improve code maintainability.
except ValueError:
    pass
The try...except block only catches ValueError, but an IndexError can occur on line 480 if --model_name_or_path is the last argument in filtered_command. This would cause the script to crash. It's safer to catch both exceptions to handle this edge case gracefully.
Suggested change:

-except ValueError:
-    pass
+except (ValueError, IndexError):
+    pass
If --model_name_or_path is the last argument, then we would want to crash, as that's a weird input.
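To make the two failure modes concrete, here is a minimal sketch of the lookup being discussed. The names `filtered_command` and `--model_name_or_path` come from the diff; the helper function itself is hypothetical, written only to show which exception escapes in which case.

```python
def find_model_arg(filtered_command):
    """Return the value following --model_name_or_path, or None if absent."""
    try:
        # list.index raises ValueError when the flag is not present at all;
        # that case is benign, so we swallow it.
        model_name_idx = filtered_command.index("--model_name_or_path") + 1
        # If the flag is the LAST token, this subscript raises IndexError,
        # which is deliberately not caught (per the author's comment above).
        return filtered_command[model_name_idx]
    except ValueError:
        return None
```

Under this reading, a missing flag returns `None` quietly, while a flag with no value crashes loudly, matching the author's preference.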
download_from_gs_bucket(
    [
        f"{model_name_or_path}/tokenizer.json",
        f"{model_name_or_path}/tokenizer_config.json",
        f"{model_name_or_path}/config.json",
    ],
    local_cache_folder,
)
The list of files to download is hardcoded inside the download_from_gs_bucket call. To improve readability and maintainability, consider extracting this list into a named variable before the call. This makes it clearer what files are being downloaded. For example:
tokenizer_files = [
"tokenizer.json",
"tokenizer_config.json",
"config.json",
]
download_from_gs_bucket(
[f"{model_name_or_path}/{f}" for f in tokenizer_files],
local_cache_folder,
)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ehh, I think mine is fine as is
    src_path,
    dest_path,
]
if not isinstance(src_paths, list):
Let's change this to only take a list and force the callers to pass one in?
]
if not isinstance(src_paths, list):
    src_paths = [src_paths]
cmd.extend(src_paths)
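For context, a sketch of how the single-path normalization in the diff behaves. Only the `isinstance` check mirrors the PR; the surrounding `gsutil cp` command shape is an assumption for illustration.

```python
def build_copy_cmd(src_paths, dest_path):
    """Build a gsutil copy command, accepting one source path or a list."""
    cmd = ["gsutil", "cp"]
    if not isinstance(src_paths, list):
        src_paths = [src_paths]  # normalize a single path to a one-item list
    cmd.extend(src_paths)
    cmd.append(dest_path)
    return cmd
```

The reviewer's alternative, making callers always pass a list, would drop the `isinstance` branch at the cost of touching every call site.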
Can you add some tests here? Let's mock live_subprocess_output so we capture the cmd it's called with and verify it against some known correct values. Should be a one-prompt change with Codex.
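One possible shape for the requested test. Both `live_subprocess_output` and `download_from_gs_bucket` below are simplified stand-ins, not the real open_instruct code, so the exact command layout is an assumption; only the mocking pattern is the point.

```python
from unittest import mock

def live_subprocess_output(cmd):
    # Stand-in: the real helper streams subprocess output live.
    raise RuntimeError("should be mocked in tests")

def download_from_gs_bucket(src_paths, dest_path):
    # Stand-in for the function under test; real command flags may differ.
    cmd = ["gsutil", "cp", *src_paths, dest_path]
    live_subprocess_output(cmd)

def test_download_cmd():
    # Patch live_subprocess_output in the module under test so nothing
    # actually runs, then inspect the command it was handed.
    with mock.patch(f"{__name__}.live_subprocess_output") as mock_run:
        download_from_gs_bucket(["gs://b/tokenizer.json"], "/tmp/cache")
    mock_run.assert_called_once()
    (cmd,), _ = mock_run.call_args
    assert cmd == ["gsutil", "cp", "gs://b/tokenizer.json", "/tmp/cache"]
```

In the real test, the patch target would be the module path where `live_subprocess_output` is looked up (e.g. the module containing `download_from_gs_bucket`), not `__name__`.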
        filtered_command[model_name_idx] = local_cache_folder
except ValueError:
    pass
Can you rewrite this to make these changes:
- add a comment saying when we get a ValueError
- Can you change this to be a function, and return early when a condition is not true? then we can have less nesting, which will make it easier to follow.
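The early-return refactor being requested could look roughly like this. Names (`filtered_command`, `local_cache_folder`, the `gs://` prefix check) follow the surrounding diff, but the function boundary and its name are hypothetical.

```python
def maybe_substitute_cached_model(filtered_command, local_cache_folder):
    """Rewrite a gs:// --model_name_or_path to its local cache folder, in place."""
    # Early return instead of catching ValueError from list.index:
    # the flag simply was not passed, which is fine.
    if "--model_name_or_path" not in filtered_command:
        return
    model_name_idx = filtered_command.index("--model_name_or_path") + 1
    # Note: if the flag is the last token, the subscript below raises
    # IndexError, which is intentionally left uncaught (weird input).
    if not filtered_command[model_name_idx].startswith("gs://"):
        return  # nothing to do for local or HF-hub model paths
    filtered_command[model_name_idx] = local_cache_folder
```

Each guard returns early, so the happy path at the bottom has no nesting, which is the readability win the review asks for.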
Copies down the tokenizer, tokenizer_config, and config to Weka in order to do dataset preprocessing.
Note

When `--model_name_or_path` is a `gs://` path, download tokenizer/config files to a local cache for preprocessing, and allow `download_from_gs_bucket` to accept multiple sources.

- Detects `--model_name_or_path` with a `gs://` prefix and downloads `tokenizer.json`, `tokenizer_config.json`, and `config.json` to a local cache (`{auto_output_dir_path}/{whoami}/tokenizer_<hash>/`), then substitutes this path for dataset preprocessing.
- Updates `download_from_gs_bucket` to accept a single path or a list of paths and to create the destination directory before copying.

Written by Cursor Bugbot for commit 98e7976.
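Purely illustrative: one way a `tokenizer_<hash>` cache path like the one described above might be derived. The actual hash input and scheme used in the PR are not shown in this summary, so the digest choice below is an assumption.

```python
import hashlib

def tokenizer_cache_folder(auto_output_dir_path, whoami, model_name_or_path):
    """Derive a per-model cache folder; hashing scheme is assumed, not from the PR."""
    # Hash the model path so different gs:// models get distinct cache dirs.
    digest = hashlib.sha256(model_name_or_path.encode()).hexdigest()[:8]
    return f"{auto_output_dir_path}/{whoami}/tokenizer_{digest}/"
```

The key property is determinism: the same `gs://` path always maps to the same folder, so repeated preprocessing runs reuse the cached tokenizer files.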