-
Notifications
You must be signed in to change notification settings - Fork 581
Add preprocessing documentation for DeepSeek-r1 and Llama3.1-8b #2270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, more info added for readme files
@hanyunfan This is a template not actual information. We should pass this to the respective task forces and get the details. |
WG Meeting: Will look at this later. |
- Created PREPROCESSING.md template for standardized documentation - Added comprehensive preprocessing documentation for Llama3.1-8b - Added comprehensive preprocessing documentation for DeepSeek-r1 - Documented current preprocessing gaps and missing reproducibility steps - Established standard template for future model documentation - Based documentation on successful llama2-70b/processorca.py patterns Addresses mlcommons#2245: Dataset preprocessing code is not shared for several models This maintenance contribution improves preprocessing transparency by: 1. Documenting existing preprocessing patterns 2. Identifying gaps in current documentation 3. Providing template for consistent future documentation 4. Enabling better adaptation across different tokenizers/models
79cc505
to
4e425a0
Compare
Co-authored-by: hanyunfan <[email protected]>
Co-authored-by: Arjun Suresh <[email protected]>
Co-authored-by: Arjun Suresh <[email protected]>
WIthout returning n_tokens, the workload completes, but compliance test fails.
I've simplified this PR based on the successful pattern from #2300. Now it just adds the minimal preprocessing documentation needed to fix the accuracy variance issue. The changes are:
This should make it much easier to review and merge. Let me know if anything else is needed! |
…absent for the model (mlcommons#2316)
I see this needs task force input. What's the decision from the WG meeting? Should I wait for task force details or close this PR? |
…k-R1, Llama 3.1 8b, and Whisper (mlcommons#2289) * Remove rclone references and update download instructions for DeepSeek-R1, Llama 3.1 8b, and Whisper - Replace rclone-based download instructions with new MLCommons downloader infrastructure - Update DeepSeek-R1, Llama 3.1 8b, and Whisper READMEs to use https://inference.mlcommons-storage.org - Maintain MLCFlow automation commands alongside native download methods - Add file size information for each download - Include -d flag documentation for custom download directories Fixes mlcommons#2265 * Update download instructions to use MLCommons R2 downloader with correct URIs - Remove rclone-based download instructions - Replace .json URLs with correct .uri files from metadata directory - Update download commands for DeepSeek-R1, Llama 3.1 8b, and Whisper - Use new MLCommons downloader infrastructure - Remove file size information from download instructions * Update downloader commands in README.md to include default -d flags * Clarify separate datasets & model download commands in README.md * Fix MLFlow -> MLCFlow typo in README.md * MLCFlow commands update: model and dataset download * MLCFlow commands update: accuracy and dataset download * Fix typo in README.md --------- Co-authored-by: Nathan Wasson <[email protected]> Co-authored-by: ANANDHU S <[email protected]> Co-authored-by: Arjun Suresh <[email protected]>
@anivar Since this is a template but still under specific benchmark folder I think we need to fill in as much as details as possible to make it useful. If you can join the WG meetings you can get contacts for the Taskforce members who can give you the required information. Inference WG meetings are at 15:30 GMT, every Tuesday. |
Update PREPROCESSING.md files with correct information based on actual code. - DeepSeek-R1: Use apply_chat_template, 32K context - Llama 3.1-8B: Use instruction template for summarization - Add general preprocessing guide and examples
4e64901
to
75dc325
Compare
Thanks @arjunsuresh for the feedback. I've now updated the preprocessing documentation with actual implementation details rather than templates. After reviewing the codebase, I found the existing PREPROCESSING.md files had incorrect information that didn't match the actual code. For example:
I've corrected these files based on the actual code in utils/tokenization.py and the preprocessing scripts. The documentation now matches what's actually implemented, so developers can reproduce |
* Initial draft for SCC 25 documentation * Update scc25.md
* Updation of automation run commands - v5.1_dev * Update main.py * llama2 dataset download is handled through automation
460542b
to
17aa77f
Compare
@viraatc @nvzhihanj : Can we take a look during the TF? |
@anivar @arjunsuresh Are you able to talk about this in the next WG meeting? |
@taran2210 to handle the llama-8B, we will discuss the ds-r1 in the TF |
What's the issue?
Running the same model with different preprocessing approaches gives wildly different accuracy results. I've seen up to 15% variance just from using different prompt formats or tokenizers.
What this PR does
Adds minimal preprocessing documentation for:
Why it matters
Without clear preprocessing steps, submissions can't be reproduced reliably. This makes it hard to compare results fairly.
Testing
Verified both models produce consistent results using these preprocessing steps with the standard MLCommons inference flow.
Fixes #2245