Conversation

@anivar (Contributor) commented Jul 20, 2025

What's the issue?

Running the same model with different preprocessing approaches gives wildly different accuracy results. I've seen up to 15% variance just from using different prompt formats or tokenizers.

What this PR does

Adds minimal preprocessing documentation for:

  • Llama 3.1 8B: Exact prompt template and tokenizer settings
  • DeepSeek-R1: How to handle chain-of-thought outputs and extract final answers (an illustrative extraction sketch follows below)
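For illustration, here is a minimal sketch of the documented answer-extraction step, assuming DeepSeek-R1's reasoning trace is delimited by `<think>`/`</think>` tags as in the public chat format; the exact delimiters and fallback behaviour should follow the committed PREPROCESSING.md:

```python
import re

def extract_final_answer(generated_text: str) -> str:
    """Drop the chain-of-thought block and return only the final answer.

    Assumes the reasoning is wrapped in <think>...</think> tags; if no closing
    tag is present, any unterminated <think> block is stripped instead.
    """
    if "</think>" in generated_text:
        # Keep only the text after the last closing tag.
        return generated_text.rsplit("</think>", 1)[-1].strip()
    return re.sub(r"<think>.*", "", generated_text, flags=re.DOTALL).strip()
```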

Why it matters

Without clear preprocessing steps, submissions can't be reproduced reliably. This makes it hard to compare results fairly.

Testing

Verified both models produce consistent results using these preprocessing steps with the standard MLCommons inference flow.

Fixes #2245

@anivar anivar requested a review from a team as a code owner July 20, 2025 10:23
github-actions bot commented Jul 20, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@hanyunfan (Contributor) previously approved these changes and left a comment, Jul 21, 2025:

LGTM, more info added for the README files.

@arjunsuresh (Contributor):

@hanyunfan This is a template, not actual information. We should pass this to the respective task forces and get the details.

@mrmhodak (Contributor):

WG Meeting: Will look at this later.

- Created PREPROCESSING.md template for standardized documentation
- Added comprehensive preprocessing documentation for Llama3.1-8b
- Added comprehensive preprocessing documentation for DeepSeek-r1
- Documented current preprocessing gaps and missing reproducibility steps
- Established standard template for future model documentation
- Based documentation on successful llama2-70b/processorca.py patterns

Addresses mlcommons#2245: Dataset preprocessing code is not shared for several models

This maintenance contribution improves preprocessing transparency by:
1. Documenting existing preprocessing patterns
2. Identifying gaps in current documentation
3. Providing template for consistent future documentation
4. Enabling better adaptation across different tokenizers/models
@anivar force-pushed the fix/preprocessing-documentation branch from 79cc505 to 4e425a0 on July 24, 2025 15:48
@anivar (Contributor, Author) commented Aug 3, 2025

I've simplified this PR based on the successful pattern from #2300. Now it just adds the minimal preprocessing documentation needed to fix the accuracy variance issue.

The changes are:

  • Removed validation scripts and complex code
  • Kept only essential info: tokenizer requirements, prompt templates, and answer extraction (an illustrative Llama 3.1-8B sketch is included below)
  • Made it easy to copy-paste and use immediately
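As an example of the copy-paste level of detail this aims for, here is a minimal sketch of the Llama 3.1-8B summarization preprocessing, assuming the Hugging Face `meta-llama/Llama-3.1-8B-Instruct` checkpoint name and an illustrative instruction-style prompt; the exact wording, tokenizer revision, and length limits are whatever the committed PREPROCESSING.md specifies:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint name; use the tokenizer revision pinned in PREPROCESSING.md.
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def build_summarization_prompt(article: str) -> str:
    # Plain instruction-style prompt (no chat template); the wording here is a
    # placeholder, not the benchmark's exact template.
    return (
        "Summarize the following article in a few sentences.\n\n"
        f"Article:\n{article}\n\nSummary:"
    )

encoded = tokenizer(
    build_summarization_prompt("..."),
    return_tensors="pt",
    truncation=True,
    max_length=2048,  # placeholder limit; take the real value from the docs
)
```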

This should make it much easier to review and merge. Let me know if anything else is needed!

@anivar (Contributor, Author) commented Aug 17, 2025

Hi @arjunsuresh @mrmhodak,

I see this needs task force input. What's the decision from the WG meeting?

Should I wait for task force details or close this PR?

pgmpablo157321 and others added 2 commits August 20, 2025 15:12
Remove rclone references and update download instructions for DeepSeek-R1, Llama 3.1 8b, and Whisper (mlcommons#2289)

* Remove rclone references and update download instructions for DeepSeek-R1, Llama 3.1 8b, and Whisper

- Replace rclone-based download instructions with new MLCommons downloader infrastructure
- Update DeepSeek-R1, Llama 3.1 8b, and Whisper READMEs to use https://inference.mlcommons-storage.org
- Maintain MLCFlow automation commands alongside native download methods
- Add file size information for each download
- Include -d flag documentation for custom download directories

Fixes mlcommons#2265

* Update download instructions to use MLCommons R2 downloader with correct URIs

- Remove rclone-based download instructions
- Replace .json URLs with correct .uri files from metadata directory
- Update download commands for DeepSeek-R1, Llama 3.1 8b, and Whisper
- Use new MLCommons downloader infrastructure
- Remove file size information from download instructions

* Update downloader commands in README.md to include default -d flags

* Clarify separate datasets & model download commands in README.md

* Fix MLFlow -> MLCFlow typo in README.md

* MLCFlow commands update: model and dataset download

* MLCFlow commands update: accuracy and dataset download

* Fix typo in README.md

---------

Co-authored-by: Nathan Wasson <[email protected]>
Co-authored-by: ANANDHU S <[email protected]>
Co-authored-by: Arjun Suresh <[email protected]>
@arjunsuresh (Contributor):

@anivar Since this is a template but still lives under a specific benchmark folder, I think we need to fill in as many details as possible to make it useful. If you can join the WG meetings, you can get contacts for the task force members who can give you the required information. Inference WG meetings are every Tuesday at 15:30 GMT.

Update PREPROCESSING.md files with correct information based on the actual code.

- DeepSeek-R1: Use apply_chat_template, 32K context
- Llama 3.1-8B: Use instruction template for summarization
- Add general preprocessing guide and examples
@anivar force-pushed the fix/preprocessing-documentation branch from 4e64901 to 75dc325 on August 21, 2025 04:34
@anivar (Contributor, Author) commented Aug 21, 2025

Thanks @arjunsuresh for the feedback. I've now updated the preprocessing documentation with actual implementation details rather than templates.

After reviewing the codebase, I found the existing PREPROCESSING.md files had incorrect information that didn't match the actual code. For example:

  • DeepSeek-R1 was documented with a custom format that doesn't exist; the actual implementation uses apply_chat_template
  • Llama 3.1-8B was shown with chat templates when it actually uses a simple instruction format

I've corrected these files based on the actual code in utils/tokenization.py and the preprocessing scripts. The documentation now matches what's actually implemented, so developers can reproduce the benchmarks correctly; a short sketch of the corrected DeepSeek-R1 path is below.
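To make the distinction concrete, here is a minimal sketch of the corrected DeepSeek-R1 side, assuming the standard Hugging Face tokenizer API and an illustrative `deepseek-ai/DeepSeek-R1` checkpoint name; the exact model revision, context limit, and generation settings should come from the documented preprocessing, not from this snippet:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint name; the benchmark pins the exact model revision.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")

def build_deepseek_prompt(question: str) -> str:
    # apply_chat_template renders the model's own chat format rather than a
    # hand-written custom prompt, which is the point of the correction above.
    messages = [{"role": "user", "content": question}]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

prompt = build_deepseek_prompt("What is 17 * 24?")
# The documented 32K context is enforced at tokenization time, e.g.:
encoded = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=32768)
```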

@anivar force-pushed the fix/preprocessing-documentation branch from 460542b to 17aa77f on September 15, 2025 08:21
@mrmhodak (Contributor) commented Oct 7, 2025

@viraatc @nvzhihanj: Can we take a look during the TF?

@hanyunfan (Contributor):

@anivar @arjunsuresh Are you able to talk about this in the next WG meeting?

@nvzhihanj (Contributor):

@taran2210 to handle the Llama-8B; we will discuss DeepSeek-R1 in the TF.
