From d02b7cbd795465b50a4a6ab415865db6d1c77425 Mon Sep 17 00:00:00 2001 From: Anivar A Aravind Date: Sun, 20 Jul 2025 15:51:36 +0530 Subject: [PATCH 01/22] Add preprocessing documentation for DeepSeek-r1 and Llama3.1-8b - Created PREPROCESSING.md template for standardized documentation - Added comprehensive preprocessing documentation for Llama3.1-8b - Added comprehensive preprocessing documentation for DeepSeek-r1 - Documented current preprocessing gaps and missing reproducibility steps - Established standard template for future model documentation - Based documentation on successful llama2-70b/processorca.py patterns Addresses #2245: Dataset preprocessing code is not shared for several models This maintenance contribution improves preprocessing transparency by: 1. Documenting existing preprocessing patterns 2. Identifying gaps in current documentation 3. Providing template for consistent future documentation 4. Enabling better adaptation across different tokenizers/models --- PREPROCESSING-TEMPLATE.md | 127 ++++++++++++++++++++++++++ language/deepseek-r1/PREPROCESSING.md | 113 +++++++++++++++++++++++ language/llama3.1-8b/PREPROCESSING.md | 82 +++++++++++++++++ 3 files changed, 322 insertions(+) create mode 100644 PREPROCESSING-TEMPLATE.md create mode 100644 language/deepseek-r1/PREPROCESSING.md create mode 100644 language/llama3.1-8b/PREPROCESSING.md diff --git a/PREPROCESSING-TEMPLATE.md b/PREPROCESSING-TEMPLATE.md new file mode 100644 index 0000000000..57a28bd80e --- /dev/null +++ b/PREPROCESSING-TEMPLATE.md @@ -0,0 +1,127 @@ +# Dataset Preprocessing Documentation Template + +## Purpose +This template provides a standardized way to document dataset preprocessing steps for MLCommons inference benchmarks, ensuring reproducibility and transparency. + +## Template Structure + +### Model: [MODEL_NAME] +**Dataset:** [DATASET_NAME] +**Evaluation Task:** [TASK_DESCRIPTION] + +#### Data Source +- **Raw Dataset:** [SOURCE_AND_FORMAT] +- **Download Method:** [HOW_TO_OBTAIN] +- **License:** [LICENSE_INFO] + +#### Preprocessing Pipeline + +##### 1. Tokenization +```python +# Example based on llama2-70b/processorca.py pattern +from transformers import [TOKENIZER_CLASS] +tokenizer = [TOKENIZER_CLASS].from_pretrained(model_dir) +tokens = tokenizer(text)["input_ids"] +``` + +##### 2. Filtering Steps +- **Language Filter:** [DESCRIPTION] +- **Length Filter:** [SEQUENCE_LENGTH_LIMITS] +- **Quality Filter:** [QUALITY_CRITERIA] +- **Content Filter:** [CONTENT_RESTRICTIONS] + +##### 3. Formatting +- **Input Format:** [INPUT_TEMPLATE] +- **Output Format:** [OUTPUT_TEMPLATE] +- **Special Tokens:** [SPECIAL_TOKEN_HANDLING] + +##### 4. 
Sampling Strategy +- **Total Samples:** [NUMBER] +- **Sampling Method:** [RANDOM/STRATIFIED/OTHER] +- **Validation Split:** [IF_APPLICABLE] + +#### Adaptation Guide +**For Different Tokenizers:** +- Modify tokenizer initialization +- Adjust sequence length limits +- Update special token handling + +**For Different Models:** +- Update input/output templates +- Adjust filtering criteria +- Modify prompt formatting + +#### Files Generated +- **Main Dataset:** [FILENAME_AND_FORMAT] +- **Calibration Set:** [FILENAME_AND_FORMAT] +- **Metadata:** [FILENAME_AND_FORMAT] + +#### Verification +- **Expected Sample Count:** [NUMBER] +- **Checksum/Hash:** [IF_AVAILABLE] +- **Quality Metrics:** [ROUGE/BLEU/OTHER] + +--- + +## Example Applications + +### Llama3.1-8b (CNN/DailyMail) +**Dataset:** CNN/DailyMail 3.0.0 +**Evaluation Task:** Text Summarization + +#### Data Source +- **Raw Dataset:** Hugging Face `cnn_dailymail` dataset v3.0.0 +- **Download Method:** `datasets.load_dataset("cnn_dailymail", "3.0.0")` +- **License:** Apache 2.0 + +#### Preprocessing Pipeline +##### 1. Tokenization +```python +from transformers import AutoTokenizer +tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct") +tokenizer.padding_side = "left" +tokenizer.pad_token = tokenizer.eos_token +tokenizer.model_max_length = 8000 +``` + +##### 2. Formatting +- **Input Template:** +``` +Summarize the following news article in 128 tokens. Please output the summary only, without any other text. + +Article: +{article} + +Summary: +``` + +##### 3. Current Gaps +- ❌ No documented filtering steps +- ❌ No sampling strategy explanation +- ❌ No quality control measures +- ❌ No reproducible preprocessing script + +### DeepSeek-r1 (Multi-domain Evaluation) +**Dataset:** Ensemble of AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench +**Evaluation Task:** Multi-domain Reasoning + +#### Data Source +- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2 +- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/` +- **License:** Various (CC0, MIT, CC BY 4.0) + +#### Current Gaps +- ❌ No documented preprocessing steps +- ❌ No tokenization details +- ❌ No filtering or sampling explanation +- ❌ No adaptation guide for other models +- ❌ Cannot reproduce from raw sources + +--- + +## Implementation Recommendation + +1. **For each model directory**, add `PREPROCESSING.md` following this template +2. **For models with preprocessing scripts**, document the steps in the README +3. **For models using preprocessed data**, provide original preprocessing methodology +4. 
**Create common utilities** for preprocessing patterns that can be shared across models \ No newline at end of file diff --git a/language/deepseek-r1/PREPROCESSING.md b/language/deepseek-r1/PREPROCESSING.md new file mode 100644 index 0000000000..9f813197bb --- /dev/null +++ b/language/deepseek-r1/PREPROCESSING.md @@ -0,0 +1,113 @@ +# Dataset Preprocessing Documentation - DeepSeek-R1 + +## Model: DeepSeek-R1 +**Dataset:** Multi-domain Evaluation Ensemble +**Evaluation Task:** Multi-domain Reasoning and Code Generation + +## Data Source +- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2 bucket +- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/` +- **Components:** AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench (code_generation_lite) +- **Licenses:** + - AIME: [CC0](https://creativecommons.org/public-domain/cc0/) + - MATH500: [MIT](https://opensource.org/license/mit) + - GPQA: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) + - MMLU-Pro: [MIT](https://opensource.org/license/mit) + - LiveCodeBench: [CC](https://creativecommons.org/share-your-work/cclicenses/) + +## Current Implementation + +### Files Available +- **Main Dataset:** `mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl` +- **Calibration Set:** `mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl` +- **Format:** Preprocessed pickle files ready for evaluation + +### Download Process +```bash +# Install Rclone +sudo -v ; curl https://rclone.org/install.sh | sudo bash + +# Configure access +rclone config create mlc-inference s3 provider=Cloudflare \ + access_key_id=f65ba5eef400db161ea49967de89f47b \ + secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b \ + endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com + +# Download datasets +rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl ./ -P +rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl ./ -P +``` + +## Missing Documentation (Addresses Issue #2245) + +The following preprocessing information is **not currently available**, making reproduction and adaptation difficult: + +### 1. Original Data Sources +- **Raw Dataset Locations:** Where each component dataset was obtained +- **Version Information:** Specific versions/commits of source datasets +- **Access Methods:** How to obtain raw data independently + +### 2. Preprocessing Pipeline +- **Tokenization Method:** Which tokenizer was used and configuration +- **Input Formatting:** How different dataset formats were standardized +- **Quality Filtering:** Criteria for sample inclusion/exclusion +- **Ensemble Strategy:** How multiple datasets were combined + +### 3. Dataset Statistics +- **Sample Counts:** Number of samples from each component dataset +- **Distribution:** How samples are balanced across domains +- **Difficulty Levels:** Complexity distribution of included problems + +### 4. 
Validation Process +- **Quality Control:** How preprocessing quality was verified +- **Consistency Checks:** Validation of format standardization +- **Error Handling:** How malformed samples were addressed + +## Adaptation Challenges + +**For Different Tokenizers:** +- Cannot modify tokenization without access to raw data +- No documentation of original tokenization parameters +- Unable to test preprocessing consistency + +**For Different Models:** +- Cannot adapt input formatting without preprocessing scripts +- No guidance on prompt template modifications +- Unable to reproduce dataset with different filtering criteria + +## Recommended Improvements + +To fully address issue #2245 and improve reproducibility: + +### 1. Raw Data Access +- Provide scripts to download original datasets +- Document exact versions and sources used +- Include data licenses and attribution + +### 2. Preprocessing Scripts +- Create preprocessing pipeline (similar to `llama2-70b/processorca.py`) +- Document tokenization and formatting steps +- Include quality filtering logic + +### 3. Documentation +- Add detailed preprocessing methodology +- Include dataset statistics and composition +- Provide adaptation guidelines + +### 4. Validation +- Include preprocessing verification scripts +- Document expected outputs and checksums +- Provide quality metrics + +## Temporary Workaround + +Until full preprocessing documentation is available: +1. Use provided preprocessed datasets for standard evaluation +2. Contact maintainers for specific adaptation requirements +3. Reference `llama2-70b/processorca.py` for preprocessing patterns +4. Consider contributing preprocessing scripts based on reverse engineering + +## See Also +- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing +- `PREPROCESSING-TEMPLATE.md` - Standard template for future models +- Repository issue #2245 - Discussion of preprocessing documentation gaps \ No newline at end of file diff --git a/language/llama3.1-8b/PREPROCESSING.md b/language/llama3.1-8b/PREPROCESSING.md new file mode 100644 index 0000000000..84f0914fbf --- /dev/null +++ b/language/llama3.1-8b/PREPROCESSING.md @@ -0,0 +1,82 @@ +# Dataset Preprocessing Documentation - Llama3.1-8B + +## Model: Llama3.1-8B +**Dataset:** CNN/DailyMail 3.0.0 +**Evaluation Task:** Text Summarization + +## Data Source +- **Raw Dataset:** Hugging Face `cnn_dailymail` dataset v3.0.0 +- **Download Method:** `datasets.load_dataset("cnn_dailymail", "3.0.0", split="train")` +- **License:** Apache 2.0 +- **Download Script:** `download_cnndm.py` + +## Preprocessing Pipeline + +### 1. Tokenization +```python +from transformers import AutoTokenizer +tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct") +tokenizer.padding_side = "left" +tokenizer.pad_token = tokenizer.eos_token +tokenizer.model_max_length = 8000 +``` + +### 2. Input Template +``` +Summarize the following news article in 128 tokens. Please output the summary only, without any other text. + +Article: +{article} + +Summary: +``` + +### 3. Current Implementation +- **Download:** `download_cnndm.py` loads CNN/DailyMail dataset +- **Calibration:** `prepare-calibration.py` creates calibration subset +- **Evaluation:** Uses `evaluation.py` for accuracy assessment + +## Missing Documentation (Addresses Issue #2245) + +The following preprocessing steps are **not currently documented** but would be needed for full reproducibility: + +### 4. 
Filtering Steps (Recommended) +Based on `llama2-70b/processorca.py` patterns: +- **Language Filter:** English-only content validation +- **Length Filter:** Input/output sequence length limits +- **Quality Filter:** Remove very short summaries +- **Content Filter:** Handle special characters and formatting + +### 5. Sampling Strategy (Recommended) +- **Dataset Size:** Specify number of evaluation samples +- **Selection Method:** Random vs stratified sampling +- **Validation:** How to verify preprocessing consistency + +## Adaptation Guide + +**For Different Tokenizers:** +1. Update `model-id` parameter in scripts +2. Adjust `model_max_length` based on tokenizer capabilities +3. Verify special token handling (pad_token, eos_token) + +**For Different Models:** +1. Modify input template format +2. Adjust summary length requirements (currently 128 tokens) +3. Update evaluation criteria as needed + +## Files Generated +- **Main Dataset:** Downloaded via `download_cnndm.py` +- **Calibration Set:** Generated via `prepare-calibration.py` +- **Format:** Standard CNN/DailyMail format from Hugging Face + +## Next Steps for Full Reproducibility + +To fully address issue #2245, consider adding: +1. Complete preprocessing script (similar to `llama2-70b/processorca.py`) +2. Documentation of filtering criteria +3. Sampling methodology +4. Quality validation steps + +## See Also +- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing +- `PREPROCESSING-TEMPLATE.md` - Standard template for future models \ No newline at end of file From 47863bc1217f88d27164acc742baa2e3f840f22b Mon Sep 17 00:00:00 2001 From: Pablo Gonzalez Date: Tue, 29 Jul 2025 10:51:54 -0500 Subject: [PATCH 02/22] Fix llama3.1-8b metric and dataset (#2300) --- language/llama3.1-8b/README.md | 54 +++++++++++++++++++++++--- language/llama3.1-8b/download_cnndm.py | 49 ++++++++++++----------- tools/submission/submission_checker.py | 12 +++--- 3 files changed, 80 insertions(+), 35 deletions(-) diff --git a/language/llama3.1-8b/README.md b/language/llama3.1-8b/README.md index a7a5ad3503..9dd571411b 100644 --- a/language/llama3.1-8b/README.md +++ b/language/llama3.1-8b/README.md @@ -171,7 +171,7 @@ mlcr get,dataset,cnndm,_validation,_edge,_llama3,_mlc,_rclone --outdirname= Date: Tue, 29 Jul 2025 11:04:05 -0500 Subject: [PATCH 03/22] Add interactive scenario to llama3.1 models (#2299) --- tools/submission/generate_final_report.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/submission/generate_final_report.py b/tools/submission/generate_final_report.py index 519cbe0450..4c0ac7cb18 100644 --- a/tools/submission/generate_final_report.py +++ b/tools/submission/generate_final_report.py @@ -223,9 +223,9 @@ def main(): "llama2-70b-99.9": ["Server", "Offline", "Interactive"], "mixtral-8x7b": ["Server", "Offline"], "rgat": ["Offline"], - "llama3.1-405b": ["Offline", "Server"], + "llama3.1-405b": ["Server", "Offline", "Interactive"], "pointpainting": [], - "llama3.1-8b": ["Server", "Offline"], + "llama3.1-8b": ["Server", "Offline", "Interactive"], "deepseek-r1": ["Server", "Offline"], "whisper": ["Offline"], }, From b966d5806c94ccc1bedfcd20540972e074b3d804 Mon Sep 17 00:00:00 2001 From: Pablo Gonzalez Date: Tue, 29 Jul 2025 11:58:39 -0500 Subject: [PATCH 04/22] Allow more flexible datatypes in measurements file (#2298) Co-authored-by: hanyunfan --- tools/submission/submission_checker.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/submission/submission_checker.py 
b/tools/submission/submission_checker.py index 94c84b8218..bebe9994bf 100755 --- a/tools/submission/submission_checker.py +++ b/tools/submission/submission_checker.py @@ -2185,7 +2185,7 @@ def log_result( inferred, power_metric > 0, unit, - '"' + weight_data_types + '"', + '"' + str(weight_data_types).replace("\"", "'") + '"', ) ) From 1ba6f9e90f889b54a387629e200437824f63d99a Mon Sep 17 00:00:00 2001 From: Taran Iyengar Date: Tue, 29 Jul 2025 10:07:03 -0700 Subject: [PATCH 05/22] Update evaluation.py (#2303) Co-authored-by: Arjun Suresh --- language/llama3.1-8b/evaluation.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/language/llama3.1-8b/evaluation.py b/language/llama3.1-8b/evaluation.py index f4fe08e08a..1b5d27f430 100644 --- a/language/llama3.1-8b/evaluation.py +++ b/language/llama3.1-8b/evaluation.py @@ -70,7 +70,7 @@ def main(): tokenizer = AutoTokenizer.from_pretrained( model_name, - model_max_length=2048, + model_max_length=128000, padding_side="left", use_fast=False, ) From db2d63d8e443cef3b1a8546302d5e3f5cab1c9c0 Mon Sep 17 00:00:00 2001 From: Pablo Gonzalez Date: Tue, 29 Jul 2025 12:13:46 -0500 Subject: [PATCH 06/22] Only require Server or Interactive for closed (#2304) Co-authored-by: Arjun Suresh --- tools/submission/submission_checker.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/submission/submission_checker.py b/tools/submission/submission_checker.py index bebe9994bf..152a64040c 100755 --- a/tools/submission/submission_checker.py +++ b/tools/submission/submission_checker.py @@ -2865,7 +2865,7 @@ def log_result( required_scenarios, ) - if "Server" in optional_scenarios and "Interactive" in optional_scenarios: + if is_closed_or_network and "Server" in optional_scenarios and "Interactive" in optional_scenarios: results[name] = None log.error( "%s does not have all required scenarios, one of [Server, Interactive] is required", From e3e030f1240eb91c9d777b716b7ecb1424479661 Mon Sep 17 00:00:00 2001 From: Keith Achorn Date: Tue, 29 Jul 2025 21:15:50 -0700 Subject: [PATCH 07/22] [Whisper] Adding n_token return for compliance fix (#2305) WIthout returning n_tokens, the workload completes, but compliance test fails. 
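For reference, a minimal sketch of the response path this change introduces — the names (qid, response_array, n_tokens, lg) are the ones used in reference_SUT.py in the diff below, where n_tokens is len(output.outputs[0].token_ids):

```python
# Sketch: complete a query with the generated token count so the
# compliance check can account for output tokens (mirrors the diff below).
bi = response_array.buffer_info()
response = lg.QuerySampleResponse(
    qid,                              # query sample id
    bi[0],                            # address of the response buffer
    bi[1] * response_array.itemsize,  # response size in bytes
    n_tokens,                         # number of generated tokens
)
lg.QuerySamplesComplete([response])
```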
--- speech2text/reference_SUT.py | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/speech2text/reference_SUT.py b/speech2text/reference_SUT.py index 5b3b6c5038..fcd5385507 100644 --- a/speech2text/reference_SUT.py +++ b/speech2text/reference_SUT.py @@ -215,14 +215,15 @@ def process_queries(self): for output in outputs: request_id = int(output.request_id) vllm_text = output.outputs[0].text - results.append(vllm_text) + results.append((vllm_text, len(output.outputs[0].token_ids))) query_ids.append(self.query_idx_mapping[request_id]) qid.append(self.qid_mapping[request_id]) self.num_samples += len(results) - for i, result in enumerate(results): + for i, result_tuple in enumerate(results): # Whisper outputs space in the front and capitalizes things + result, n_tokens = result_tuple result = result.lower().strip() transcript = [] for s in result: @@ -233,7 +234,7 @@ def process_queries(self): assert len(transcript) == 1 response_array = array.array('q', transcript[0]) - self.output_queue.put((qid[i], response_array)) + self.output_queue.put((qid[i], n_tokens, response_array)) print(f"Finished {qid[i]}") return True @@ -330,14 +331,13 @@ def flush_queries(self): def response_loadgen(self): keep_alive = True while keep_alive: - result = self.output_queue.get() - if result is None: + qid, n_tokens, response_array = self.output_queue.get() + if qid is None: keep_alive = False else: - qid, response_array = result bi = response_array.buffer_info() response = lg.QuerySampleResponse(qid, bi[0], - bi[1] * response_array.itemsize) + bi[1] * response_array.itemsize, n_tokens) lg.QuerySamplesComplete([response]) def stop(self): From e86ddca11258e0f7504c9a6279eac648892e0262 Mon Sep 17 00:00:00 2001 From: ANANDHU S <71482562+anandhu-eng@users.noreply.github.com> Date: Wed, 30 Jul 2025 18:15:05 +0530 Subject: [PATCH 08/22] Fix checking power directory (#2306) --- tools/submission/preprocess_submission.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/submission/preprocess_submission.py b/tools/submission/preprocess_submission.py index 59bb3b92b2..3ff95bb484 100644 --- a/tools/submission/preprocess_submission.py +++ b/tools/submission/preprocess_submission.py @@ -224,7 +224,7 @@ def clean_invalid_results(args, log_path, config, system_desc, system_json, power_is_valid, power_metric, power_efficiency, - ) = check_power_dir( + ) = checker.check_power_dir( power_path, ranging_path, perf_path, @@ -234,6 +234,7 @@ def clean_invalid_results(args, log_path, config, system_desc, system_json, config, ) except Exception as e: + log.error(e) power_is_valid = False if not power_is_valid: log.warning( From 1ccb2b18b53556c94dc1ca9ee120da39ae1fe195 Mon Sep 17 00:00:00 2001 From: Pablo Gonzalez Date: Thu, 31 Jul 2025 17:03:47 -0500 Subject: [PATCH 09/22] Only check for token latency requirements for server scenario (#2313) --- tools/submission/submission_checker.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/submission/submission_checker.py b/tools/submission/submission_checker.py index 152a64040c..f124b808c5 100755 --- a/tools/submission/submission_checker.py +++ b/tools/submission/submission_checker.py @@ -1486,8 +1486,8 @@ def check_accuracy_dir(config, model, path, verbose): def extra_check_llm(mlperf_log, scenario, model): if mlperf_log["requested_use_token_latencies"]: - if scenario == "Offline": - # For offline no further checks are necessary + if scenario not in ["Server", "Interactive"]: + # For offline, singlestream and 
multistream no further checks are necessary return True else: limits = LLM_LATENCY_LIMITS[model][scenario] From e41f484d786b768d76140bdeba5f3dc32906060f Mon Sep 17 00:00:00 2001 From: Pablo Gonzalez Date: Thu, 31 Jul 2025 19:18:39 -0500 Subject: [PATCH 10/22] Use server SUT for SingleStream (#2314) --- language/llama3.1-8b/main.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/language/llama3.1-8b/main.py b/language/llama3.1-8b/main.py index 341ba79374..a489e7d2bc 100644 --- a/language/llama3.1-8b/main.py +++ b/language/llama3.1-8b/main.py @@ -165,7 +165,7 @@ def main(): else: raise NotImplementedError - sut_map = {"offline": SUT, "server": SUTServer, "singlestream": SUT} + sut_map = {"offline": SUT, "server": SUTServer, "singlestream": SUTServer} sut_cls = sut_map[args.scenario.lower()] From 28c2fded25481e5ebd5f9c03712d89cf0ade87ed Mon Sep 17 00:00:00 2001 From: ANANDHU S <71482562+anandhu-eng@users.noreply.github.com> Date: Mon, 4 Aug 2025 22:35:26 +0530 Subject: [PATCH 11/22] Update the default value for repository arg (#2317) --- tools/submission/generate_final_report.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/submission/generate_final_report.py b/tools/submission/generate_final_report.py index 4c0ac7cb18..4cae3de5b7 100644 --- a/tools/submission/generate_final_report.py +++ b/tools/submission/generate_final_report.py @@ -22,7 +22,7 @@ def get_args(): parser.add_argument('--version', default='5.1', help='mlperf version') parser.add_argument( '--repository', - default='submissions_inference_5.1', + default='submissions_inference_v5.1', help='mlperf repository') parser.add_argument( '--repository-owner', From 71da6c88f212d35c7ff053e269e1c914fbf5007f Mon Sep 17 00:00:00 2001 From: Arjun Suresh Date: Wed, 6 Aug 2025 13:28:22 +0100 Subject: [PATCH 12/22] Update preprocess_submission.py | Skip inferring offline scenario if absent for the model (#2316) --- tools/submission/preprocess_submission.py | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/tools/submission/preprocess_submission.py b/tools/submission/preprocess_submission.py index 3ff95bb484..8fa1e68676 100644 --- a/tools/submission/preprocess_submission.py +++ b/tools/submission/preprocess_submission.py @@ -460,8 +460,10 @@ def infer_scenario_results(args, config): # infer both the scenarios from SS if infer_scenario_results: - tobeinferredpaths = [ - offline_scenario_path] + tobeinferredpaths = [] + if "Offline" in all_scenarios: + tobeinferredpaths.append( + offline_scenario_path) if "MultiStream" in all_scenarios: tobeinferredpaths.append( multistream_scenario_path) From 62bebd7c23ea9c4efdfbb85ed75bd8e2ce9c72ec Mon Sep 17 00:00:00 2001 From: Pablo Gonzalez Date: Wed, 6 Aug 2025 17:45:35 -0500 Subject: [PATCH 13/22] Fix: add llama3.1-8b-edge to generate_final_report (#2319) --- tools/submission/generate_final_report.py | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/submission/generate_final_report.py b/tools/submission/generate_final_report.py index 4cae3de5b7..6be6656202 100644 --- a/tools/submission/generate_final_report.py +++ b/tools/submission/generate_final_report.py @@ -142,6 +142,7 @@ def main(): "3d-unet-99", "3d-unet-99.9", "llama3.1-8b", + "llama3.1-8b-edge", "llama2-70b-99", "llama2-70b-99.9", "stable-diffusion-xl", From e051a9f09e01a6212229e1fc628d698b870b1335 Mon Sep 17 00:00:00 2001 From: Anton Lokhmotov Date: Thu, 7 Aug 2025 16:13:06 +0100 Subject: [PATCH 14/22] Allow lowercase 'interactive' as scenario name (#2315) --- 
tools/submission/submission_checker.py | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/tools/submission/submission_checker.py b/tools/submission/submission_checker.py index f124b808c5..021de96167 100755 --- a/tools/submission/submission_checker.py +++ b/tools/submission/submission_checker.py @@ -726,6 +726,7 @@ "multistream": "MultiStream", "server": "Server", "offline": "Offline", + "interactive": "Interactive", } RESULT_FIELD = { @@ -1487,7 +1488,8 @@ def check_accuracy_dir(config, model, path, verbose): def extra_check_llm(mlperf_log, scenario, model): if mlperf_log["requested_use_token_latencies"]: if scenario not in ["Server", "Interactive"]: - # For offline, singlestream and multistream no further checks are necessary + # For offline, singlestream and multistream no further checks are + # necessary return True else: limits = LLM_LATENCY_LIMITS[model][scenario] @@ -1887,7 +1889,7 @@ def get_power_metric(config, scenario_fixed, log_path, is_valid, res): samples_per_query = 8 if (scenario_fixed in ["MultiStream"] - ) and scenario in ["SingleStream"]: + ) and scenario in ["SingleStream"]: power_metric = ( avg_power * power_duration * samples_per_query * 1000 / num_queries ) From 8583a96bbd9f5b283d76eb9b753ac01ba5ff6252 Mon Sep 17 00:00:00 2001 From: Pablo Gonzalez Date: Wed, 20 Aug 2025 15:12:00 -0500 Subject: [PATCH 15/22] Use sample latency as the metric for llama3.1_8b_edge SingleStream (#2324) --- tools/submission/submission_checker.py | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/submission/submission_checker.py b/tools/submission/submission_checker.py index 021de96167..e9c931c4e9 100755 --- a/tools/submission/submission_checker.py +++ b/tools/submission/submission_checker.py @@ -801,6 +801,7 @@ }, "llama3.1-8b-edge": { "Offline": "result_tokens_per_second", + "SingleStream": "result_90.00_percentile_latency_ns", }, "mixtral-8x7b": { "Offline": "result_tokens_per_second", From 1161b2d752491d25c40ab8f1d121eba522688785 Mon Sep 17 00:00:00 2001 From: Anivar Aravind Date: Thu, 21 Aug 2025 08:47:58 +1000 Subject: [PATCH 16/22] Remove rclone references and update download instructions for DeepSeek-R1, Llama 3.1 8b, and Whisper (#2289) * Remove rclone references and update download instructions for DeepSeek-R1, Llama 3.1 8b, and Whisper - Replace rclone-based download instructions with new MLCommons downloader infrastructure - Update DeepSeek-R1, Llama 3.1 8b, and Whisper READMEs to use https://inference.mlcommons-storage.org - Maintain MLCFlow automation commands alongside native download methods - Add file size information for each download - Include -d flag documentation for custom download directories Fixes #2265 * Update download instructions to use MLCommons R2 downloader with correct URIs - Remove rclone-based download instructions - Replace .json URLs with correct .uri files from metadata directory - Update download commands for DeepSeek-R1, Llama 3.1 8b, and Whisper - Use new MLCommons downloader infrastructure - Remove file size information from download instructions * Update downloader commands in README.md to include default -d flags * Clarify separate datasets & model download commands in README.md * Fix MLFlow -> MLCFlow typo in README.md * MLCFlow commands update: model and dataset download * MLCFlow commands update: accuracy and dataset download * Fix typo in README.md --------- Co-authored-by: Nathan Wasson Co-authored-by: ANANDHU S <71482562+anandhu-eng@users.noreply.github.com> Co-authored-by: Arjun Suresh --- 
language/deepseek-r1/README.md | 67 +++++++++++++++++++--------------- language/llama3.1-8b/README.md | 50 ++++++++++++------------- speech2text/README.md | 44 ++++++++++++---------- 3 files changed, 86 insertions(+), 75 deletions(-) diff --git a/language/deepseek-r1/README.md b/language/deepseek-r1/README.md index 2a4d85d7f2..a6c30a6155 100644 --- a/language/deepseek-r1/README.md +++ b/language/deepseek-r1/README.md @@ -1,6 +1,6 @@ -# Mlperf Inference DeepSeek Reference Implementation +# MLPerf Inference DeepSeek Reference Implementation -## Automated command to run the benchmark via MLFlow +## Automated command to run the benchmark via MLCFlow Please see the [new docs site](https://docs.mlcommons.org/inference/benchmarks/language/deepseek-r1/) for an automated way to run this benchmark across different available implementations and do an end-to-end submission with or without docker. @@ -13,6 +13,22 @@ You can also do pip install mlc-scripts and then use `mlcr` commands for downloa - DeepSeek-R1 model is automatically downloaded as part of setup - Checkpoint conversion is done transparently when needed. +**Using the MLC R2 Downloader** + +Download the model using the MLCommons R2 Downloader: + +```bash +bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \ + https://inference.mlcommons-storage.org/metadata/deepseek-r1-0528.uri +``` + +To specify a custom download directory, use the `-d` flag: +```bash +bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \ + -d /path/to/download/directory \ + https://inference.mlcommons-storage.org/metadata/deepseek-r1-0528.uri +``` + ## Dataset Download The dataset is an ensemble of the datasets: AIME, MATH500, gpqa, MMLU-Pro, livecodebench(code_generation_lite). They are covered by the following licenses: @@ -23,49 +39,40 @@ The dataset is an ensemble of the datasets: AIME, MATH500, gpqa, MMLU-Pro, livec - MMLU-Pro: [MIT](https://opensource.org/license/mit) - livecodebench(code_generation_lite): [CC](https://creativecommons.org/share-your-work/cclicenses/) -### Preprocessed - -**Using MLCFlow Automation** - -``` -mlcr get,dataset,whisper,_preprocessed,_mlc,_rclone --outdirname= -j -``` +### Preprocessed & Calibration -**Using Native method** +**Using the MLC R2 Downloader** -You can use Rclone to download the preprocessed dataset from a Cloudflare R2 bucket. +Download the full preprocessed dataset and calibration dataset using the MLCommons R2 Downloader: -To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows). 
-To install Rclone on Linux/macOS/BSD systems, run: -``` -sudo -v ; curl https://rclone.org/install.sh | sudo bash -``` -Once Rclone is installed, run the following command to authenticate with the bucket: -``` -rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com +```bash +bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \ +-d ./ https://inference.mlcommons-storage.org/metadata/deepseek-r1-datasets-fp8-eval.uri ``` -You can then navigate in the terminal to your desired download directory and run the following command to download the dataset: -``` -rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/datasets/mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl ./ -P +This will download the full preprocessed dataset file (`mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl`) and the calibration dataset file (`mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl`). + +To specify a custom download directory, use the `-d` flag: +```bash +bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \ + -d /path/to/download/directory \ + https://inference.mlcommons-storage.org/metadata/deepseek-r1-datasets-fp8-eval.uri ``` -### Calibration +### Preprocessed **Using MLCFlow Automation** ``` -mlcr get,preprocessed,dataset,deepseek-r1,_calibration,_mlc,_rclone --outdirname= -j +mlcr get,preprocessed,dataset,deepseek-r1,_validation,_mlc,_r2-downloader --outdirname= -j ``` -**Using Native method** - -Download and install Rclone as described in the previous section. +### Calibration -Then navigate in the terminal to your desired download directory and run the following command to download the dataset: +**Using MLCFlow Automation** ``` -rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/datasets/mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl ./ -P +mlcr get,preprocessed,dataset,deepseek-r1,_calibration,_mlc,_r2-downloader --outdirname= -j ``` ## Docker @@ -204,7 +211,7 @@ The following table shows which backends support different evaluation and MLPerf **Using MLCFlow Automation** ``` -TBD +mlcr run,accuracy,mlperf,_dataset_deepseek-r1 --result_dir= ``` **Using Native method** diff --git a/language/llama3.1-8b/README.md b/language/llama3.1-8b/README.md index 9dd571411b..1ba19c204b 100644 --- a/language/llama3.1-8b/README.md +++ b/language/llama3.1-8b/README.md @@ -104,7 +104,7 @@ You need to request for access to [MLCommons](http://llama3-1.mlcommons.org/) an **Official Model download using MLCFlow Automation** You can download the model automatically via the below command ``` -TBD +mlcr get,ml-model,llama3,_mlc,_8b,_r2-downloader --outdirname= -j ``` @@ -137,59 +137,57 @@ Downloading llama3.1-8b model from Hugging Face will require an [**access token* ### Preprocessed -You can use Rclone to download the preprocessed dataset from a Cloudflare R2 bucket. - -To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows). 
-To install Rclone on Linux/macOS/BSD systems, run: -``` -sudo -v ; curl https://rclone.org/install.sh | sudo bash -``` -Once Rclone is installed, run the following command to authenticate with the bucket: -``` -rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com -``` -You can then navigate in the terminal to your desired download directory and run the following command to download the dataset: +Download the preprocessed datasets using the MLCommons downloader: #### Full dataset (datacenter) **Using MLCFlow Automation** ``` -mlcr get,dataset,cnndm,_validation,_datacenter,_llama3,_mlc,_rclone --outdirname= -j +mlcr get,dataset,cnndm,_validation,_datacenter,_llama3,_mlc,_r2-downloader --outdirname= -j ``` **Native method** +```bash +bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \ + https://inference.mlcommons-storage.org/metadata/llama3-1-8b-cnn-eval.uri ``` -rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_8b/datasets/cnn_eval.json ./ -P -``` +This will download `cnn_eval.json`. #### 5000 samples (edge) **Using MLCFlow Automation** ``` -mlcr get,dataset,cnndm,_validation,_edge,_llama3,_mlc,_rclone --outdirname= -j +mlcr get,dataset,cnndm,_validation,_edge,_llama3,_mlc,_r2-downloader --outdirname= -j ``` **Native method** +```bash +bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \ + https://inference.mlcommons-storage.org/metadata/llama3-1-8b-sample-cnn-eval-5000.uri ``` -rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_8b/datasets/cnn_eval_5000.json ./ -P -``` + +This will download `sample_cnn_eval_5000.json`. + #### Calibration **Using MLCFlow Automation** ``` -mlcr get,dataset,cnndm,_calibration,_llama3,_mlc,_rclone --outdirname= -j +mlcr get,dataset,cnndm,_calibration,_llama3,_mlc,_r2-downloader --outdirname= -j ``` **Native method** +```bash +bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \ + https://inference.mlcommons-storage.org/metadata/llama3-1-8b-cnn-dailymail-calibration.uri ``` -rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_8b/datasets/cnn_dailymail_calibration.json ./ -P -``` - -You can also download the calibration dataset from the Cloudflare R2 bucket by running the following command: +This will download `cnn_dailymail_calibration.json`. -``` -rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_8b/cnn_eval.json ./ -P +To specify a custom download directory for any of these, use the `-d` flag: +```bash +bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \ + -d /path/to/download/directory \ + ``` diff --git a/speech2text/README.md b/speech2text/README.md index aae95e3c43..972fb965c2 100644 --- a/speech2text/README.md +++ b/speech2text/README.md @@ -102,26 +102,24 @@ VLLM_TARGET_DEVICE=cpu pip install --break-system-packages . 
--no-build-isolatio You can download the model automatically via the below command ``` -mlcr get,ml-model,whisper,_rclone,_mlc --outdirname= -j +mlcr get,ml-model,whisper,_r2-downloader,_mlc --outdirname= -j ``` -**Official Model download using native method** +**Official Model download using MLC R2 Downloader** -You can use Rclone to download the preprocessed dataset from a Cloudflare R2 bucket. +Download the Whisper model using the MLCommons downloader: -To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows). -To install Rclone on Linux/macOS/BSD systems, run: -``` -sudo -v ; curl https://rclone.org/install.sh | sudo bash -``` -Once Rclone is installed, run the following command to authenticate with the bucket: -``` -rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com +```bash +bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d whisper/model https://inference.mlcommons-storage.org/metadata/whisper-model.uri ``` -You can then navigate in the terminal to your desired download directory and run the following command to download the model: -``` -rclone copy mlc-inference:mlcommons-inference-wg-public/Whisper/model/ ./ -P +This will download the Whisper model files. + +To specify a custom download directory, use the `-d` flag: +```bash +bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \ + -d /path/to/download/directory \ + https://inference.mlcommons-storage.org/metadata/whisper-model.uri ``` ### External Download (Not recommended for official submission) @@ -153,16 +151,24 @@ We use dev-clean and dev-other splits, which are approximately 10 hours. **Using MLCFlow Automation** ``` -mlcr get,dataset,whisper,_preprocessed,_mlc,_rclone --outdirname= -j +mlcr get,dataset,whisper,_preprocessed,_mlc,_r2-downloader --outdirname= -j ``` -**Native method** +**Using MLC R2 Downloader** -Download and install rclone as decribed in the [MLCommons Download section](#mlcommons-download) +Download the preprocessed dataset using the MLCommons R2 Downloader: -You can then navigate in the terminal to your desired download directory and run the following command to download the dataset: +```bash +bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d whisper/dataset https://inference.mlcommons-storage.org/metadata/whisper-dataset.uri ``` -rclone copy mlc-inference:mlcommons-inference-wg-public/Whisper/dataset/ ./ -P + +This will download the LibriSpeech dataset files. + +To specify a custom download directory, use the `-d` flag: +```bash +bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \ + -d /path/to/download/directory \ + https://inference.mlcommons-storage.org/metadata/whisper-dataset.uri ``` ### Unprocessed From aa8b5da1e8e61ce5b9dd4e6d8b5863d80910987e Mon Sep 17 00:00:00 2001 From: Anivar A Aravind Date: Thu, 21 Aug 2025 10:03:40 +0530 Subject: [PATCH 17/22] Fix preprocessing documentation with verified implementations Update PREPROCESSING.md files with correct information based on actual code. 
- DeepSeek-R1: Use apply_chat_template, 32K context - Llama 3.1-8B: Use instruction template for summarization - Add general preprocessing guide and examples --- language/PREPROCESSING_GUIDE.md | 139 +++++++++++++++++++++ language/deepseek-r1/PREPROCESSING.md | 153 ++++++++--------------- language/llama3.1-8b/PREPROCESSING.md | 101 +++++----------- language/preprocessing_examples.py | 168 ++++++++++++++++++++++++++ 4 files changed, 388 insertions(+), 173 deletions(-) create mode 100644 language/PREPROCESSING_GUIDE.md create mode 100644 language/preprocessing_examples.py diff --git a/language/PREPROCESSING_GUIDE.md b/language/PREPROCESSING_GUIDE.md new file mode 100644 index 0000000000..0fef3d0758 --- /dev/null +++ b/language/PREPROCESSING_GUIDE.md @@ -0,0 +1,139 @@ +# MLCommons Inference - General Preprocessing Guide + +## Overview + +This guide covers common preprocessing patterns across all language models in MLCommons Inference benchmarks. Preprocessing varies by: +1. Model architecture +2. Backend choice (PyTorch, vLLM, SGLang) +3. Task type (summarization, Q&A, etc.) + +## Common Tokenizer Setup Pattern + +Most models follow this pattern: + +```python +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained(model_name) +tokenizer.padding_side = "left" # Critical for generation +tokenizer.pad_token = tokenizer.eos_token +``` + +## Backend Dependencies + +Different backends have different preprocessing requirements: + +| Backend | Input Type | Chat Template Support | Use Case | +|---------|------------|---------------------|----------| +| PyTorch | Tokenized | Varies by model | Distributed inference | +| vLLM | Text | Varies by model | High-throughput serving | +| SGLang | Text | Usually disabled | Optimized serving | + +## Dataset Format + +All models expect datasets with these common fields: + +```python +{ + 'text_input': str, # Raw prompt text (required) + 'tok_input': List[int], # Pre-tokenized input (optional) + 'output': str, # Expected output for evaluation +} +``` + +## Model-Specific Preprocessing + +### Models Using Chat Templates +- **DeepSeek-R1**: Uses `apply_chat_template` with PyTorch/vLLM +- **Potential others**: Check `uses_chat_template` in backend registry + +### Models Using Simple Templates +- **Llama 3.1-8B**: Instruction format for summarization +- **Llama 2-70B**: Custom format with `[INST]` markers +- **Mixtral-8x7B**: Simple instruction format + +### Models Using Raw Prompts +- **GPT-J**: Completion-style, no special formatting + +## Preprocessing Steps + +1. **Load the tokenizer** with appropriate configuration +2. **Apply model-specific formatting** (chat template or instruction format) +3. **Tokenize** with proper truncation and max length +4. 
**Handle padding** (left-side for generation models) + +## Example: Generic Preprocessing Function + +```python +def preprocess_for_model(text, model_name, backend="pytorch"): + """Generic preprocessing based on model and backend""" + + # Load tokenizer + tokenizer = AutoTokenizer.from_pretrained(model_name) + tokenizer.padding_side = "left" + tokenizer.pad_token = tokenizer.eos_token + + # Check if chat template should be used + if should_use_chat_template(model_name, backend): + tokens = tokenizer.apply_chat_template( + [{"role": "user", "content": text}], + add_generation_prompt=True, + truncation=True, + max_length=get_max_length(model_name) + ) + else: + # Apply model-specific template or use raw text + formatted_text = apply_model_template(text, model_name) + tokens = tokenizer.encode( + formatted_text, + truncation=True, + max_length=get_max_length(model_name) + ) + + return tokens +``` + +## Max Context Lengths + +| Model | Max Length | Notes | +|-------|------------|-------| +| DeepSeek-R1 | 32,768 | 32K context | +| Llama 3.1-8B | 8,000 | For preprocessing | +| Llama 2-70B | 1,024 | Limited context | +| Mixtral-8x7B | 1,024 | From dataset.py | +| GPT-J | ~2,048 | Standard GPT-J limit | + +## Running Inference + +```bash +# Set backend +export MLPERF_BACKEND=pytorch # or vllm, sglang + +# PyTorch backend (distributed) +torchrun --nproc_per_node=8 run_eval_mpi.py --input-file data.pkl + +# vLLM/SGLang backends +python run_eval.py --input-file data.pkl +``` + +## Common Issues + +1. **Wrong padding side**: Always use `padding_side="left"` for generation +2. **Missing pad token**: Set `pad_token = eos_token` +3. **Backend mismatch**: Ensure preprocessing matches backend requirements +4. **Context overflow**: Respect model's maximum context length + +## Validation + +To ensure correct preprocessing: + +1. Check tokenized length doesn't exceed max +2. Verify special tokens are properly placed +3. Test with a few examples before full dataset +4. 
Compare against reference outputs + +## References + +- Model-specific guides in each model's directory +- Backend configuration in `utils/backend_registry.py` +- Tokenization utilities in `utils/tokenization.py` \ No newline at end of file diff --git a/language/deepseek-r1/PREPROCESSING.md b/language/deepseek-r1/PREPROCESSING.md index 9f813197bb..2b5aa2cee9 100644 --- a/language/deepseek-r1/PREPROCESSING.md +++ b/language/deepseek-r1/PREPROCESSING.md @@ -1,113 +1,56 @@ -# Dataset Preprocessing Documentation - DeepSeek-R1 - -## Model: DeepSeek-R1 -**Dataset:** Multi-domain Evaluation Ensemble -**Evaluation Task:** Multi-domain Reasoning and Code Generation - -## Data Source -- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2 bucket -- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/` -- **Components:** AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench (code_generation_lite) -- **Licenses:** - - AIME: [CC0](https://creativecommons.org/public-domain/cc0/) - - MATH500: [MIT](https://opensource.org/license/mit) - - GPQA: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) - - MMLU-Pro: [MIT](https://opensource.org/license/mit) - - LiveCodeBench: [CC](https://creativecommons.org/share-your-work/cclicenses/) - -## Current Implementation - -### Files Available -- **Main Dataset:** `mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl` -- **Calibration Set:** `mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl` -- **Format:** Preprocessed pickle files ready for evaluation - -### Download Process -```bash -# Install Rclone -sudo -v ; curl https://rclone.org/install.sh | sudo bash - -# Configure access -rclone config create mlc-inference s3 provider=Cloudflare \ - access_key_id=f65ba5eef400db161ea49967de89f47b \ - secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b \ - endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com - -# Download datasets -rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl ./ -P -rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl ./ -P +# DeepSeek-R1 Preprocessing + +## Model Configuration +- **Model**: `deepseek-ai/DeepSeek-R1` +- **Revision**: `56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad` +- **Max Length**: 32,768 tokens (32K) + +## Tokenization +```python +from transformers import AutoTokenizer + +# From utils/tokenization.py +tokenizer = AutoTokenizer.from_pretrained( + "deepseek-ai/DeepSeek-R1", + revision="56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad" +) ``` -## Missing Documentation (Addresses Issue #2245) - -The following preprocessing information is **not currently available**, making reproduction and adaptation difficult: - -### 1. Original Data Sources -- **Raw Dataset Locations:** Where each component dataset was obtained -- **Version Information:** Specific versions/commits of source datasets -- **Access Methods:** How to obtain raw data independently - -### 2. Preprocessing Pipeline -- **Tokenization Method:** Which tokenizer was used and configuration -- **Input Formatting:** How different dataset formats were standardized -- **Quality Filtering:** Criteria for sample inclusion/exclusion -- **Ensemble Strategy:** How multiple datasets were combined - -### 3. 
Dataset Statistics -- **Sample Counts:** Number of samples from each component dataset -- **Distribution:** How samples are balanced across domains -- **Difficulty Levels:** Complexity distribution of included problems +## Preprocessing Method -### 4. Validation Process -- **Quality Control:** How preprocessing quality was verified -- **Consistency Checks:** Validation of format standardization -- **Error Handling:** How malformed samples were addressed +The preprocessing varies by backend: -## Adaptation Challenges - -**For Different Tokenizers:** -- Cannot modify tokenization without access to raw data -- No documentation of original tokenization parameters -- Unable to test preprocessing consistency - -**For Different Models:** -- Cannot adapt input formatting without preprocessing scripts -- No guidance on prompt template modifications -- Unable to reproduce dataset with different filtering criteria - -## Recommended Improvements - -To fully address issue #2245 and improve reproducibility: - -### 1. Raw Data Access -- Provide scripts to download original datasets -- Document exact versions and sources used -- Include data licenses and attribution - -### 2. Preprocessing Scripts -- Create preprocessing pipeline (similar to `llama2-70b/processorca.py`) -- Document tokenization and formatting steps -- Include quality filtering logic - -### 3. Documentation -- Add detailed preprocessing methodology -- Include dataset statistics and composition -- Provide adaptation guidelines +### PyTorch/vLLM Backends (Chat Template Enabled) +```python +# From utils/tokenization.py +tokens = tokenizer.apply_chat_template( + [{"role": "user", "content": prompt}], + add_generation_prompt=True, + max_length=32768, + truncation=True +) +``` -### 4. Validation -- Include preprocessing verification scripts -- Document expected outputs and checksums -- Provide quality metrics +### SGLang Backend (No Chat Template) +```python +tokens = tokenizer.encode( + prompt, + truncation=True, + max_length=32768 +) +``` -## Temporary Workaround +## Backend Configuration +| Backend | uses_chat_template | input_type | +|---------|-------------------|------------| +| PyTorch | True | tokenized | +| vLLM | True | text | +| SGLang | False | text | -Until full preprocessing documentation is available: -1. Use provided preprocessed datasets for standard evaluation -2. Contact maintainers for specific adaptation requirements -3. Reference `llama2-70b/processorca.py` for preprocessing patterns -4. Consider contributing preprocessing scripts based on reverse engineering +## Dataset Format +Input data should have a `text_input` column containing the prompts. 
-## See Also -- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing -- `PREPROCESSING-TEMPLATE.md` - Standard template for future models -- Repository issue #2245 - Discussion of preprocessing documentation gaps \ No newline at end of file +## Accuracy Target +``` +"mean-accuracy": 81.3582 +``` \ No newline at end of file diff --git a/language/llama3.1-8b/PREPROCESSING.md b/language/llama3.1-8b/PREPROCESSING.md index 84f0914fbf..2ff10d7c6e 100644 --- a/language/llama3.1-8b/PREPROCESSING.md +++ b/language/llama3.1-8b/PREPROCESSING.md @@ -1,82 +1,47 @@ -# Dataset Preprocessing Documentation - Llama3.1-8B +# Llama 3.1 8B Preprocessing -## Model: Llama3.1-8B -**Dataset:** CNN/DailyMail 3.0.0 -**Evaluation Task:** Text Summarization +## Model Configuration +- **Model**: `meta-llama/Llama-3.1-8B-Instruct` +- **Revision**: `be673f326cab4cd22ccfef76109faf68e41aa5f1` (for download) +- **Max Length**: 8,000 tokens (in preprocessing scripts) -## Data Source -- **Raw Dataset:** Hugging Face `cnn_dailymail` dataset v3.0.0 -- **Download Method:** `datasets.load_dataset("cnn_dailymail", "3.0.0", split="train")` -- **License:** Apache 2.0 -- **Download Script:** `download_cnndm.py` - -## Preprocessing Pipeline - -### 1. Tokenization +## Tokenization ```python from transformers import AutoTokenizer -tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct") + +# From prepare-calibration.py +tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct") tokenizer.padding_side = "left" tokenizer.pad_token = tokenizer.eos_token tokenizer.model_max_length = 8000 ``` -### 2. Input Template -``` -Summarize the following news article in 128 tokens. Please output the summary only, without any other text. - -Article: -{article} +## Prompt Template (CNN/DailyMail Summarization) +```python +# From prepare-calibration.py and download_cnndm.py +instruction_template = "Summarize the following news article in 128 tokens. Please output the summary only, without any other text.\n\nArticle:\n{input}\n\nSummary:" -Summary: +# Tokenize +x["tok_input"] = tokenizer.encode(instruction_template.format_map(x)) ``` -### 3. Current Implementation -- **Download:** `download_cnndm.py` loads CNN/DailyMail dataset -- **Calibration:** `prepare-calibration.py` creates calibration subset -- **Evaluation:** Uses `evaluation.py` for accuracy assessment - -## Missing Documentation (Addresses Issue #2245) - -The following preprocessing steps are **not currently documented** but would be needed for full reproducibility: - -### 4. Filtering Steps (Recommended) -Based on `llama2-70b/processorca.py` patterns: -- **Language Filter:** English-only content validation -- **Length Filter:** Input/output sequence length limits -- **Quality Filter:** Remove very short summaries -- **Content Filter:** Handle special characters and formatting +**Note**: This uses a simple instruction format, NOT the chat template with special tokens. -### 5. Sampling Strategy (Recommended) -- **Dataset Size:** Specify number of evaluation samples -- **Selection Method:** Random vs stratified sampling -- **Validation:** How to verify preprocessing consistency - -## Adaptation Guide - -**For Different Tokenizers:** -1. Update `model-id` parameter in scripts -2. Adjust `model_max_length` based on tokenizer capabilities -3. Verify special token handling (pad_token, eos_token) - -**For Different Models:** -1. Modify input template format -2. Adjust summary length requirements (currently 128 tokens) -3. 
Update evaluation criteria as needed - -## Files Generated -- **Main Dataset:** Downloaded via `download_cnndm.py` -- **Calibration Set:** Generated via `prepare-calibration.py` -- **Format:** Standard CNN/DailyMail format from Hugging Face - -## Next Steps for Full Reproducibility - -To fully address issue #2245, consider adding: -1. Complete preprocessing script (similar to `llama2-70b/processorca.py`) -2. Documentation of filtering criteria -3. Sampling methodology -4. Quality validation steps +## Dataset Preparation +```python +# Example from prepare-calibration.py +x = dict() +x["instruction"] = instruction_template +x["input"] = calibration_sample["article"] +x["tok_input"] = tokenizer.encode(instruction_template.format_map(x)) +x["output"] = calibration_sample["highlights"] +``` -## See Also -- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing -- `PREPROCESSING-TEMPLATE.md` - Standard template for future models \ No newline at end of file +## Accuracy Targets (BF16) +``` +Datacenter: +- rouge1: 38.7792 +- rouge2: 15.9075 +- rougeL: 24.4957 +- rougeLsum: 35.793 +``` \ No newline at end of file diff --git a/language/preprocessing_examples.py b/language/preprocessing_examples.py new file mode 100644 index 0000000000..bf3fa02973 --- /dev/null +++ b/language/preprocessing_examples.py @@ -0,0 +1,168 @@ +#!/usr/bin/env python3 +""" +MLCommons Inference - Preprocessing Examples + +This script demonstrates correct preprocessing for different models. +Based on actual implementations in the codebase. +""" + +from transformers import AutoTokenizer +import pandas as pd + + +def preprocess_deepseek_r1(prompts, use_chat_template=True): + """ + Preprocess prompts for DeepSeek-R1 model. + + Args: + prompts: List of text prompts + use_chat_template: Whether to use chat template (depends on backend) + + Returns: + List of tokenized prompts + """ + tokenizer = AutoTokenizer.from_pretrained( + "deepseek-ai/DeepSeek-R1", + revision="56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad" + ) + + tokenized = [] + for prompt in prompts: + if use_chat_template and hasattr(tokenizer, 'apply_chat_template'): + tokens = tokenizer.apply_chat_template( + [{"role": "user", "content": prompt}], + add_generation_prompt=True, + max_length=32768, + truncation=True + ) + else: + tokens = tokenizer.encode( + prompt, + truncation=True, + max_length=32768 + ) + tokenized.append(tokens) + + return tokenized + + +def preprocess_llama31_8b(articles): + """ + Preprocess articles for Llama 3.1-8B summarization. + + Args: + articles: List of articles to summarize + + Returns: + List of tokenized prompts + """ + tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct") + tokenizer.padding_side = "left" + tokenizer.pad_token = tokenizer.eos_token + tokenizer.model_max_length = 8000 + + # Template from prepare-calibration.py + instruction_template = "Summarize the following news article in 128 tokens. Please output the summary only, without any other text.\n\nArticle:\n{input}\n\nSummary:" + + tokenized = [] + for article in articles: + prompt = instruction_template.format(input=article) + tokens = tokenizer.encode(prompt, max_length=8000, truncation=True) + tokenized.append(tokens) + + return tokenized + + +def preprocess_llama2_70b(prompts, system_prompts=None): + """ + Preprocess prompts for Llama 2-70B model. 
+ + Args: + prompts: List of user prompts + system_prompts: Optional list of system prompts + + Returns: + List of tokenized prompts + """ + tokenizer = AutoTokenizer.from_pretrained( + "meta-llama/Llama-2-70b-chat-hf", + use_fast=False + ) + tokenizer.padding_side = "left" + tokenizer.pad_token = tokenizer.eos_token + + # Templates from processorca.py + llama_prompt_system = "[INST] <<SYS>>\n{}\n<</SYS>>\n\n{} [/INST]" + llama_prompt_no_system = "[INST] {} [/INST]" + + tokenized = [] + for i, prompt in enumerate(prompts): + if system_prompts and system_prompts[i]: + formatted = llama_prompt_system.format(system_prompts[i], prompt) + else: + formatted = llama_prompt_no_system.format(prompt) + + tokens = tokenizer.encode(formatted, max_length=1024, truncation=True) + tokenized.append(tokens) + + return tokenized + + +def create_dataset_format(prompts, tokenized_prompts, outputs=None): + """ + Create dataset in expected format for MLCommons. + + Args: + prompts: List of text prompts + tokenized_prompts: List of tokenized prompts + outputs: Optional list of expected outputs + + Returns: + DataFrame in expected format + """ + data = { + 'text_input': prompts, + 'tok_input': tokenized_prompts, + } + + if outputs: + data['output'] = outputs + + return pd.DataFrame(data) + + +# Example usage +if __name__ == "__main__": + # Example 1: DeepSeek-R1 + print("=== DeepSeek-R1 Example ===") + deepseek_prompts = [ + "What is machine learning?", + "Explain quantum computing in simple terms." + ] + + # With chat template (PyTorch/vLLM) + deepseek_tokens = preprocess_deepseek_r1(deepseek_prompts, use_chat_template=True) + print(f"Prompt 1 token count: {len(deepseek_tokens[0])}") + + # Without chat template (SGLang) + deepseek_tokens_no_chat = preprocess_deepseek_r1(deepseek_prompts, use_chat_template=False) + print(f"Prompt 1 token count (no chat): {len(deepseek_tokens_no_chat[0])}") + + # Example 2: Llama 3.1-8B + print("\n=== Llama 3.1-8B Example ===") + articles = [ + "The United Nations announced today a new climate initiative aimed at reducing global emissions by 50% by 2030. The plan includes partnerships with major corporations and governments worldwide." 
+ ] + + llama_tokens = preprocess_llama31_8b(articles) + print(f"Article 1 token count: {len(llama_tokens[0])}") + + # Example 3: Create dataset + print("\n=== Dataset Format Example ===") + df = create_dataset_format(deepseek_prompts, deepseek_tokens) + print(df.head()) + print(f"\nDataset shape: {df.shape}") + print(f"Columns: {list(df.columns)}") + + # Save example + # df.to_pickle("preprocessed_data.pkl") \ No newline at end of file From a2aaa919418ca79649d2fb306ca8258609b3052f Mon Sep 17 00:00:00 2001 From: ANANDHU S <71482562+anandhu-eng@users.noreply.github.com> Date: Mon, 1 Sep 2025 21:02:09 +0530 Subject: [PATCH 18/22] hide long time untested implementations from docs (#2328) --- docs/benchmarks/image_classification/resnet50.md | 4 ++++ docs/benchmarks/language/bert.md | 2 ++ docs/benchmarks/language/gpt-j.md | 2 ++ docs/benchmarks/language/llama2-70b.md | 2 ++ docs/benchmarks/medical_imaging/3d-unet.md | 2 ++ docs/benchmarks/object_detection/retinanet.md | 2 ++ docs/benchmarks/recommendation/dlrm-v2.md | 2 ++ docs/benchmarks/text_to_image/sdxl.md | 3 ++- mkdocs.yml | 10 ++-------- 9 files changed, 20 insertions(+), 9 deletions(-) diff --git a/docs/benchmarks/image_classification/resnet50.md b/docs/benchmarks/image_classification/resnet50.md index 4172158dcc..3840e6be6b 100644 --- a/docs/benchmarks/image_classification/resnet50.md +++ b/docs/benchmarks/image_classification/resnet50.md @@ -17,6 +17,8 @@ hide: {{ mlperf_inference_implementation_readme (4, "resnet50", "nvidia") }} + + === "Intel" ## Intel MLPerf Implementation @@ -31,3 +33,5 @@ hide: ## MLPerf Modular Implementation in C++ {{ mlperf_inference_implementation_readme (4, "resnet50", "cpp") }} + +--> diff --git a/docs/benchmarks/language/bert.md b/docs/benchmarks/language/bert.md index 57b1cf2246..51ac91c86e 100644 --- a/docs/benchmarks/language/bert.md +++ b/docs/benchmarks/language/bert.md @@ -19,6 +19,7 @@ hide: {{ mlperf_inference_implementation_readme (4, "bert-99.9", "nvidia") }} +s \ No newline at end of file diff --git a/docs/benchmarks/language/gpt-j.md b/docs/benchmarks/language/gpt-j.md index d2f5458152..7c230593a8 100644 --- a/docs/benchmarks/language/gpt-j.md +++ b/docs/benchmarks/language/gpt-j.md @@ -23,6 +23,7 @@ hide: {{ mlperf_inference_implementation_readme (4, "gptj-99.9", "nvidia") }} + \ No newline at end of file diff --git a/docs/benchmarks/language/llama2-70b.md b/docs/benchmarks/language/llama2-70b.md index 40c62cf714..effc7b18a6 100644 --- a/docs/benchmarks/language/llama2-70b.md +++ b/docs/benchmarks/language/llama2-70b.md @@ -19,6 +19,7 @@ hide: {{ mlperf_inference_implementation_readme (4, "llama2-70b-99.9", "nvidia") }} + \ No newline at end of file diff --git a/docs/benchmarks/medical_imaging/3d-unet.md b/docs/benchmarks/medical_imaging/3d-unet.md index 72d5eed493..c5a8a56800 100644 --- a/docs/benchmarks/medical_imaging/3d-unet.md +++ b/docs/benchmarks/medical_imaging/3d-unet.md @@ -22,6 +22,7 @@ hide: {{ mlperf_inference_implementation_readme (4, "3d-unet-99.9", "nvidia") }} + \ No newline at end of file diff --git a/docs/benchmarks/object_detection/retinanet.md b/docs/benchmarks/object_detection/retinanet.md index 699d920504..0461379dd6 100644 --- a/docs/benchmarks/object_detection/retinanet.md +++ b/docs/benchmarks/object_detection/retinanet.md @@ -15,6 +15,7 @@ hide: {{ mlperf_inference_implementation_readme (4, "retinanet", "nvidia") }} + \ No newline at end of file diff --git a/docs/benchmarks/recommendation/dlrm-v2.md b/docs/benchmarks/recommendation/dlrm-v2.md index 
b539c1607e..18a38af333 100644 --- a/docs/benchmarks/recommendation/dlrm-v2.md +++ b/docs/benchmarks/recommendation/dlrm-v2.md @@ -19,9 +19,11 @@ hide: {{ mlperf_inference_implementation_readme (4, "dlrm-v2-99.9", "nvidia") }} + \ No newline at end of file diff --git a/docs/benchmarks/text_to_image/sdxl.md b/docs/benchmarks/text_to_image/sdxl.md index 575f4cabbf..622fc2a622 100644 --- a/docs/benchmarks/text_to_image/sdxl.md +++ b/docs/benchmarks/text_to_image/sdxl.md @@ -16,7 +16,8 @@ hide: {{ mlperf_inference_implementation_readme (4, "sdxl", "nvidia") }} + diff --git a/mkdocs.yml b/mkdocs.yml index db1a1fb23a..730d7eff2d 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -29,10 +29,7 @@ nav: - Image Classification: - ResNet50: benchmarks/image_classification/resnet50.md - Text to Image: - - Stable Diffusion: - - Run Commands: benchmarks/text_to_image/sdxl.md - - Reproducibility: - - SCC24: benchmarks/text_to_image/reproducibility/scc24.md + - Stable Diffusion: benchmarks/text_to_image/sdxl.md - 2D Object Detection: - RetinaNet: benchmarks/object_detection/retinanet.md - Automotive: @@ -41,10 +38,7 @@ nav: - Medical Imaging: - 3d-unet: benchmarks/medical_imaging/3d-unet.md - Language Processing: - - Bert-Large: - - Run Commands: benchmarks/language/bert.md - - Reproducibility: - - IndySCC24: benchmarks/language/reproducibility/indyscc24-bert.md + - Bert-Large: benchmarks/language/bert.md - GPT-J: benchmarks/language/gpt-j.md - LLAMA2-70B: benchmarks/language/llama2-70b.md - LLAMA3-405B: benchmarks/language/llama3_1-405b.md From 5f8019a520a725142213b86c5baf7b99d3b02fdb Mon Sep 17 00:00:00 2001 From: ANANDHU S <71482562+anandhu-eng@users.noreply.github.com> Date: Wed, 10 Sep 2025 22:08:00 +0530 Subject: [PATCH 19/22] Initial draft for SCC 25 documentation (#2331) * Initial draft for SCC 25 documentation * Update scc25.md --- docs/benchmarks/language/scc25_guide/scc25.md | 109 ++++++++++++++++++ main.py | 51 ++++---- mkdocs.yml | 10 +- 3 files changed, 148 insertions(+), 22 deletions(-) create mode 100644 docs/benchmarks/language/scc25_guide/scc25.md diff --git a/docs/benchmarks/language/scc25_guide/scc25.md b/docs/benchmarks/language/scc25_guide/scc25.md new file mode 100644 index 0000000000..618cfcd74f --- /dev/null +++ b/docs/benchmarks/language/scc25_guide/scc25.md @@ -0,0 +1,109 @@ +--- +hide: + - toc +--- + +# Text Summarization with Llama2-70b for Student Cluster Competition 2025 + +## Introduction + +This guide is designed for the [Student Cluster Competition 2025](https://sc25.supercomputing.org/students/student-cluster-competition/) to walk participants through running and optimizing the [MLPerf Inference Benchmark](https://arxiv.org/abs/1911.02549) using [Llama2 70b](https://github.com/mlcommons/inference/tree/master/language/llama2-70b) across various software and hardware configurations. The goal is to maximize system throughput (measured in Tokens per second) without compromising accuracy. Since the model performs poorly on CPUs, it is essential to run it on GPUs. + +For a valid MLPerf Inference submission in this competition, you must run both a performance test and an accuracy test—**no compliance runs are required**. We use the **Offline** scenario, where throughput is the key metric (higher is better). For Llama 2-70B with the OpenOrca dataset (24,576 samples), the **performance run** must process an integer multiple of the full dataset (24,576 × *N* samples), while the **accuracy run** must process **exactly** the full dataset (24,576 samples). 
These requirements are taken care of by the MLPerf inference implementations. Setup for NVIDIA GPUs typically takes 2–3 hours and can be done offline. The final output is a tarball (`mlperf_submission.tar.gz`) containing MLPerf-compatible results, which can be submitted to the organizers via a CLI command. + +## Scoring + +In the SCC, your first objective will be to get a valid MLPerf benchmark run. Traditionally, running the reference MLPerf inference implementation (in Python) is easier than running the Nvidia MLPerf inference implementation. Since SCC25 uses the Llama2-70b model, running the reference implementation needs around 600GB of VRAM and has been tested only on 8xH100 Nvidia GPUs. If you have less VRAM, trying a vendor implementation such as Nvidia's or AMD's is the best option. + +MLCommons provides [automation](https://github.com/mlcommons/mlperf-automations/) to run the MLPerf inference benchmarks. The automation currently supports the reference implementation as well as the Nvidia implementation, and it is a quick way to get a valid result because it produces the required final output. You can also follow the manual steps in the [reference](https://github.com/mlcommons/inference/tree/master/language/llama2-70b), [Nvidia](https://github.com/mlcommons/inference_results_v5.0/tree/main/closed/NVIDIA) or [AMD](https://github.com/mlcommons/inference_results_v5.0/tree/main/closed/AMD) implementation readmes. + +Once the initial run is successful, you'll have the opportunity to optimize the benchmark further by maximizing system utilization, applying quantization techniques, adjusting ML frameworks, experimenting with batch sizes, and more, all of which can earn you additional points. + +Since vendor implementations of the MLPerf inference benchmark vary, teams will compete within their respective hardware categories (e.g., Nvidia GPUs, AMD GPUs). Points will be awarded based on the throughput achieved on your system. + +Additionally, significant bonus points will be awarded if your team enhances an existing implementation, enables multi-node execution, or adds/extends scripts in the [mlperf-automations repository](https://github.com/mlcommons/mlperf-automations/tree/dev/script) to support new devices, frameworks, implementations, etc. All improvements must be made publicly available under the Apache 2.0 license and submitted as pull requests by November 10, 2025, and only code that is *merge ready* will be considered for evaluation. As a guideline, the examples below can earn you bonus points. + +* Adding multi-node execution support for the Nvidia, AMD or reference implementations +* Adding automation support for the AMD implementation +* Adding fp8/fp4 quantization support for the reference implementation +* Automating the [network reference implementation](https://github.com/mlcommons/inference/blob/master/language/llama2-70b/SUT_API.py) (this uses OpenAI-compatible endpoints) +* Adding Apptainer support to the MLPerf automation, which currently supports Docker runs of the Nvidia implementation + +PS: For any query regarding these contributions, feel free to raise an issue in the [Inference](https://github.com/mlcommons/inference) or [MLPerf automations](https://github.com/mlcommons/mlperf-automations) repositories. + +!!! info + Both MLPerf and MLC automation are evolving projects. 
+ If you encounter issues related to SCC, please submit them [here](https://github.com/mlcommons/inference/issues) with **scc-25** label + with proper information about the command used, error logs and any additional usefull information to debug the issue. + +## Artifacts to submit to the SCC committee + +You will need to submit the following files: + +* `mlperf_submission.run` - MLC commands to run MLPerf inference benchmark saved to this file. +* `mlperf_submission.md` - description of your platform and some highlights of the MLPerf benchmark execution. +* `` under which results are pushed to the github repository. + + +## SCC interview + +You are encouraged to highlight and explain the obtained MLPerf inference throughput on your system +and describe any improvements and extensions to this benchmark (such as adding new hardware backend +or supporting multi-node execution) useful for the community and [MLCommons](https://mlcommons.org). + +## Run Commands + +=== "MLCommons-Python" + ## MLPerf Reference Implementation in Python + +{{ mlperf_inference_implementation_readme (4, "llama2-70b-99", "reference", fixed_scenarios=["Offline"], categories=["Datacenter"], setup_tips=False, implementation_tips=False, skip_test_query_count=True) }} + +{{ mlperf_inference_implementation_readme (4, "llama2-70b-99.99", "reference", fixed_scenarios=["Offline"], categories=["Datacenter"], setup_tips=False, implementation_tips=False, skip_test_query_count=True) }} + +=== "Nvidia" + ## Nvidia MLPerf Implementation + +{{ mlperf_inference_implementation_readme (4, "llama2-70b-99", "nvidia", fixed_scenarios=["Offline"], categories=["Datacenter"], setup_tips=False, implementation_tips=False, skip_test_query_count=True) }} + +{{ mlperf_inference_implementation_readme (4, "llama2-70b-99.99", "nvidia", fixed_scenarios=["Offline"], categories=["Datacenter"], setup_tips=False, implementation_tips=False, skip_test_query_count=True) }} + +## Submission Commands + +### Generate actual submission tree + + +```bash +mlcr generate,inference,submission,_wg-inference \ + --clean \ + --run-checker \ + --tar=yes \ + --env.MLC_TAR_OUTFILE=submission.tar.gz \ + --division=open \ + --category=datacenter \ + --env.CM_DETERMINE_MEMORY_CONFIGURATION=yes \ + --quiet \ + --submitter= +``` + +* Use `--hw_name="My system name"` to give a meaningful system name. +* At the end, a **.tar** file would be generated inside the current working directory. + +### Submit Results + +> **Note:** +Further instructions on the final submission will be published as the deadline approaches. + + diff --git a/main.py b/main.py index 419b76a6da..37c70c9d07 100755 --- a/main.py +++ b/main.py @@ -28,6 +28,7 @@ def mlperf_inference_implementation_readme( content = "" execution_envs = ["Docker", "Native"] + run_modes = ["performance-only", "accuracy-only"] code_version = "r5.0-dev" implementation_run_options = [] @@ -186,6 +187,7 @@ def mlperf_inference_implementation_readme( cur_space2 = cur_space1 + " " cur_space3 = cur_space2 + " " cur_space4 = cur_space3 + " " + cur_space5 = cur_space4 + " " content += f"{cur_space1}=== \"{device}\"\n" content += f"{cur_space2}##### {device} device\n\n" @@ -305,6 +307,8 @@ def mlperf_inference_implementation_readme( if implementation.lower() == "nvidia": content += f"{cur_space3}* `--gpu_name=` : The GPUs with supported configs in MLC are `orin`, `rtx_4090`, `rtx_a6000`, `rtx_6000_ada`, `l4`, `t4`and `a100`. 
For other GPUs, default configuration as per the GPU memory will be used.\n" + if "llama2-70b" in model.lower(): + content += f"{cur_space3}* Add `--adr.llama2-model.tags=_pre-quantized` to use the Nvidia quantized models with the available in the MLC Storage. These models were quantized with three different configurations of tensor parallelism and pipeline parallelism: TP1–PP2, TP2–PP1, and TP1–PP1. The appropriate model will be automatically selected based on the values provided for `--tp_size` and `--pp_size` in run command. By default tp size of 2 and pp size of 1 would be used.\n" if device.lower() not in ["cuda"]: content += f"{cur_space3}* `--docker_os=ubuntu`: ubuntu and rhel are supported. \n" @@ -373,25 +377,27 @@ def mlperf_inference_implementation_readme( for scenario in scenarios: content += f"{cur_space3}=== \"{scenario}\"\n{cur_space4}###### {scenario}\n\n" - run_cmd = mlperf_inference_run_command( - spaces + 21, - model, - implementation, - framework.lower(), - category.lower(), - scenario, - device.lower(), - final_run_mode, - test_query_count, - False, - skip_test_query_count, - scenarios, - code_version, - extra_variation_tags, - extra_input_string, - ) - content += run_cmd - # content += run_suffix + for run_mode in run_modes: + content += f"{cur_space4}=== \"{run_mode}\"\n{cur_space5}###### {run_mode}\n\n" + run_cmd = mlperf_inference_run_command( + spaces + 25, + model, + implementation, + framework.lower(), + category.lower(), + scenario, + device.lower(), + final_run_mode, + test_query_count, + False, + skip_test_query_count, + scenarios, + code_version, + extra_variation_tags + f",_{run_mode}", + extra_input_string, + ) + content += run_cmd + # content += run_suffix if len(scenarios) > 1: content += f"{cur_space3}=== \"All Scenarios\"\n{cur_space4}###### All Scenarios\n\n" @@ -481,7 +487,7 @@ def get_min_system_requirements(spaces, model, implementation, device): ds = { "dlrm": "500GB", "pointpainting": "500GB", - "llama2-70b": "600GB", + "llama2-70b": "900GB", "llama3_1-405b": "2.3TB", "mixtral": "100GB", "retinanet": "200GB", @@ -498,7 +504,12 @@ def get_min_system_requirements(spaces, model, implementation, device): disk_space = ds[key] break + if "llama2" in model.lower(): + disk_space = f" 900GB for manual execution of {"reference" if implementation.lower() == "reference" else "vendor"} implementation and 1.5TB for automated run through MLC-Scripts" + + if implementation.lower() == "reference" or "llama2" in model.lower(): min_sys_req_content += f"{spaces}* **Disk Space**: {disk_space}\n\n" + # System memory if "dlrm" in model: system_memory = "512GB" diff --git a/mkdocs.yml b/mkdocs.yml index 730d7eff2d..ad1feae91a 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -29,7 +29,10 @@ nav: - Image Classification: - ResNet50: benchmarks/image_classification/resnet50.md - Text to Image: - - Stable Diffusion: benchmarks/text_to_image/sdxl.md + - Stable Diffusion: + - Run Commands: benchmarks/text_to_image/sdxl.md + - External Use: + - SCC24 Guide: benchmarks/text_to_image/reproducibility/scc24.md - 2D Object Detection: - RetinaNet: benchmarks/object_detection/retinanet.md - Automotive: @@ -40,7 +43,10 @@ nav: - Language Processing: - Bert-Large: benchmarks/language/bert.md - GPT-J: benchmarks/language/gpt-j.md - - LLAMA2-70B: benchmarks/language/llama2-70b.md + - LLAMA2-70B: + - Run Commands: benchmarks/language/llama2-70b.md + - External Use: + - SCC25 Guide: benchmarks/language/scc25_guide/scc25.md - LLAMA3-405B: benchmarks/language/llama3_1-405b.md - LLAMA3-8B: 
benchmarks/language/llama3_1-8b.md - MIXTRAL-8x7B: benchmarks/language/mixtral-8x7b.md From cba86287613416eb12748b2352b18e7a0767c11e Mon Sep 17 00:00:00 2001 From: ANANDHU S <71482562+anandhu-eng@users.noreply.github.com> Date: Wed, 10 Sep 2025 22:19:55 +0530 Subject: [PATCH 20/22] fix for fstring (#2332) --- main.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/main.py b/main.py index 37c70c9d07..4e0d6ee25f 100755 --- a/main.py +++ b/main.py @@ -505,7 +505,7 @@ def get_min_system_requirements(spaces, model, implementation, device): break if "llama2" in model.lower(): - disk_space = f" 900GB for manual execution of {"reference" if implementation.lower() == "reference" else "vendor"} implementation and 1.5TB for automated run through MLC-Scripts" + disk_space = f" 900GB for manual execution of {'reference' if implementation.lower() == 'reference' else 'vendor'} implementation and 1.5TB for automated run through MLC-Scripts" if implementation.lower() == "reference" or "llama2" in model.lower(): min_sys_req_content += f"{spaces}* **Disk Space**: {disk_space}\n\n" From a9aef732a2adbab12d21dffd6946bcb365a3f071 Mon Sep 17 00:00:00 2001 From: ANANDHU S <71482562+anandhu-eng@users.noreply.github.com> Date: Thu, 11 Sep 2025 13:49:43 +0530 Subject: [PATCH 21/22] Updation of automation run commands - v5.1_dev (#2333) * Updation of automation run commands - v5.1_dev * Update main.py * llama2 dataset download is handled through automation --- docs/benchmarks/language/scc25_guide/scc25.md | 2 +- main.py | 6 +----- 2 files changed, 2 insertions(+), 6 deletions(-) diff --git a/docs/benchmarks/language/scc25_guide/scc25.md b/docs/benchmarks/language/scc25_guide/scc25.md index 618cfcd74f..4b44fb065d 100644 --- a/docs/benchmarks/language/scc25_guide/scc25.md +++ b/docs/benchmarks/language/scc25_guide/scc25.md @@ -58,7 +58,7 @@ or supporting multi-node execution) useful for the community and [MLCommons](htt {{ mlperf_inference_implementation_readme (4, "llama2-70b-99", "reference", fixed_scenarios=["Offline"], categories=["Datacenter"], setup_tips=False, implementation_tips=False, skip_test_query_count=True) }} -{{ mlperf_inference_implementation_readme (4, "llama2-70b-99.99", "reference", fixed_scenarios=["Offline"], categories=["Datacenter"], setup_tips=False, implementation_tips=False, skip_test_query_count=True) }} +{{ mlperf_inference_implementation_readme (4, "llama2-70b-99.9", "reference", fixed_scenarios=["Offline"], categories=["Datacenter"], setup_tips=False, implementation_tips=False, skip_test_query_count=True) }} === "Nvidia" ## Nvidia MLPerf Implementation diff --git a/main.py b/main.py index 4e0d6ee25f..907f09af1a 100755 --- a/main.py +++ b/main.py @@ -68,7 +68,7 @@ def mlperf_inference_implementation_readme( elif implementation == "nvidia": if model in ["retinanet", "resnet50", - "3d-unet-99", "3d-unet-99.9"]: + "3d-unet-99", "3d-unet-99.9", "llama2-70b-99", "llama2-70b-99.9"]: code_version = "r5.1-dev" if model in ["mixtral-8x7b"]: return pre_space + " WIP" @@ -594,9 +594,6 @@ def get_docker_info(spaces, model, implementation, if implementation.lower() == "nvidia": info += f"{pre_space} - Default batch size is assigned based on [GPU memory](https://github.com/mlcommons/cm4mlops/blob/dd0c35856969c68945524d5c80414c615f5fe42c/script/app-mlperf-inference-nvidia/_cm.yaml#L1129) or the [specified GPU](https://github.com/mlcommons/cm4mlops/blob/dd0c35856969c68945524d5c80414c615f5fe42c/script/app-mlperf-inference-nvidia/_cm.yaml#L1370). 
Please click more option for *docker launch* or *run command* to see how to specify the GPU name.\n\n" info += f"{pre_space} - When run with `--all_models=yes`, all the benchmark models of NVIDIA implementation can be executed within the same container.\n\n" - if "llama2" in model.lower(): - info += f"{pre_space} - The dataset for NVIDIA's implementation of Llama2 is not publicly available. The user must fill [this](https://docs.google.com/forms/d/e/1FAIpQLSc_8VIvRmXM3I8KQaYnKf7gy27Z63BBoI_I1u02f4lw6rBp3g/viewform?pli=1&fbzx=-8842630989397184967) form and be verified as a MLCommons member to access the dataset.\n\n" - info += f"{pre_space} - `PATH_TO_PICKE_FILE` should be replaced with path to the downloaded pickle file.\n\n" else: if model == "sdxl": info += f"\n{pre_space}!!! tip\n\n" @@ -742,7 +739,6 @@ def mlperf_inference_run_command( if "llama2-70b" in model.lower(): if implementation == "nvidia": docker_cmd_suffix += f" \\\n{pre_space} --tp_size=2" - docker_cmd_suffix += f" \\\n{pre_space} --nvidia_llama2_dataset_file_path=" elif implementation == "neuralmagic": docker_cmd_suffix += ( f" \\\n{pre_space} --api_server=http://localhost:8000" From 083828d20f9ef36ee6237e15294439200351affb Mon Sep 17 00:00:00 2001 From: ANANDHU S <71482562+anandhu-eng@users.noreply.github.com> Date: Mon, 15 Sep 2025 00:31:18 +0530 Subject: [PATCH 22/22] Fixes for docs (#2334) --- docs/benchmarks/language/scc25_guide/scc25.md | 4 ---- main.py | 1 - 2 files changed, 5 deletions(-) diff --git a/docs/benchmarks/language/scc25_guide/scc25.md b/docs/benchmarks/language/scc25_guide/scc25.md index 4b44fb065d..69174e34a9 100644 --- a/docs/benchmarks/language/scc25_guide/scc25.md +++ b/docs/benchmarks/language/scc25_guide/scc25.md @@ -58,15 +58,11 @@ or supporting multi-node execution) useful for the community and [MLCommons](htt {{ mlperf_inference_implementation_readme (4, "llama2-70b-99", "reference", fixed_scenarios=["Offline"], categories=["Datacenter"], setup_tips=False, implementation_tips=False, skip_test_query_count=True) }} -{{ mlperf_inference_implementation_readme (4, "llama2-70b-99.9", "reference", fixed_scenarios=["Offline"], categories=["Datacenter"], setup_tips=False, implementation_tips=False, skip_test_query_count=True) }} - === "Nvidia" ## Nvidia MLPerf Implementation {{ mlperf_inference_implementation_readme (4, "llama2-70b-99", "nvidia", fixed_scenarios=["Offline"], categories=["Datacenter"], setup_tips=False, implementation_tips=False, skip_test_query_count=True) }} -{{ mlperf_inference_implementation_readme (4, "llama2-70b-99.99", "nvidia", fixed_scenarios=["Offline"], categories=["Datacenter"], setup_tips=False, implementation_tips=False, skip_test_query_count=True) }} - ## Submission Commands ### Generate actual submission tree diff --git a/main.py b/main.py index 907f09af1a..91f0223031 100755 --- a/main.py +++ b/main.py @@ -786,7 +786,6 @@ def mlperf_inference_run_command( if "llama2-70b" in model.lower(): if implementation == "nvidia": cmd_suffix += f" \\\n{pre_space} --tp_size=" - cmd_suffix += f" \\\n{pre_space} --nvidia_llama2_dataset_file_path=" elif implementation == "neuralmagic": cmd_suffix += f" \\\n{pre_space} --api_server=http://localhost:8000" cmd_suffix += f" \\\n{pre_space} --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8"