27 commits
d02b7cb
Add preprocessing documentation for DeepSeek-r1 and Llama3.1-8b
anivar Jul 20, 2025
47863bc
Fix llama3.1-8b metric and dataset (#2300)
pgmpablo157321 Jul 29, 2025
7c394ae
Add interactive scenario to llama3.1 models (#2299)
pgmpablo157321 Jul 29, 2025
b966d58
Allow more flexible datatypes in measurements file (#2298)
pgmpablo157321 Jul 29, 2025
1ba6f9e
Update evaluation.py (#2303)
taran2210 Jul 29, 2025
db2d63d
Only require Server or Interactive for closed (#2304)
pgmpablo157321 Jul 29, 2025
e3e030f
[Whisper] Adding n_token return for compliance fix (#2305)
keithachorn-intel Jul 30, 2025
e86ddca
Fix checking power directory (#2306)
anandhu-eng Jul 30, 2025
1ccb2b1
Only check for token latency requirements for server scenario (#2313)
pgmpablo157321 Jul 31, 2025
e41f484
Use server SUT for SingleStream (#2314)
pgmpablo157321 Aug 1, 2025
c6aa29a
Merge branch 'master' into fix/preprocessing-documentation
anivar Aug 2, 2025
28c2fde
Update the default value for repository arg (#2317)
anandhu-eng Aug 4, 2025
71da6c8
Update preprocess_submission.py | Skip inferring offline scenario if …
arjunsuresh Aug 6, 2025
62bebd7
Fix: add llama3.1-8b-edge to generate_final_report (#2319)
pgmpablo157321 Aug 6, 2025
e051a9f
Allow lowercase 'interactive' as scenario name (#2315)
psyhtest Aug 7, 2025
8583a96
Use sample latency as the metric for llama3.1_8b_edge SingleStream (#…
pgmpablo157321 Aug 20, 2025
1161b2d
Remove rclone references and update download instructions for DeepSee…
anivar Aug 20, 2025
aa8b5da
Fix preprocessing documentation with verified implementations
anivar Aug 21, 2025
12391fa
Merge branch 'master' into fix/preprocessing-documentation
arjunsuresh Aug 21, 2025
a2aaa91
hide long time untested implementations from docs (#2328)
anandhu-eng Sep 1, 2025
5f8019a
Initial draft for SCC 25 documentation (#2331)
anandhu-eng Sep 10, 2025
cba8628
fix for fstring (#2332)
anandhu-eng Sep 10, 2025
a9aef73
Updation of automation run commands - v5.1_dev (#2333)
anandhu-eng Sep 11, 2025
083828d
Fixes for docs (#2334)
anandhu-eng Sep 14, 2025
17aa77f
Merge branch 'master' into fix/preprocessing-documentation
anivar Sep 15, 2025
141f5b6
Merge branch 'master' into fix/preprocessing-documentation
anivar Sep 15, 2025
4d6ba8c
Merge branch 'master' into fix/preprocessing-documentation
hanyunfan Oct 13, 2025
127 changes: 127 additions & 0 deletions PREPROCESSING-TEMPLATE.md
@@ -0,0 +1,127 @@
# Dataset Preprocessing Documentation Template

## Purpose
This template provides a standardized way to document dataset preprocessing steps for MLCommons inference benchmarks, ensuring reproducibility and transparency.

## Template Structure

### Model: [MODEL_NAME]
**Dataset:** [DATASET_NAME]
**Evaluation Task:** [TASK_DESCRIPTION]

#### Data Source
- **Raw Dataset:** [SOURCE_AND_FORMAT]
- **Download Method:** [HOW_TO_OBTAIN]
- **License:** [LICENSE_INFO]

#### Preprocessing Pipeline

##### 1. Tokenization
```python
# Example based on llama2-70b/processorca.py pattern
from transformers import [TOKENIZER_CLASS]
tokenizer = [TOKENIZER_CLASS].from_pretrained(model_dir)
tokens = tokenizer(text)["input_ids"]
```

##### 2. Filtering Steps
- **Language Filter:** [DESCRIPTION]
- **Length Filter:** [SEQUENCE_LENGTH_LIMITS]
- **Quality Filter:** [QUALITY_CRITERIA]
- **Content Filter:** [CONTENT_RESTRICTIONS]

##### 3. Formatting
- **Input Format:** [INPUT_TEMPLATE]
- **Output Format:** [OUTPUT_TEMPLATE]
- **Special Tokens:** [SPECIAL_TOKEN_HANDLING]

##### 4. Sampling Strategy
- **Total Samples:** [NUMBER]
- **Sampling Method:** [RANDOM/STRATIFIED/OTHER]
- **Validation Split:** [IF_APPLICABLE]

#### Adaptation Guide
**For Different Tokenizers:**
- Modify tokenizer initialization
- Adjust sequence length limits
- Update special token handling

**For Different Models:**
- Update input/output templates
- Adjust filtering criteria
- Modify prompt formatting

#### Files Generated
- **Main Dataset:** [FILENAME_AND_FORMAT]
- **Calibration Set:** [FILENAME_AND_FORMAT]
- **Metadata:** [FILENAME_AND_FORMAT]

#### Verification
- **Expected Sample Count:** [NUMBER]
- **Checksum/Hash:** [IF_AVAILABLE]
- **Quality Metrics:** [ROUGE/BLEU/OTHER]

---

## Example Applications

### Llama3.1-8b (CNN/DailyMail)
**Dataset:** CNN/DailyMail 3.0.0
**Evaluation Task:** Text Summarization

#### Data Source
- **Raw Dataset:** Hugging Face `cnn_dailymail` dataset v3.0.0
- **Download Method:** `datasets.load_dataset("cnn_dailymail", "3.0.0")`
- **License:** Apache 2.0

#### Preprocessing Pipeline
##### 1. Tokenization
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 8000
```

##### 2. Formatting
- **Input Template:**
```
Summarize the following news article in 128 tokens. Please output the summary only, without any other text.

Article:
{article}

Summary:
```

##### 3. Current Gaps
- ❌ No documented filtering steps
- ❌ No sampling strategy explanation
- ❌ No quality control measures
- ❌ No reproducible preprocessing script

### DeepSeek-r1 (Multi-domain Evaluation)
**Dataset:** Ensemble of AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench
**Evaluation Task:** Multi-domain Reasoning

#### Data Source
- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2
- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/`
- **License:** Various (CC0, MIT, CC BY 4.0)

#### Current Gaps
- ❌ No documented preprocessing steps
- ❌ No tokenization details
- ❌ No filtering or sampling explanation
- ❌ No adaptation guide for other models
- ❌ Cannot reproduce from raw sources

---

## Implementation Recommendation

1. **For each model directory**, add `PREPROCESSING.md` following this template
2. **For models with preprocessing scripts**, document the steps in the README
3. **For models using preprocessed data**, provide original preprocessing methodology
4. **Create common utilities** for preprocessing patterns that can be shared across models
139 changes: 139 additions & 0 deletions language/PREPROCESSING_GUIDE.md
@@ -0,0 +1,139 @@
# MLCommons Inference - General Preprocessing Guide

## Overview

This guide covers common preprocessing patterns across all language models in MLCommons Inference benchmarks. Preprocessing varies by:
1. Model architecture
2. Backend choice (PyTorch, vLLM, SGLang)
3. Task type (summarization, Q&A, etc.)

## Common Tokenizer Setup Pattern

Most models follow this pattern:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left" # Critical for generation
tokenizer.pad_token = tokenizer.eos_token
```

## Backend Dependencies

Different backends have different preprocessing requirements:

| Backend | Input Type | Chat Template Support | Use Case |
|---------|------------|---------------------|----------|
| PyTorch | Tokenized | Varies by model | Distributed inference |
| vLLM | Text | Varies by model | High-throughput serving |
| SGLang | Text | Usually disabled | Optimized serving |
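
As a rough illustration, these per-backend settings can be captured in a small lookup table. The sketch below is hypothetical; the real definitions live in `utils/backend_registry.py`, and chat-template usage can also vary by model (the values shown are the DeepSeek-R1 ones documented later in this PR).

```python
# Hypothetical sketch of a backend registry; the structure and field names
# in utils/backend_registry.py may differ.
BACKEND_REGISTRY = {
    "pytorch": {"input_type": "tokenized", "uses_chat_template": True},
    "vllm": {"input_type": "text", "uses_chat_template": True},
    "sglang": {"input_type": "text", "uses_chat_template": False},
}


def backend_config(backend: str) -> dict:
    """Look up the preprocessing settings for a backend."""
    return BACKEND_REGISTRY[backend.lower()]
```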

## Dataset Format

All models expect datasets with these common fields:

```python
{
    'text_input': str,       # Raw prompt text (required)
    'tok_input': List[int],  # Pre-tokenized input (optional)
    'output': str,           # Expected output for evaluation
}
```
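
For example, such records can be assembled into the pickle file consumed by the runner scripts. The following is a minimal sketch assuming a pandas pickle named `data.pkl`; the exact columns and file layout expected by a given benchmark should be checked against its own dataset code.

```python
import pandas as pd
from transformers import AutoTokenizer

# Access to the gated meta-llama repository may be required.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

records = [
    {
        "text_input": "Summarize the following news article ...",
        "output": "Reference summary used for evaluation.",
    },
]

# Pre-tokenize so backends that expect token IDs can consume the file directly.
for record in records:
    record["tok_input"] = tokenizer.encode(record["text_input"])

pd.DataFrame(records).to_pickle("data.pkl")
```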

## Model-Specific Preprocessing

### Models Using Chat Templates
- **DeepSeek-R1**: Uses `apply_chat_template` with PyTorch/vLLM
- **Potential others**: Check `uses_chat_template` in backend registry

### Models Using Simple Templates
- **Llama 3.1-8B**: Instruction format for summarization
- **Llama 2-70B**: Custom format with `[INST]` markers
- **Mixtral-8x7B**: Simple instruction format

### Models Using Raw Prompts
- **GPT-J**: Completion-style, no special formatting

## Preprocessing Steps

1. **Load the tokenizer** with appropriate configuration
2. **Apply model-specific formatting** (chat template or instruction format)
3. **Tokenize** with proper truncation and max length
4. **Handle padding** (left-side for generation models)

## Example: Generic Preprocessing Function

```python
from transformers import AutoTokenizer


def preprocess_for_model(text, model_name, backend="pytorch"):
    """Generic preprocessing based on model and backend.

    Helpers such as should_use_chat_template(), apply_model_template() and
    get_max_length() are assumed to wrap the backend registry and the
    model-specific settings described above.
    """
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.padding_side = "left"
    tokenizer.pad_token = tokenizer.eos_token

    # Check if chat template should be used
    if should_use_chat_template(model_name, backend):
        tokens = tokenizer.apply_chat_template(
            [{"role": "user", "content": text}],
            add_generation_prompt=True,
            truncation=True,
            max_length=get_max_length(model_name),
        )
    else:
        # Apply model-specific template or use raw text
        formatted_text = apply_model_template(text, model_name)
        tokens = tokenizer.encode(
            formatted_text,
            truncation=True,
            max_length=get_max_length(model_name),
        )

    return tokens
```

## Max Context Lengths

| Model | Max Length | Notes |
|-------|------------|-------|
| DeepSeek-R1 | 32,768 | 32K context |
| Llama 3.1-8B | 8,000 | For preprocessing |
| Llama 2-70B | 1,024 | Limited context |
| Mixtral-8x7B | 1,024 | From dataset.py |
| GPT-J | ~2,048 | Standard GPT-J limit |

## Running Inference

```bash
# Set backend
export MLPERF_BACKEND=pytorch # or vllm, sglang

# PyTorch backend (distributed)
torchrun --nproc_per_node=8 run_eval_mpi.py --input-file data.pkl

# vLLM/SGLang backends
python run_eval.py --input-file data.pkl
```

## Common Issues

1. **Wrong padding side**: Always use `padding_side="left"` for generation
2. **Missing pad token**: Set `pad_token = eos_token`
3. **Backend mismatch**: Ensure preprocessing matches backend requirements
4. **Context overflow**: Respect model's maximum context length

## Validation

To ensure correct preprocessing (a minimal length check is sketched after this list):

1. Check tokenized length doesn't exceed max
2. Verify special tokens are properly placed
3. Test with a few examples before full dataset
4. Compare against reference outputs
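
A minimal sketch of check 1, assuming the `data.pkl` layout described above and an illustrative maximum length (substitute the limit of the model under test from the table above):

```python
import pandas as pd

MAX_LENGTH = 8000  # example value for Llama 3.1-8B

df = pd.read_pickle("data.pkl")
too_long = df["tok_input"].apply(len) > MAX_LENGTH
assert not too_long.any(), f"{too_long.sum()} samples exceed {MAX_LENGTH} tokens"
print(f"All {len(df)} samples fit within {MAX_LENGTH} tokens")
```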

## References

- Model-specific guides in each model's directory
- Backend configuration in `utils/backend_registry.py`
- Tokenization utilities in `utils/tokenization.py`
56 changes: 56 additions & 0 deletions language/deepseek-r1/PREPROCESSING.md
@@ -0,0 +1,56 @@
# DeepSeek-R1 Preprocessing

## Model Configuration
- **Model**: `deepseek-ai/DeepSeek-R1`
- **Revision**: `56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad`
- **Max Length**: 32,768 tokens (32K)

## Tokenization
```python
from transformers import AutoTokenizer

# From utils/tokenization.py
tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-R1",
    revision="56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad"
)
```

## Preprocessing Method

The preprocessing varies by backend:

### PyTorch/vLLM Backends (Chat Template Enabled)
```python
# From utils/tokenization.py
tokens = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    max_length=32768,
    truncation=True
)
```

### SGLang Backend (No Chat Template)
```python
tokens = tokenizer.encode(
    prompt,
    truncation=True,
    max_length=32768
)
```

## Backend Configuration
| Backend | uses_chat_template | input_type |
|---------|-------------------|------------|
| PyTorch | True | tokenized |
| vLLM | True | text |
| SGLang | False | text |

## Dataset Format
Input data should have a `text_input` column containing the prompts.
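
As a hedged sketch (the repository's own dataset loaders are authoritative), tokenizing such a file for the chat-template backends could look like this; the file name and pandas layout are assumptions:

```python
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-R1",
    revision="56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad"
)

df = pd.read_pickle("data.pkl")  # hypothetical file with a text_input column
df["tok_input"] = df["text_input"].apply(
    lambda prompt: tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        max_length=32768,
        truncation=True,
    )
)
```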

## Accuracy Target
```
"mean-accuracy": 81.3582
```
47 changes: 47 additions & 0 deletions language/llama3.1-8b/PREPROCESSING.md
@@ -0,0 +1,47 @@
# Llama 3.1 8B Preprocessing

## Model Configuration
- **Model**: `meta-llama/Llama-3.1-8B-Instruct`
- **Revision**: `be673f326cab4cd22ccfef76109faf68e41aa5f1` (for download)
- **Max Length**: 8,000 tokens (in preprocessing scripts)

## Tokenization
```python
from transformers import AutoTokenizer

# From prepare-calibration.py
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 8000
```

## Prompt Template (CNN/DailyMail Summarization)
```python
# From prepare-calibration.py and download_cnndm.py
instruction_template = "Summarize the following news article in 128 tokens. Please output the summary only, without any other text.\n\nArticle:\n{input}\n\nSummary:"

# Tokenize (x is a dict whose "input" key holds the article text; see Dataset Preparation below)
x["tok_input"] = tokenizer.encode(instruction_template.format_map(x))
```

**Note**: This uses a simple instruction format, NOT the chat template with special tokens.

## Dataset Preparation
```python
# Example from prepare-calibration.py
x = dict()
x["instruction"] = instruction_template
x["input"] = calibration_sample["article"]
x["tok_input"] = tokenizer.encode(instruction_template.format_map(x))
x["output"] = calibration_sample["highlights"]
```
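
Putting the pieces together, a hedged end-to-end sketch follows. The subset size and output file name are illustrative; the official sample selection and output format live in `prepare-calibration.py` and `download_cnndm.py`.

```python
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 8000

instruction_template = (
    "Summarize the following news article in 128 tokens. "
    "Please output the summary only, without any other text.\n\n"
    "Article:\n{input}\n\nSummary:"
)

dataset = load_dataset("cnn_dailymail", "3.0.0", split="validation")

records = []
for sample in dataset.select(range(10)):  # small illustrative subset
    x = {"instruction": instruction_template, "input": sample["article"]}
    x["tok_input"] = tokenizer.encode(instruction_template.format_map(x))
    x["output"] = sample["highlights"]
    records.append(x)

pd.DataFrame(records).to_pickle("cnn_dailymail_sample.pkl")
```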

## Accuracy Targets (BF16)
```
Datacenter:
- rouge1: 38.7792
- rouge2: 15.9075
- rougeL: 24.4957
- rougeLsum: 35.793
```