27 commits
d02b7cb
Add preprocessing documentation for DeepSeek-r1 and Llama3.1-8b
anivar Jul 20, 2025
47863bc
Fix llama3.1-8b metric and dataset (#2300)
pgmpablo157321 Jul 29, 2025
7c394ae
Add interactive scenario to llama3.1 models (#2299)
pgmpablo157321 Jul 29, 2025
b966d58
Allow more flexible datatypes in measurements file (#2298)
pgmpablo157321 Jul 29, 2025
1ba6f9e
Update evaluation.py (#2303)
taran2210 Jul 29, 2025
db2d63d
Only require Server or Interactive for closed (#2304)
pgmpablo157321 Jul 29, 2025
e3e030f
[Whisper] Adding n_token return for compliance fix (#2305)
keithachorn-intel Jul 30, 2025
e86ddca
Fix checking power directory (#2306)
anandhu-eng Jul 30, 2025
1ccb2b1
Only check for token latency requirements for server scenario (#2313)
pgmpablo157321 Jul 31, 2025
e41f484
Use server SUT for SingleStream (#2314)
pgmpablo157321 Aug 1, 2025
c6aa29a
Merge branch 'master' into fix/preprocessing-documentation
anivar Aug 2, 2025
28c2fde
Update the default value for repository arg (#2317)
anandhu-eng Aug 4, 2025
71da6c8
Update preprocess_submission.py | Skip inferring offline scenario if …
arjunsuresh Aug 6, 2025
62bebd7
Fix: add llama3.1-8b-edge to generate_final_report (#2319)
pgmpablo157321 Aug 6, 2025
e051a9f
Allow lowercase 'interactive' as scenario name (#2315)
psyhtest Aug 7, 2025
8583a96
Use sample latency as the metric for llama3.1_8b_edge SingleStream (#…
pgmpablo157321 Aug 20, 2025
1161b2d
Remove rclone references and update download instructions for DeepSee…
anivar Aug 20, 2025
aa8b5da
Fix preprocessing documentation with verified implementations
anivar Aug 21, 2025
12391fa
Merge branch 'master' into fix/preprocessing-documentation
arjunsuresh Aug 21, 2025
a2aaa91
hide long time untested implementations from docs (#2328)
anandhu-eng Sep 1, 2025
5f8019a
Initial draft for SCC 25 documentation (#2331)
anandhu-eng Sep 10, 2025
cba8628
fix for fstring (#2332)
anandhu-eng Sep 10, 2025
a9aef73
Updation of automation run commands - v5.1_dev (#2333)
anandhu-eng Sep 11, 2025
083828d
Fixes for docs (#2334)
anandhu-eng Sep 14, 2025
17aa77f
Merge branch 'master' into fix/preprocessing-documentation
anivar Sep 15, 2025
141f5b6
Merge branch 'master' into fix/preprocessing-documentation
anivar Sep 15, 2025
4d6ba8c
Merge branch 'master' into fix/preprocessing-documentation
hanyunfan Oct 13, 2025
127 changes: 127 additions & 0 deletions PREPROCESSING-TEMPLATE.md
@@ -0,0 +1,127 @@
# Dataset Preprocessing Documentation Template

## Purpose
This template provides a standardized way to document dataset preprocessing steps for MLCommons inference benchmarks, ensuring reproducibility and transparency.

## Template Structure

### Model: [MODEL_NAME]
**Dataset:** [DATASET_NAME]
**Evaluation Task:** [TASK_DESCRIPTION]

#### Data Source
- **Raw Dataset:** [SOURCE_AND_FORMAT]
- **Download Method:** [HOW_TO_OBTAIN]
- **License:** [LICENSE_INFO]

#### Preprocessing Pipeline

##### 1. Tokenization
```python
# Example based on llama2-70b/processorca.py pattern
from transformers import [TOKENIZER_CLASS]
tokenizer = [TOKENIZER_CLASS].from_pretrained(model_dir)
tokens = tokenizer(text)["input_ids"]
```

##### 2. Filtering Steps
- **Language Filter:** [DESCRIPTION]
- **Length Filter:** [SEQUENCE_LENGTH_LIMITS]
- **Quality Filter:** [QUALITY_CRITERIA]
- **Content Filter:** [CONTENT_RESTRICTIONS]

##### 3. Formatting
- **Input Format:** [INPUT_TEMPLATE]
- **Output Format:** [OUTPUT_TEMPLATE]
- **Special Tokens:** [SPECIAL_TOKEN_HANDLING]

##### 4. Sampling Strategy
- **Total Samples:** [NUMBER]
- **Sampling Method:** [RANDOM/STRATIFIED/OTHER]
- **Validation Split:** [IF_APPLICABLE]

#### Adaptation Guide
**For Different Tokenizers:**
- Modify tokenizer initialization
- Adjust sequence length limits
- Update special token handling

**For Different Models:**
- Update input/output templates
- Adjust filtering criteria
- Modify prompt formatting

#### Files Generated
- **Main Dataset:** [FILENAME_AND_FORMAT]
- **Calibration Set:** [FILENAME_AND_FORMAT]
- **Metadata:** [FILENAME_AND_FORMAT]

#### Verification
- **Expected Sample Count:** [NUMBER]
- **Checksum/Hash:** [IF_AVAILABLE]
- **Quality Metrics:** [ROUGE/BLEU/OTHER]

---

## Example Applications

### Llama3.1-8b (CNN/DailyMail)
**Dataset:** CNN/DailyMail 3.0.0
**Evaluation Task:** Text Summarization

#### Data Source
- **Raw Dataset:** Hugging Face `cnn_dailymail` dataset v3.0.0
- **Download Method:** `datasets.load_dataset("cnn_dailymail", "3.0.0")`
- **License:** Apache 2.0

#### Preprocessing Pipeline
##### 1. Tokenization
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 8000
```

##### 2. Formatting
- **Input Template:**
```
Summarize the following news article in 128 tokens. Please output the summary only, without any other text.

Article:
{article}

Summary:
```

##### 3. Current Gaps
- ❌ No documented filtering steps
- ❌ No sampling strategy explanation
- ❌ No quality control measures
- ❌ No reproducible preprocessing script

### DeepSeek-r1 (Multi-domain Evaluation)
**Dataset:** Ensemble of AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench
**Evaluation Task:** Multi-domain Reasoning

#### Data Source
- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2
- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/`
- **License:** Various (CC0, MIT, CC BY 4.0)

#### Current Gaps
- ❌ No documented preprocessing steps
- ❌ No tokenization details
- ❌ No filtering or sampling explanation
- ❌ No adaptation guide for other models
- ❌ Cannot reproduce from raw sources

---

## Implementation Recommendation

1. **For each model directory**, add `PREPROCESSING.md` following this template
2. **For models with preprocessing scripts**, document the steps in the README
3. **For models using preprocessed data**, provide original preprocessing methodology
4. **Create common utilities** for preprocessing patterns that can be shared across models
139 changes: 139 additions & 0 deletions language/PREPROCESSING_GUIDE.md
@@ -0,0 +1,139 @@
# MLCommons Inference - General Preprocessing Guide

## Overview

This guide covers common preprocessing patterns across all language models in MLCommons Inference benchmarks. Preprocessing varies by:
1. Model architecture
2. Backend choice (PyTorch, vLLM, SGLang)
3. Task type (summarization, Q&A, etc.)

## Common Tokenizer Setup Pattern

Most models follow this pattern:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left" # Critical for generation
tokenizer.pad_token = tokenizer.eos_token
```

## Backend Dependencies

Different backends have different preprocessing requirements:

| Backend | Input Type | Chat Template Support | Use Case |
|---------|------------|---------------------|----------|
| PyTorch | Tokenized | Varies by model | Distributed inference |
| vLLM | Text | Varies by model | High-throughput serving |
| SGLang | Text | Usually disabled | Optimized serving |
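
As a rough illustration, these per-backend settings can be captured in a small lookup table. The sketch below is hypothetical; the real definitions live in `utils/backend_registry.py`, and chat-template usage can also vary by model (the values shown are the DeepSeek-R1 ones documented later in this PR).

```python
# Hypothetical sketch of a backend registry; the structure and field names
# in utils/backend_registry.py may differ.
BACKEND_REGISTRY = {
    "pytorch": {"input_type": "tokenized", "uses_chat_template": True},
    "vllm": {"input_type": "text", "uses_chat_template": True},
    "sglang": {"input_type": "text", "uses_chat_template": False},
}


def backend_config(backend: str) -> dict:
    """Look up the preprocessing settings for a backend."""
    return BACKEND_REGISTRY[backend.lower()]
```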

## Dataset Format

All models expect datasets with these common fields:

```python
{
    'text_input': str,       # Raw prompt text (required)
    'tok_input': List[int],  # Pre-tokenized input (optional)
    'output': str,           # Expected output for evaluation
}
```
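
For example, such records can be assembled into the pickle file consumed by the runner scripts. The following is a minimal sketch assuming a pandas pickle named `data.pkl`; the exact columns and file layout expected by a given benchmark should be checked against its own dataset code.

```python
import pandas as pd
from transformers import AutoTokenizer

# Access to the gated meta-llama repository may be required.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

records = [
    {
        "text_input": "Summarize the following news article ...",
        "output": "Reference summary used for evaluation.",
    },
]

# Pre-tokenize so backends that expect token IDs can consume the file directly.
for record in records:
    record["tok_input"] = tokenizer.encode(record["text_input"])

pd.DataFrame(records).to_pickle("data.pkl")
```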

## Model-Specific Preprocessing

### Models Using Chat Templates
- **DeepSeek-R1**: Uses `apply_chat_template` with PyTorch/vLLM
- **Potential others**: Check `uses_chat_template` in backend registry

### Models Using Simple Templates
- **Llama 3.1-8B**: Instruction format for summarization
- **Llama 2-70B**: Custom format with `[INST]` markers
- **Mixtral-8x7B**: Simple instruction format

### Models Using Raw Prompts
- **GPT-J**: Completion-style, no special formatting

## Preprocessing Steps

1. **Load the tokenizer** with appropriate configuration
2. **Apply model-specific formatting** (chat template or instruction format)
3. **Tokenize** with proper truncation and max length
4. **Handle padding** (left-side for generation models)

## Example: Generic Preprocessing Function

```python
from transformers import AutoTokenizer


def preprocess_for_model(text, model_name, backend="pytorch"):
    """Generic preprocessing based on model and backend.

    Helpers such as should_use_chat_template(), apply_model_template() and
    get_max_length() are assumed to wrap the backend registry and the
    model-specific settings described above.
    """
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.padding_side = "left"
    tokenizer.pad_token = tokenizer.eos_token

    # Check if chat template should be used
    if should_use_chat_template(model_name, backend):
        tokens = tokenizer.apply_chat_template(
            [{"role": "user", "content": text}],
            add_generation_prompt=True,
            truncation=True,
            max_length=get_max_length(model_name),
        )
    else:
        # Apply model-specific template or use raw text
        formatted_text = apply_model_template(text, model_name)
        tokens = tokenizer.encode(
            formatted_text,
            truncation=True,
            max_length=get_max_length(model_name),
        )

    return tokens
```

## Max Context Lengths

| Model | Max Length | Notes |
|-------|------------|-------|
| DeepSeek-R1 | 32,768 | 32K context |
| Llama 3.1-8B | 8,000 | For preprocessing |
| Llama 2-70B | 1,024 | Limited context |
| Mixtral-8x7B | 1,024 | From dataset.py |
| GPT-J | ~2,048 | Standard GPT-J limit |

## Running Inference

```bash
# Set backend
export MLPERF_BACKEND=pytorch # or vllm, sglang

# PyTorch backend (distributed)
torchrun --nproc_per_node=8 run_eval_mpi.py --input-file data.pkl

# vLLM/SGLang backends
python run_eval.py --input-file data.pkl
```

## Common Issues

1. **Wrong padding side**: Always use `padding_side="left"` for generation
2. **Missing pad token**: Set `pad_token = eos_token`
3. **Backend mismatch**: Ensure preprocessing matches backend requirements
4. **Context overflow**: Respect model's maximum context length

## Validation

To ensure correct preprocessing (a minimal length check is sketched after this list):

1. Check tokenized length doesn't exceed max
2. Verify special tokens are properly placed
3. Test with a few examples before full dataset
4. Compare against reference outputs
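
A minimal sketch of check 1, assuming the `data.pkl` layout described above and an illustrative maximum length (substitute the limit of the model under test from the table above):

```python
import pandas as pd

MAX_LENGTH = 8000  # example value for Llama 3.1-8B

df = pd.read_pickle("data.pkl")
too_long = df["tok_input"].apply(len) > MAX_LENGTH
assert not too_long.any(), f"{too_long.sum()} samples exceed {MAX_LENGTH} tokens"
print(f"All {len(df)} samples fit within {MAX_LENGTH} tokens")
```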

## References

- Model-specific guides in each model's directory
- Backend configuration in `utils/backend_registry.py`
- Tokenization utilities in `utils/tokenization.py`
56 changes: 56 additions & 0 deletions language/deepseek-r1/PREPROCESSING.md
@@ -0,0 +1,56 @@
# DeepSeek-R1 Preprocessing

## Model Configuration
- **Model**: `deepseek-ai/DeepSeek-R1`
- **Revision**: `56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad`
- **Max Length**: 32,768 tokens (32K)

## Tokenization
```python
from transformers import AutoTokenizer

# From utils/tokenization.py
tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-R1",
    revision="56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad"
)
```

## Preprocessing Method

The preprocessing varies by backend:

### PyTorch/vLLM Backends (Chat Template Enabled)
```python
# From utils/tokenization.py
tokens = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    max_length=32768,
    truncation=True
)
```

### SGLang Backend (No Chat Template)
```python
tokens = tokenizer.encode(
    prompt,
    truncation=True,
    max_length=32768
)
```

## Backend Configuration
| Backend | uses_chat_template | input_type |
|---------|-------------------|------------|
| PyTorch | True | tokenized |
| vLLM | True | text |
| SGLang | False | text |

## Dataset Format
Input data should have a `text_input` column containing the prompts.
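
As a hedged sketch (the repository's own dataset loaders are authoritative), tokenizing such a file for the chat-template backends could look like this; the file name and pandas layout are assumptions:

```python
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-R1",
    revision="56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad"
)

df = pd.read_pickle("data.pkl")  # hypothetical file with a text_input column
df["tok_input"] = df["text_input"].apply(
    lambda prompt: tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        max_length=32768,
        truncation=True,
    )
)
```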

## Accuracy Target
```
"mean-accuracy": 81.3582
```
47 changes: 47 additions & 0 deletions language/llama3.1-8b/PREPROCESSING.md
@@ -0,0 +1,47 @@
# Llama 3.1 8B Preprocessing

## Model Configuration
- **Model**: `meta-llama/Llama-3.1-8B-Instruct`
- **Revision**: `be673f326cab4cd22ccfef76109faf68e41aa5f1` (for download)
- **Max Length**: 8,000 tokens (in preprocessing scripts)

## Tokenization
```python
from transformers import AutoTokenizer

# From prepare-calibration.py
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 8000
```

## Prompt Template (CNN/DailyMail Summarization)
```python
# From prepare-calibration.py and download_cnndm.py
instruction_template = "Summarize the following news article in 128 tokens. Please output the summary only, without any other text.\n\nArticle:\n{input}\n\nSummary:"

# Tokenize (x is a dict whose "input" key holds the article text; see Dataset Preparation below)
x["tok_input"] = tokenizer.encode(instruction_template.format_map(x))
```

**Note**: This uses a simple instruction format, NOT the chat template with special tokens.

## Dataset Preparation
```python
# Example from prepare-calibration.py
x = dict()
x["instruction"] = instruction_template
x["input"] = calibration_sample["article"]
x["tok_input"] = tokenizer.encode(instruction_template.format_map(x))
x["output"] = calibration_sample["highlights"]
```
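
Putting the pieces together, a hedged end-to-end sketch follows. The subset size and output file name are illustrative; the official sample selection and output format live in `prepare-calibration.py` and `download_cnndm.py`.

```python
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 8000

instruction_template = (
    "Summarize the following news article in 128 tokens. "
    "Please output the summary only, without any other text.\n\n"
    "Article:\n{input}\n\nSummary:"
)

dataset = load_dataset("cnn_dailymail", "3.0.0", split="validation")

records = []
for sample in dataset.select(range(10)):  # small illustrative subset
    x = {"instruction": instruction_template, "input": sample["article"]}
    x["tok_input"] = tokenizer.encode(instruction_template.format_map(x))
    x["output"] = sample["highlights"]
    records.append(x)

pd.DataFrame(records).to_pickle("cnn_dailymail_sample.pkl")
```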

## Accuracy Targets (BF16)
```
Datacenter:
- rouge1: 38.7792
- rouge2: 15.9075
- rougeL: 24.4957
- rougeLsum: 35.793
```