188 changes: 188 additions & 0 deletions .agents/skills/add-dataset/SKILL.md
@@ -0,0 +1,188 @@
---
name: add-dataset
description: Guide for adding a new dataset to veRL. Use when user wants to preprocess or integrate a new RL training dataset.
---

# Add Dataset

Add a new dataset for RL training in veRL.

## When to Use

This skill is triggered when:

- User asks "how do I add a dataset?"
- User wants to preprocess a new dataset for GRPO/PPO training
- User mentions creating a data preprocessing script

## Overview

veRL datasets follow a two-step pattern:

1. **Preprocessing script** (`examples/data_preprocess/<name>.py`): run once offline to
   convert raw data into parquet files with a fixed schema
2. **`RLDataset`** (`verl/utils/dataset/rl_dataset.py`): the runtime dataset class that
   reads the parquet files; you usually do NOT need to modify this

## Required Schema

Every preprocessed sample must contain these fields:

```python
{
    "data_source": str,         # Identifies the reward function to use, e.g. "gsm8k"
    "prompt": list[dict],       # Chat messages, e.g. [{"role": "user", "content": "..."}]
    "ability": str,             # Task category, e.g. "math", "coding", "qa"
    "reward_model": {
        "style": "rule",        # "rule" for compute_score, "model" for reward model
        "ground_truth": str,    # Ground truth answer (used by compute_score)
    },
    "extra_info": {
        "split": str,           # "train" or "test"
        "index": int,           # Sample index for reproducibility
    },
}
```
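
A quick way to catch schema problems before training is to check each processed row against the field list above. The sketch below is a minimal, hypothetical helper (`validate_sample` is not part of veRL) and assumes the schema exactly as listed in this section:

```python
from typing import Any

# Required top-level fields and their expected Python types, per the schema above.
REQUIRED_FIELDS: dict[str, type] = {
    "data_source": str,
    "prompt": list,
    "ability": str,
    "reward_model": dict,
    "extra_info": dict,
}


def validate_sample(sample: dict[str, Any]) -> list[str]:
    """Return a list of human-readable schema problems (empty if none found)."""
    problems: list[str] = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in sample:
            problems.append(f"missing field: {field}")
        elif not isinstance(sample[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    if problems:
        return problems

    # Each prompt entry must be a chat message with a role and content.
    for i, message in enumerate(sample["prompt"]):
        if not isinstance(message, dict) or not {"role", "content"}.issubset(message):
            problems.append(f"prompt[{i}] is not a chat message dict")

    if "ground_truth" not in sample["reward_model"]:
        problems.append("missing field: reward_model.ground_truth")
    for key in ("split", "index"):
        if key not in sample["extra_info"]:
            problems.append(f"missing field: extra_info.{key}")
    return problems


good = {
    "data_source": "gsm8k",
    "prompt": [{"role": "user", "content": "What is 2 + 2?"}],
    "ability": "math",
    "reward_model": {"style": "rule", "ground_truth": "4"},
    "extra_info": {"split": "train", "index": 0},
}
bad = dict(good, prompt="What is 2 + 2?")  # raw string prompt, a common mistake

print(validate_sample(good))
print(validate_sample(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a row in a single pass.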

## Step-by-Step Guide

### Step 1: Create Preprocessing Script

Create `examples/data_preprocess/<name>.py`:

```python
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License")

import argparse
import os

import datasets
from verl.utils.hdfs_io import copy, makedirs

data_source = "<name>"  # Must match key in default_compute_score

SYSTEM_PROMPT = "You are a helpful assistant."  # Optional


def make_map_fn(split: str):
    def process_fn(example: dict, idx: int) -> dict:
        # Build the prompt as a chat template
        question = example["question"]  # adapt to your dataset fields
        answer = str(example["answer"])

        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ]

        return {
            "data_source": data_source,
            "prompt": messages,
            "ability": "math",  # change as appropriate
            "reward_model": {
                "style": "rule",
                "ground_truth": answer,
            },
            "extra_info": {
                "split": split,
                "index": idx,
            },
        }

    return process_fn


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_dir", default="~/data/<name>")
    parser.add_argument("--hdfs_dir", default=None)
    args = parser.parse_args()

    local_dir = os.path.expanduser(args.local_dir)
    os.makedirs(local_dir, exist_ok=True)

    # Load from HuggingFace hub or local path
    dataset = datasets.load_dataset("<hf_dataset_id>")

    for split in ["train", "test"]:
        if split not in dataset:
            continue
        processed = dataset[split].map(
            function=make_map_fn(split),
            with_indices=True,
            remove_columns=dataset[split].column_names,
        )
        output_path = os.path.join(local_dir, f"{split}.parquet")
        processed.to_parquet(output_path)
        print(f"Saved {len(processed)} samples to {output_path}")

    if args.hdfs_dir:
        makedirs(args.hdfs_dir)
        copy(src=local_dir, dst=args.hdfs_dir)
```
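
Before running the full script, it can help to dry-run the mapping function on a couple of in-memory rows. The sketch below re-declares a trimmed `make_map_fn` and emulates `map(..., with_indices=True)` with `enumerate`; the toy rows and the `"toy_math"` name are made up for illustration.

```python
data_source = "toy_math"  # hypothetical name, for illustration only


def make_map_fn(split: str):
    def process_fn(example: dict, idx: int) -> dict:
        return {
            "data_source": data_source,
            "prompt": [{"role": "user", "content": example["question"]}],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": str(example["answer"])},
            "extra_info": {"split": split, "index": idx},
        }

    return process_fn


# Toy rows standing in for a raw dataset split.
raw_rows = [
    {"question": "What is 2 + 3?", "answer": 5},
    {"question": "What is 7 * 6?", "answer": 42},
]

# enumerate() plays the role of with_indices=True: each call gets (row, index).
process_fn = make_map_fn("train")
processed = [process_fn(row, idx) for idx, row in enumerate(raw_rows)]

for row in processed:
    print(row["extra_info"]["index"], row["reward_model"]["ground_truth"])
```

Note that the integer answers come out as strings, which is what the `str(...)` conversion in the script is for.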

### Step 2: Run Preprocessing

```bash
python examples/data_preprocess/<name>.py --local_dir ~/data/<name>
```

### Step 3: Update Training Config

Point the trainer to your new dataset:

```bash
# In your run script or YAML config:
data.train_files=~/data/<name>/train.parquet
data.val_files=~/data/<name>/test.parquet
data.train_data_sources=<name> # optional, for filtering
```
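
If you launch training from Python rather than a shell script, the same overrides can be assembled programmatically. This is only a sketch: `dataset_overrides` is a hypothetical helper, and it assumes the trainer accepts `key=value` overrides using the `data.train_files` and `data.val_files` keys shown above.

```python
import os


def dataset_overrides(name: str, root: str = "~/data") -> list[str]:
    """Build key=value overrides pointing at <root>/<name>/{train,test}.parquet."""
    base = os.path.join(os.path.expanduser(root), name)
    return [
        f"data.train_files={os.path.join(base, 'train.parquet')}",
        f"data.val_files={os.path.join(base, 'test.parquet')}",
    ]


print(dataset_overrides("my_dataset", root="/tmp/data"))
```

Expanding `~` in Python avoids relying on the shell to do it when the overrides are passed as a list of arguments.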

### Step 4: Register Reward Function

Make sure `verl/utils/reward_score/__init__.py` handles your `data_source`:

```python
elif data_source == "<name>":
    from verl.utils.reward_score.<name> import compute_score
    return compute_score(solution_str, ground_truth)
```

See the `/add-reward` skill for details.

## Reference Implementations

| Dataset | File | Notes |
| ------------ | --------------------------------------------- | ------------------------------ |
| GSM8K | `examples/data_preprocess/gsm8k.py` | Math QA baseline |
| MATH | `examples/data_preprocess/math_dataset.py` | Competition math |
| Geo3K | `examples/data_preprocess/geo3k.py` | Geometry |
| Multi-turn | `verl/utils/dataset/multiturn_sft_dataset.py` | SFT with multi-turn dialogues |

## Key Requirements

1. **Fixed schema**: All required fields must be present (see above)
2. **Parquet format**: veRL's `RLDataset` expects `.parquet` files
3. **data_source matches**: Must align with your reward function's dispatch key
4. **Prompt as chat messages**: Use list-of-dicts format, not a raw string

## Common Mistakes

- ❌ Missing `reward_model.ground_truth` field (reward manager will crash)
- ❌ Prompt as a raw string instead of chat message list
- ❌ `data_source` mismatch with reward function
- ❌ Forgetting to remove original dataset columns after `map()`

<!--
================================================================================
MAINTAINER GUIDE
================================================================================
Location: .agents/skills/add-dataset/SKILL.md

## How to Update
- When RLDataset schema changes: update Required Schema section
- When new reference datasets added: update table
================================================================================
-->
198 changes: 198 additions & 0 deletions .agents/skills/add-reward/SKILL.md
@@ -0,0 +1,198 @@
---
name: add-reward
description: Guide for adding a new reward function to veRL. Use when user wants to create a reward (compute_score) function.
---

# Add Reward

Add a new reward function to veRL.

## When to Use

This skill is triggered when:

- User asks "how do I add a reward function?"
- User wants to implement custom reward scoring
- User mentions `compute_score` or reward verification

## Overview

veRL separates reward functions into two layers:

1. **`compute_score` function** (`verl/utils/reward_score/<name>.py`): pure Python,
   takes decoded strings and returns a float score
2. **`RewardManager`** (`verl/workers/reward_manager/`): wraps `compute_score` and handles
   batching, decoding, and the `DataProto` interface

For most use cases, you only need to implement a `compute_score` function and register
it. A custom `RewardManager` is only needed for advanced use cases (e.g., remote reward
models, PRIME).

## Step-by-Step Guide

### Step 1: Create the compute_score Function

Create `verl/utils/reward_score/<name>.py`:

```python
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0

import re
from typing import Any


def compute_score(solution_str: str, ground_truth: Any) -> float:
    """Compute reward score for a single completion.

    Args:
        solution_str: Decoded model output string (prompt + completion).
        ground_truth: Ground truth answer from the dataset.

    Returns:
        Float score, typically in [0.0, 1.0].
    """
    try:
        answer = _extract_answer(solution_str)
        if answer is not None and _is_correct(answer, str(ground_truth)):
            return 1.0
        return 0.0
    except Exception:
        return 0.0


def _extract_answer(solution_str: str) -> str | None:
    """Extract answer from model output. Customize this logic."""
    # Example: extract content from \boxed{}
    match = re.search(r"\\boxed\{([^}]+)\}", solution_str)
    if match:
        return match.group(1).strip()
    return None


def _is_correct(predicted: str, ground_truth: str) -> bool:
    """Check if the predicted answer matches ground truth."""
    return predicted.strip() == ground_truth.strip()
```
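
One limitation of the `[^}]+` pattern above is that it stops at the first closing brace, so an answer such as `\boxed{\frac{1}{2}}` is cut short. A possible alternative, sketched here with a hypothetical helper `extract_last_boxed`, counts braces instead and returns the last boxed expression in the string:

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent.

    Braces are counted so nested groups such as \\frac{1}{2} stay intact.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None

    depth = 1
    content_start = start + len(marker)
    for pos in range(content_start, len(text)):
        char = text[pos]
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:pos].strip()
    return None  # unbalanced braces: treat as no answer


print(extract_last_boxed(r"So the answer is \boxed{\frac{1}{2}}."))
print(extract_last_boxed(r"First \boxed{3}, then corrected to \boxed{4}"))
print(extract_last_boxed("no final answer here"))
```

Taking the last occurrence is a design choice: it favors the model's final answer when an earlier `\boxed{}` appears in the text.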

### Step 2: Register in default_compute_score

Update `verl/utils/reward_score/__init__.py` to include your new data source:

```python
from verl.utils.reward_score.<name> import compute_score as <name>_compute_score

def default_compute_score(data_source, solution_str, ground_truth, extra_info=None):
    # ... existing cases ...
    elif data_source == "<your_dataset_name>":
        return <name>_compute_score(solution_str, ground_truth)
    else:
        raise NotImplementedError(f"Unknown data_source: {data_source}")
```
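
The dispatch pattern can be seen in isolation in the following self-contained miniature. It is a stand-in for illustration, not veRL's actual `default_compute_score`; the two scorer functions and the `"toy_exact"` and `"toy_numeric"` keys are invented.

```python
def exact_match_score(solution_str: str, ground_truth: str) -> float:
    """Score 1.0 when the stripped strings are identical."""
    return 1.0 if solution_str.strip() == ground_truth.strip() else 0.0


def numeric_score(solution_str: str, ground_truth: str) -> float:
    """Score 1.0 when both strings parse to the same number."""
    try:
        return 1.0 if float(solution_str) == float(ground_truth) else 0.0
    except ValueError:
        return 0.0


def toy_default_compute_score(data_source, solution_str, ground_truth, extra_info=None):
    # Route each sample to its scorer based on the data_source string.
    if data_source == "toy_exact":
        return exact_match_score(solution_str, ground_truth)
    elif data_source == "toy_numeric":
        return numeric_score(solution_str, ground_truth)
    else:
        raise NotImplementedError(f"Unknown data_source: {data_source}")


print(toy_default_compute_score("toy_numeric", "4.0", "4"))
print(toy_default_compute_score("toy_exact", "4.0", "4"))
```

The same inputs score differently under the two keys, which is why the `data_source` string written by the preprocessing script has to match the branch you register.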

### Step 3: Set data_source in Dataset Preprocessing

In your data preprocessing script, set the `data_source` field to match:

```python
# In data_preprocess/<name>.py
data_source = "<your_dataset_name>"

def make_map_fn(split):
    def process_fn(example, idx):
        return {
            "data_source": data_source,
            "prompt": [...],  # list of chat messages
            "ability": "math",  # task category
            "reward_model": {
                "style": "rule",
                "ground_truth": example["answer"],
            },
            "extra_info": {...},
        }
    return process_fn
```

### Step 4: Wire into Training Config

In your training script (e.g., `examples/grpo_trainer/run_qwen2-7b_math.sh`):

```bash
# Use NaiveRewardManager (default) with your compute_score
reward_model.reward_manager=naive
```

Or in the Python trainer config:

```python
# Trainer will call NaiveRewardManager which calls default_compute_score
# which dispatches to your function based on data_source
```

### Step 5 (Optional): Custom RewardManager

Only needed if `NaiveRewardManager` is insufficient (e.g., remote reward model,
process-level rewards like PRIME). Subclass `AbstractRewardManager`:

```python
from verl.workers.reward_manager import register
from verl.workers.reward_manager.abstract import AbstractRewardManager
from verl import DataProto
import torch


@register("<name>")
class MyRewardManager(AbstractRewardManager):
    def __init__(self, tokenizer, num_examine, compute_score=None, reward_fn_key="data_source", **kwargs):
        self.tokenizer = tokenizer
        self.num_examine = num_examine
        self.compute_score = compute_score

    def __call__(self, data: DataProto, return_dict: bool = False) -> torch.Tensor:
        # Decode responses, call compute_score, return reward tensor
        ...
```

Then reference it in config: `reward_model.reward_manager=<name>`
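
To make the comment in `__call__` more concrete, here is a torch-free sketch of the loop such a manager might run. It assumes, as a common convention in token-level RL training rather than something stated in this guide, that each sample's scalar score is written at the position of its last response token while all other positions stay zero. `ToyRewardManager` and the batch layout (a list of dicts holding already-decoded text) are hypothetical.

```python
from typing import Callable


class ToyRewardManager:
    """Illustrative stand-in: plain lists instead of DataProto and torch tensors."""

    def __init__(self, compute_score: Callable[..., float], reward_fn_key: str = "data_source"):
        self.compute_score = compute_score
        self.reward_fn_key = reward_fn_key

    def __call__(self, batch: list[dict], max_response_len: int) -> list[list[float]]:
        rewards = [[0.0] * max_response_len for _ in batch]
        for i, item in enumerate(batch):
            score = self.compute_score(
                data_source=item[self.reward_fn_key],
                solution_str=item["response_text"],
                ground_truth=item["ground_truth"],
            )
            # Place the scalar score on the last response token of this sample.
            last = item["response_len"] - 1
            rewards[i][last] = float(score)
        return rewards


def exact_match(data_source: str, solution_str: str, ground_truth: str) -> float:
    return 1.0 if solution_str.strip() == ground_truth.strip() else 0.0


batch = [
    {"data_source": "toy", "response_text": "42", "ground_truth": "42", "response_len": 3},
    {"data_source": "toy", "response_text": "41", "ground_truth": "42", "response_len": 2},
]
manager = ToyRewardManager(compute_score=exact_match)
print(manager(batch, max_response_len=4))
```

A real manager would decode token IDs with the tokenizer and return a tensor; the list-of-lists here only mirrors the shape of that result.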

## Reference Implementations

| Reward | File | Description |
| ------------- | ---------------------------------------- | ----------------------------- |
| GSM8K | `verl/utils/reward_score/gsm8k.py` | Math answer extraction |
| Math general | `verl/utils/reward_score/math_reward.py` | LaTeX boxed answer matching |
| Geo3K | `verl/utils/reward_score/geo3k.py` | Geometry answer verification |
| PRIME | `verl/workers/reward_manager/prime.py` | Process reward model |
| DAPO | `verl/workers/reward_manager/dapo.py` | DAPO-style reward shaping |

## Key Requirements

1. **Return float**: `compute_score` returns a single float per sample
2. **No side effects**: Function must be deterministic and stateless
3. **Handle exceptions**: Return `0.0` on error, do not raise
4. **data_source matches**: The string in dataset must match the dispatch key in `__init__.py`

## Common Mistakes

- ❌ Raising exceptions inside `compute_score` (causes worker crash)
- ❌ `data_source` mismatch between dataset and `default_compute_score`
- ❌ Returning a tensor instead of a float
- ❌ Assuming solution_str contains only the completion (it includes the prompt too)
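
The requirements above lend themselves to a quick smoke test before a new scorer is used in training. The helper below is a hypothetical sketch (`check_compute_score_contract` is not a veRL utility) that feeds a scorer a few awkward inputs and verifies that it returns a plain `float` without raising:

```python
from typing import Any, Callable

# Inputs chosen to trip up fragile parsing: empty, unbalanced, and non-string truth.
AWKWARD_CASES: list[tuple[str, Any]] = [
    ("", "4"),
    ("\\boxed{", "4"),
    ("The answer is \\boxed{4}", 4),
    ("no answer at all", None),
]


def check_compute_score_contract(compute_score: Callable[[str, Any], float]) -> list[str]:
    """Return descriptions of contract violations (empty list means all cases passed)."""
    failures: list[str] = []
    for solution_str, ground_truth in AWKWARD_CASES:
        try:
            score = compute_score(solution_str, ground_truth)
        except Exception as exc:  # the scorer should return 0.0 instead of raising
            failures.append(f"raised {type(exc).__name__} on {solution_str!r}")
            continue
        if type(score) is not float:
            failures.append(f"returned {type(score).__name__} on {solution_str!r}")
    return failures


def fragile_score(solution_str: str, ground_truth: Any) -> float:
    # Deliberately fragile: int() raises on non-numeric text.
    return 1.0 if int(solution_str) == int(ground_truth) else 0.0


def safe_score(solution_str: str, ground_truth: Any) -> float:
    try:
        return fragile_score(solution_str, ground_truth)
    except (TypeError, ValueError):
        return 0.0


print(check_compute_score_contract(safe_score))
print(len(check_compute_score_contract(fragile_score)))
```

The strict `type(score) is not float` check is intentional: it also flags integers, booleans, and tensor-like objects that would pass a looser numeric test.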

<!--
================================================================================
MAINTAINER GUIDE
================================================================================
Location: .agents/skills/add-reward/SKILL.md

## How to Update
- When reward_score API changes: update Step 1 signature
- When RewardManager API changes: update Step 5
- When new reference implementations added: update table
================================================================================
-->