
# Python API

This guide covers programmatic usage of the evaluation harness in Python scripts and applications.

## Overview

The library provides three main ways to run evaluations programmatically:

| Function | Use Case |
| --- | --- |
| `simple_evaluate()` | Most common - accepts model name strings or `LM` objects |
| `EvaluatorConfig` | Config-based - load settings from YAML or a dataclass |
| `evaluate()` | Low-level - full control over task dictionaries |

## Quick Start

The simplest way to run an evaluation:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag"],
)

print(results["results"])
```

## Using simple_evaluate()

The `simple_evaluate()` function is the recommended entry point for most use cases.

### Basic Usage

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2,dtype=float32",
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=5,
    batch_size=8,
    device="cuda:0",
)
```

### With a Pre-initialized Model

```python
import lm_eval
from lm_eval.models.huggingface import HFLM

# Initialize model separately
lm = HFLM(pretrained="gpt2", batch_size=16)

results = lm_eval.simple_evaluate(
    model=lm,
    tasks=["hellaswag"],
    num_fewshot=0,
)
```

### With External Tasks

```python
import lm_eval
from lm_eval.tasks import TaskManager

# Include custom task definitions
task_manager = TaskManager(include_path="/path/to/custom/tasks")

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["my_custom_task"],
    task_manager=task_manager,
)
```

### Common Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `model` | `str` or `LM` | Model name (e.g., `"hf"`, `"vllm"`) or an `LM` instance |
| `model_args` | `str` or `dict` | Model constructor arguments |
| `tasks` | `list[str]` | Task names to evaluate |
| `num_fewshot` | `int` | Number of few-shot examples |
| `batch_size` | `int` or `str` | Batch size, or `"auto"` |
| `device` | `str` | Device (`cuda`, `cpu`, `mps`) |
| `limit` | `int` or `float` | Limit the number of examples per task |
| `log_samples` | `bool` | Save model inputs/outputs |
| `task_manager` | `TaskManager` | For external tasks |
| `gen_kwargs` | `dict` | Generation arguments |
| `apply_chat_template` | `bool` or `str` | Use a chat template |
| `system_instruction` | `str` | System prompt |
| `fewshot_as_multiturn` | `bool` | Treat few-shot examples as a multi-turn conversation |

See `lm_eval/evaluator.py` for the complete parameter list.

### Return Value

`simple_evaluate()` returns a dictionary with:

```python
{
    "results": {
        "task_name": {
            "metric_name": value,
            "metric_name,stderr": stderr_value,
        }
    },
    "configs": {...},      # Task configurations
    "versions": {...},     # Task versions
    "n-shot": {...},       # Few-shot counts
    "higher_is_better": {...},
    "n-samples": {...},
    "samples": {...},      # Only present if log_samples=True
}
```
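A convenient way to inspect this structure is to iterate over `results["results"]` and skip the stderr entries. A minimal sketch using placeholder data (the task name, metric key, and values below are illustrative, not real evaluation output):

```python
# Placeholder output with the shape documented above.
results = {
    "results": {
        "hellaswag": {"acc,none": 0.2889, "acc,none,stderr": 0.0045},
    },
    "n-shot": {"hellaswag": 0},
}

# Print each non-stderr metric, annotated with its few-shot count.
for task, metrics in results["results"].items():
    shots = results["n-shot"][task]
    for name, value in metrics.items():
        if "stderr" not in name:
            print(f"{task} ({shots}-shot) {name} = {value:.4f}")
```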

## Using EvaluatorConfig

The `EvaluatorConfig` class provides a structured way to manage evaluation settings.

### From YAML File

```python
from lm_eval.config.evaluate_config import EvaluatorConfig
import lm_eval

# Load configuration from YAML
config = EvaluatorConfig.from_config("eval_config.yaml")

# Process tasks
task_manager = config.process_tasks()

# Run evaluation
results = lm_eval.simple_evaluate(
    model=config.model,
    model_args=config.model_args,
    tasks=config.tasks,
    num_fewshot=config.num_fewshot,
    batch_size=config.batch_size,
    device=config.device,
    task_manager=task_manager,
    log_samples=config.log_samples,
    gen_kwargs=config.gen_kwargs,
    apply_chat_template=config.apply_chat_template,
    system_instruction=config.system_instruction,
)
```

### Direct Instantiation

```python
from lm_eval.config.evaluate_config import EvaluatorConfig

config = EvaluatorConfig(
    model="hf",
    model_args={"pretrained": "gpt2", "dtype": "float32"},
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=5,
    batch_size=8,
    device="cuda:0",
    output_path="./results/",
    log_samples=True,
)

# Validate and process
task_manager = config.process_tasks()
```

### Config Fields

See the Configuration Guide for all available fields.


## Using evaluate()

The `evaluate()` function provides lower-level control, accepting pre-built task dictionaries.

### With Custom Task Objects

```python
import lm_eval
from lm_eval.tasks import TaskManager, get_task_dict
from lm_eval.models.huggingface import HFLM

# Initialize model
lm = HFLM(pretrained="gpt2", batch_size=16)

# Build task dictionary
task_manager = TaskManager(include_path="/path/to/custom/tasks")
task_dict = get_task_dict(
    ["hellaswag", "my_custom_task"],
    task_manager
)

# Run evaluation
results = lm_eval.evaluate(
    lm=lm,
    task_dict=task_dict,
    num_fewshot=5,
    limit=100,
)
```

### Mixed Task Sources

```python
from lm_eval.tasks import TaskManager, get_task_dict

task_manager = TaskManager(include_path="/path/to/custom/tasks")

# Combine different task sources
task_dict = get_task_dict(
    [
        "mmlu",                           # Stock task name
        "my_custom_task",                 # From include_path
        {"task": "inline_task", ...},     # Inline config dict
    ],
    task_manager
)
```

## Custom Models

To evaluate a custom model, create a subclass of `lm_eval.api.model.LM`:

```python
from lm_eval.api.model import LM

class MyCustomLM(LM):
    def __init__(self, model, batch_size=1):
        super().__init__()
        self.model = model
        self._batch_size = batch_size

    def loglikelihood(self, requests):
        # Return list of (logprob, is_greedy) tuples
        ...

    def generate_until(self, requests):
        # Return list of generated strings
        ...

    def loglikelihood_rolling(self, requests):
        # Return list of loglikelihood floats
        ...

    @property
    def batch_size(self):
        return self._batch_size
```

Then use it with `simple_evaluate()`:

```python
import lm_eval

my_model = load_my_model()  # your own model-loading logic
lm = MyCustomLM(model=my_model, batch_size=16)

results = lm_eval.simple_evaluate(
    model=lm,
    tasks=["hellaswag"],
)
```

For detailed guidance on implementing custom models, see the Model Guide.


## Logging

Configure logging for debugging:

```python
from lm_eval.utils import setup_logging

# Set log level
setup_logging("DEBUG")  # DEBUG, INFO, WARNING, ERROR

# Or use an environment variable
import os
os.environ["LMEVAL_LOG_LEVEL"] = "DEBUG"
```

## Examples

### Batch Evaluation of Multiple Models

```python
import lm_eval

models = [
    "gpt2",
    "gpt2-medium",
    "gpt2-large",
]

all_results = {}
for model_name in models:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_name}",
        tasks=["hellaswag"],
        batch_size="auto",
    )
    all_results[model_name] = results["results"]
```
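Once `all_results` is populated, the models can be compared on a shared metric. A sketch with illustrative numbers (the scores below are placeholders, and the `"acc,none"` metric key is an assumption about how the task reports accuracy):

```python
# Placeholder scores standing in for real all_results contents.
all_results = {
    "gpt2": {"hellaswag": {"acc,none": 0.29}},
    "gpt2-medium": {"hellaswag": {"acc,none": 0.33}},
    "gpt2-large": {"hellaswag": {"acc,none": 0.36}},
}

# Pick the model with the highest hellaswag accuracy.
best = max(all_results, key=lambda m: all_results[m]["hellaswag"]["acc,none"])
print(best)
```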

### Save and Load Results

```python
import json
import lm_eval
from lm_eval.utils import handle_non_serializable

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag"],
)

# Save results
with open("results.json", "w") as f:
    json.dump(results, f, default=handle_non_serializable, indent=2)
```
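Loading the saved file back only needs plain `json.load`. A self-contained round-trip sketch using placeholder data (a real file would contain the full structure shown earlier):

```python
import json
import os
import tempfile

# Placeholder results dict standing in for real evaluation output.
results = {"results": {"hellaswag": {"acc,none": 0.2889}}}

# Save to a temporary location.
path = os.path.join(tempfile.mkdtemp(), "results.json")
with open(path, "w") as f:
    json.dump(results, f, indent=2)

# Load results back.
with open(path) as f:
    loaded = json.load(f)

print(loaded["results"]["hellaswag"]["acc,none"])
```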