manulife-ai/models-eval-playbook

How to run the playbook

Prerequisites

  1. uv installed on your machine
  2. Databricks workspace with MLflow tracking enabled (or a local MLflow server)
  3. Databricks CLI installed (if using Databricks tracking)

Setup

1. Environment Configuration

Copy .env.example to .env and configure your API endpoints:

cp .env.example .env

Edit .env with your actual values. The playbook supports separate configurations for the model being tested and the judge model:

# MLflow Tracking - Databricks Unity Catalog
MLFLOW_TRACKING_URI=databricks://<DATABRICKS_AUTH_PROFILE>
MLFLOW_REGISTRY_URI=databricks-uc
DATABRICKS_HOST=<DATABRICKS_URL>
MLFLOW_EXPERIMENT_ID=<EXPERIMENT_ID>

# Model Configuration (the model being evaluated)
MODEL_API_BASE=https://openrouter.ai/api/v1
MODEL_API_KEY=<YOUR_MODEL_API_KEY>
MODEL_API_VERSION=preview

# Judge Configuration (the model scoring responses)
JUDGE_API_BASE=https://openrouter.ai/api/v1
JUDGE_API_KEY=<YOUR_JUDGE_API_KEY>
JUDGE_API_VERSION=preview

Supported Providers:

  • OpenRouter: https://openrouter.ai/api/v1
  • Azure OpenAI/Foundry: https://<INSTANCE_NAME>.openai.azure.com/openai/v1
  • Ollama (local): http://localhost:11434/v1/
  • Alibaba Cloud: https://[ServiceCode].[RegionID].aliyuncs.com/v1
  • Others (Anthropic, Bedrock, Cohere, Together AI) may require additional dependencies
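Because all of these expose OpenAI-compatible endpoints, the same client code works against any of them by swapping base_url. A minimal sketch using the openai Python SDK with the MODEL_API_* variables from .env above (illustrative only; the playbook may construct its client differently):

import os

from openai import OpenAI

# Point the client at whichever provider MODEL_API_BASE names
# (OpenRouter, Azure OpenAI, Ollama, ...); the variable names
# match the .env keys shown above.
client = OpenAI(
    base_url=os.environ["MODEL_API_BASE"],
    api_key=os.environ["MODEL_API_KEY"],
)

response = client.chat.completions.create(
    model="openai/gpt-4o",  # same model name you would pass to --model
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)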

2. Databricks Authentication

If using Databricks MLflow tracking, authenticate before running evaluations:

databricks auth login --profile <DATABRICKS_AUTH_PROFILE> --host <DATABRICKS_URL>

Note: Replace <DATABRICKS_AUTH_PROFILE> and <DATABRICKS_URL> with values from your .env file.
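With the profile in place, MLflow resolves credentials from the databricks:// tracking URI. A minimal sketch of wiring the .env values to MLflow, assuming they are already exported into the environment:

import os

import mlflow

# The databricks://<profile> URI reuses the CLI auth profile
# created by `databricks auth login` above.
mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
mlflow.set_registry_uri(os.environ["MLFLOW_REGISTRY_URI"])
mlflow.set_experiment(experiment_id=os.environ["MLFLOW_EXPERIMENT_ID"])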

Running Evaluations

Run Default Suites

When no --eval flag is provided, three default suites run: enterprise, hallucination, model (vision is excluded by default).

uv run playbook.py --model "openai/gpt-4o" --judge "openai/gpt-4o"

Run All Suites (Including Vision)

uv run playbook.py --model "openai/gpt-4o" --judge "openai/gpt-4o" --eval all

Run Specific Suites

Pass a comma-separated list to --eval (or -e). Suites always execute in canonical order regardless of the order you specify them on the command line.

# Run only the enterprise suite
uv run playbook.py --model "openai/gpt-4o" --judge "openai/gpt-4o" \
    --eval enterprise

# Run enterprise and model suites (executed as enterprise → model)
uv run playbook.py --model "openai/gpt-4o" --judge "openai/gpt-4o" \
    --eval enterprise,model

# Run enterprise and vision only
uv run playbook.py --model "openai/gpt-4o" --judge "openai/gpt-4o" \
    --eval enterprise,vision

Test with Limit (for debugging)

uv run playbook.py --model "google/gemini-3-flash-preview" --judge "gpt-5.2" --limit 1

Available Suites

Suite          Data File                Description
enterprise     data/enterprise.json     Brand safety, regulatory compliance, privacy, prompt injection, and other enterprise policy checks
hallucination  data/hallucination.json  Groundedness and factual accuracy evaluation
model          data/data.json           General model quality: correctness, clarity, BLEU, ROUGE, cosine similarity
vision         data/vision.json         Multimodal / vision evaluation (optional; skipped if the data file is absent)

Concurrency

Suites run sequentially. Within each suite, MLflow parallelizes predict_fn calls internally. Use --max-concurrent to control how many predict_fn calls can run simultaneously (default: 10).
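The exact scheduling lives inside MLflow, but the effect of --max-concurrent can be pictured as a semaphore around predict_fn. A hypothetical sketch of that bound (illustrative only, not the playbook's internals):

import asyncio

MAX_CONCURRENT = 10  # mirrors the --max-concurrent default

semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def call_model(prompt: str) -> str:
    # Stand-in for the real predict_fn; sleeps to simulate an API call.
    await asyncio.sleep(0.1)
    return f"response to: {prompt}"

async def bounded_predict(prompt: str) -> str:
    # At most MAX_CONCURRENT calls are in flight at any moment;
    # the rest wait here until a slot frees up.
    async with semaphore:
        return await call_model(prompt)

async def main() -> None:
    prompts = [f"case {i}" for i in range(25)]
    results = await asyncio.gather(*(bounded_predict(p) for p in prompts))
    print(len(results), "responses")

asyncio.run(main())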

If a suite's data file is missing, that suite is skipped with a warning.

Command Options

Option            Short  Required  Description
--model                  yes       Model name to evaluate (e.g. openai/gpt-4o)
--judge                  yes       Judge model name for scoring
--provider                         Provider for the model being tested (default: openai)
--judge-provider                   Provider for the judge model (default: same as --provider)
--eval            -e               Comma-separated suites or all (default: enterprise,hallucination,model)
--limit                            Limit the number of test cases per dataset (useful for debugging)
--max-concurrent  -c               Max concurrent predict_fn calls across all suites (default: 10)
--data-dir                         Path to the test data directory (default: ./data)
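For reference, these options map onto a CLI declaration roughly like the following hypothetical argparse sketch (playbook.py may define them differently):

import argparse

# Hypothetical reconstruction of the CLI surface from the table above.
parser = argparse.ArgumentParser(description="Run evaluation suites")
parser.add_argument("--model", required=True, help="Model name to evaluate")
parser.add_argument("--judge", required=True, help="Judge model name for scoring")
parser.add_argument("--provider", default="openai")
parser.add_argument("--judge-provider", default=None)  # falls back to --provider
parser.add_argument("--eval", "-e", default="enterprise,hallucination,model")
parser.add_argument("--limit", type=int, default=None)
parser.add_argument("--max-concurrent", "-c", type=int, default=10)
parser.add_argument("--data-dir", default="./data")
args = parser.parse_args()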

MLflow Run Structure

Parent Run: playbook-{model}-{epoch}
├── Child: playbook-{model}-enterprise-{epoch}
├── Child: playbook-{model}-hallucination-{epoch}
├── Child: playbook-{model}-model-{epoch}
└── Child: playbook-{model}-vision-{epoch}

Each child run contains the evaluation metrics and traces for its suite. When running a subset of suites, only the selected child runs are created. Average token usage metrics are logged to every child run and propagated to the parent.
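This layout corresponds to MLflow's nested runs. A minimal sketch that produces the same parent/child shape (illustrative; the metric name and value are placeholders):

import time

import mlflow

model = "openai/gpt-4o"
epoch = int(time.time())

# One parent run, one nested child per selected suite.
with mlflow.start_run(run_name=f"playbook-{model}-{epoch}"):
    for suite in ["enterprise", "hallucination", "model"]:
        with mlflow.start_run(
            run_name=f"playbook-{model}-{suite}-{epoch}", nested=True
        ):
            mlflow.log_metric("avg_total_tokens", 0.0)  # placeholder value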

Viewing Results

Results are automatically logged to your configured MLflow tracking server.

Databricks Users

View results in your Databricks workspace:

https://<DATABRICKS_HOST>/ml/experiments/<EXPERIMENT_ID>

Local MLflow Users

Start the MLflow UI:

mlflow ui

Then navigate to http://localhost:5000

Generating Reports

Compare results across models from an MLflow experiment:

# Print comparison table
uv run report.py --experiment <EXPERIMENT_ID>

# Export to CSV
uv run report.py --experiment <EXPERIMENT_ID> --output results.csv
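report.py handles this end to end; the underlying query can be approximated with mlflow.search_runs, which returns a pandas DataFrame with one row per run and metric columns prefixed with "metrics.". A minimal sketch (column selection is illustrative):

import mlflow

EXPERIMENT_ID = "<EXPERIMENT_ID>"  # same ID as in .env

runs = mlflow.search_runs(experiment_ids=[EXPERIMENT_ID])
cols = ["tags.mlflow.runName"] + [c for c in runs.columns if c.startswith("metrics.")]
print(runs[cols].to_string(index=False))
runs[cols].to_csv("results.csv", index=False)  # same idea as --output results.csv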

Configuration Details

Separate Model and Judge Endpoints

The playbook supports different API endpoints for the tested model and judge model. This enables scenarios like:

  • Testing a local Ollama model with GPT-5 as judge
  • Testing an Azure deployment with OpenRouter judge
  • Using different API keys for tested vs judge models

Environment Variable Fallback Chain:

  1. Model being tested: MODEL_API_* → OPENAI_API_*
  2. Judge model: JUDGE_API_* → MODEL_API_* → OPENAI_API_*

This maintains backward compatibility with existing configurations.
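A minimal sketch of that resolution order (the resolve helper is hypothetical, not part of the playbook):

import os

def resolve(*names: str) -> str | None:
    # Return the first environment variable in the chain that is set.
    for name in names:
        value = os.environ.get(name)
        if value:
            return value
    return None

# Model being tested: MODEL_API_* falls back to OPENAI_API_*
model_key = resolve("MODEL_API_KEY", "OPENAI_API_KEY")
# Judge model: JUDGE_API_* → MODEL_API_* → OPENAI_API_*
judge_key = resolve("JUDGE_API_KEY", "MODEL_API_KEY", "OPENAI_API_KEY")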

⚠️ Never commit .env to version control. Ensure it is listed in .gitignore.

Evaluation Specification

See enterprise.md for the full enterprise evaluation specification, including all 45 adversarial test prompts and expected model behaviours across 9 policy categories.

About

Evaluate whether an LLM is fit for enterprise use cases.
