
Rubric Schema Reference

This document provides a comprehensive reference for the OpenRubricRL rubric schema. Rubrics define how model outputs should be evaluated and are converted into LLM-based scoring prompts.

Overview

OpenRubricRL uses a structured JSON/YAML schema to define evaluation rubrics. Each rubric contains:

  • Metadata: Name, version, description, and domain
  • Scale: Scoring range and type
  • Criteria: Individual evaluation dimensions with weights
  • Examples: Sample inputs/outputs with scores (optional)
  • Hybrid Metrics: Automated metrics to combine with LLM scoring (optional)

Schema Structure

{
  "name": "string",
  "version": "string (semver)",
  "description": "string (optional)",
  "domain": "enum (optional)",
  "scale": {
    "min": "number",
    "max": "number", 
    "type": "enum"
  },
  "criteria": [
    {
      "name": "string",
      "description": "string",
      "weight": "number (0-1)",
      "examples": "object (optional)",
      "subcriteria": "array (optional)"
    }
  ],
  "hybrid_metrics": "array (optional)",
  "metadata": "object (optional)"
}

Field Reference

Root Level Fields

name (required)

  • Type: string
  • Description: Unique identifier for the rubric
  • Example: "code_quality_basic"

version (required)

  • Type: string
  • Pattern: ^\d+\.\d+\.\d+$ (semantic versioning)
  • Description: Version of the rubric following semver format
  • Example: "1.0.0"

description (optional)

  • Type: string
  • Description: Human-readable description of what this rubric evaluates
  • Example: "Basic code quality evaluation rubric for Python functions"

domain (optional)

  • Type: string
  • Enum: ["code", "dialogue", "creative_writing", "reasoning", "general"]
  • Description: Domain this rubric is designed for
  • Example: "code"

Scale Object

scale (required)

Defines the scoring range and type.

{
  "min": 0.0,
  "max": 10.0,
  "type": "continuous"
}
scale.min (required)
  • Type: number
  • Description: Minimum possible score
  • Example: 0.0
scale.max (required)
  • Type: number
  • Description: Maximum possible score
  • Example: 10.0
scale.type (optional)
  • Type: string
  • Enum: ["continuous", "discrete"]
  • Default: "continuous"
  • Description: Whether scores can be any value in range (continuous) or only integers (discrete)
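The continuous/discrete distinction can be checked mechanically. A minimal sketch in plain Python (an illustration against a `scale` dict, not part of the OpenRubricRL API):

```python
def is_valid_score(score, scale):
    """Check a score against a scale dict like {"min": 1, "max": 5, "type": "discrete"}."""
    if not scale["min"] <= score <= scale["max"]:
        return False
    if scale.get("type", "continuous") == "discrete":
        # Discrete scales only accept whole-number scores
        return float(score).is_integer()
    return True

scale = {"min": 1, "max": 5, "type": "discrete"}
is_valid_score(3, scale)    # True
is_valid_score(3.5, scale)  # False: discrete scales only allow integers
is_valid_score(6, scale)    # False: out of range
```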

Criteria Array

criteria (required)

Array of evaluation criteria. Must contain at least one criterion.

{
  "name": "correctness",
  "description": "Does the code solve the problem correctly?",
  "weight": 0.4,
  "examples": {
    "excellent": [...],
    "good": [...],
    "poor": [...]
  },
  "subcriteria": [...]
}
criteria[].name (required)
  • Type: string
  • Description: Unique name for this criterion
  • Example: "correctness"
criteria[].description (required)
  • Type: string
  • Description: Detailed description of what this criterion evaluates
  • Example: "Does the code solve the problem correctly and handle edge cases?"
criteria[].weight (required)
  • Type: number
  • Range: 0.0 to 1.0
  • Constraint: All criterion weights must sum to 1.0
  • Description: Relative importance of this criterion
  • Example: 0.4
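Because weights must sum to exactly 1.0, it helps to check this before loading a rubric. A small standalone helper (plain Python with a float tolerance; not the library's own validator):

```python
import math

def check_weights(criteria):
    """Raise ValueError if criterion weights don't sum to 1.0 (within float tolerance)."""
    total = sum(c["weight"] for c in criteria)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Criterion weights must sum to 1.0, got {total}")
    return total

criteria = [
    {"name": "correctness", "weight": 0.5},
    {"name": "style", "weight": 0.3},
    {"name": "efficiency", "weight": 0.2},
]
check_weights(criteria)  # passes: 0.5 + 0.3 + 0.2 sums to 1.0
```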
criteria[].examples (optional)

Object containing example inputs/outputs with scores and explanations.

{
  "excellent": [
    {
      "input": "Write a function to reverse a string",
      "output": "def reverse_string(s):\n    return s[::-1]",
      "score": 9.5,
      "explanation": "Correct and efficient implementation"
    }
  ],
  "good": [...],
  "poor": [...]
}

Example Object Structure:

  • input (string): The task or prompt given
  • output (string): The model's response
  • score (number): Score for this example
  • explanation (string): Why this score was given
criteria[].subcriteria (optional)

Array of sub-criteria for more granular evaluation.

[
  {
    "name": "naming",
    "description": "Variable and function names are descriptive"
  },
  {
    "name": "structure", 
    "description": "Code is well-organized and follows conventions"
  }
]

Hybrid Metrics (Optional)

hybrid_metrics (optional)

Array of automated metrics to combine with LLM scoring.

[
  {
    "name": "bleu_score",
    "type": "bleu",
    "weight": 0.2,
    "config": {
      "reference_key": "expected_output"
    }
  }
]
hybrid_metrics[].name (required)
  • Type: string
  • Description: Name of the metric
hybrid_metrics[].type (required)
  • Type: string
  • Enum: ["bleu", "rouge", "accuracy", "perplexity", "custom"]
  • Description: Type of automated metric
hybrid_metrics[].weight (required)
  • Type: number
  • Range: 0.0 to 1.0
  • Description: Weight of this metric in final score
hybrid_metrics[].config (optional)
  • Type: object
  • Description: Configuration specific to the metric type
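How the metric weights combine with the LLM score is not spelled out here; one common scheme, shown purely as an illustrative assumption, is a weighted average in which each automated metric claims its stated weight and the LLM score takes the remainder:

```python
def combine_scores(llm_score, metric_scores, metric_weights):
    """Blend an LLM rubric score with automated metrics.

    Assumes (hypothetically) that metric weights are fractions of the final
    score and the LLM score fills the remaining weight.
    """
    llm_weight = 1.0 - sum(metric_weights)
    blended = llm_weight * llm_score
    for score, weight in zip(metric_scores, metric_weights):
        blended += weight * score
    return blended

# e.g. LLM score 8.0, one BLEU-derived score of 6.0 at weight 0.2
combine_scores(8.0, [6.0], [0.2])  # 0.8 * 8.0 + 0.2 * 6.0 ≈ 7.6
```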

Metadata (Optional)

metadata (optional)

Additional information about the rubric.

{
  "author": "OpenRubricRL Team",
  "created_at": "2024-01-01T00:00:00Z",
  "updated_at": "2024-01-15T10:30:00Z",
  "tags": ["python", "code-quality"],
  "license": "MIT"
}

Examples

Basic Code Quality Rubric

{
  "name": "code_quality_basic",
  "version": "1.0.0",
  "description": "Basic code quality evaluation for Python functions",
  "domain": "code",
  "scale": {
    "min": 0.0,
    "max": 10.0,
    "type": "continuous"
  },
  "criteria": [
    {
      "name": "correctness",
      "description": "Does the code solve the problem correctly?",
      "weight": 0.5,
      "examples": {
        "excellent": [
          {
            "input": "Write a function to find the maximum of two numbers",
            "output": "def max_two(a, b):\n    return a if a > b else b",
            "score": 9.0,
            "explanation": "Correct, concise, and readable"
          }
        ]
      }
    },
    {
      "name": "style",
      "description": "Does the code follow Python style conventions?",
      "weight": 0.3,
      "subcriteria": [
        {
          "name": "naming",
          "description": "Uses descriptive variable names"
        },
        {
          "name": "formatting",
          "description": "Follows PEP 8 formatting"
        }
      ]
    },
    {
      "name": "efficiency",
      "description": "Is the solution efficient?",
      "weight": 0.2
    }
  ],
  "metadata": {
    "author": "OpenRubricRL Team",
    "tags": ["python", "code-quality"],
    "license": "MIT"
  }
}

Dialogue Quality Rubric

name: dialogue_quality
version: 2.1.0
description: Evaluates quality of conversational AI responses
domain: dialogue
scale:
  min: 1
  max: 5
  type: discrete
criteria:
  - name: relevance
    description: How well does the response address the user's question?
    weight: 0.4
  - name: helpfulness
    description: Does the response provide useful information?
    weight: 0.3
  - name: tone
    description: Is the tone appropriate and friendly?
    weight: 0.2
  - name: safety
    description: Is the response safe and appropriate?
    weight: 0.1

Creative Writing Rubric with Hybrid Metrics

{
  "name": "creative_writing_advanced",
  "version": "1.2.0",
  "description": "Advanced creative writing evaluation with automated metrics",
  "domain": "creative_writing",
  "scale": {
    "min": 0.0,
    "max": 100.0,
    "type": "continuous"
  },
  "criteria": [
    {
      "name": "creativity",
      "description": "Originality and creative expression",
      "weight": 0.4
    },
    {
      "name": "coherence",
      "description": "Logical flow and structure",
      "weight": 0.3
    },
    {
      "name": "engagement",
      "description": "How engaging and interesting is the writing?",
      "weight": 0.3
    }
  ],
  "hybrid_metrics": [
    {
      "name": "readability",
      "type": "custom",
      "weight": 0.1,
      "config": {
        "metric_function": "flesch_reading_ease"
      }
    }
  ]
}

Validation

Automatic Validation

OpenRubricRL automatically validates rubrics when loading:

from openrubricrl.core.rubric import Rubric

# Load and validate from file
rubric = Rubric.from_file("my_rubric.json")

# Validate against JSON schema
is_valid = rubric.validate_schema()

Common Validation Errors

  1. Weight Sum Error: Criterion weights don't sum to 1.0

    ValueError: Criterion weights must sum to 1.0, got 0.9
    
  2. Invalid Version: Version doesn't follow semver format

    ValidationError: version must match pattern ^\d+\.\d+\.\d+$
    
  3. Invalid Domain: Domain not in allowed values

    ValidationError: domain must be one of: code, dialogue, creative_writing, reasoning, general
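The semver check behind error 2 can be reproduced with the standard library alone. A rough sketch (the pattern mirrors the schema's; this helper is not part of OpenRubricRL):

```python
import re

SEMVER = re.compile(r"^\d+\.\d+\.\d+$")

def check_version(version):
    """Raise ValueError if the version string is not major.minor.patch."""
    if not SEMVER.match(version):
        raise ValueError(rf"version must match pattern ^\d+\.\d+\.\d+$, got {version!r}")

check_version("1.0.0")   # ok
# check_version("1.0")   # would raise ValueError
```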
    

CLI Validation

Use the CLI to validate rubrics:

# Validate only
openrubricrl validate my_rubric.json --validate-only

# Validate and pretty-print
openrubricrl validate my_rubric.json --output-format yaml

Best Practices

1. Criterion Design

  • Use 3-7 criteria: Fewer lack nuance; more become unwieldy
  • Make criteria independent: Avoid overlap between criteria
  • Weight by importance: Critical aspects should have higher weights
  • Provide clear descriptions: Be specific about what each criterion evaluates

2. Examples

  • Include diverse examples: Cover excellent, good, and poor cases
  • Provide explanations: Help the LLM understand your reasoning
  • Use realistic scenarios: Examples should reflect actual use cases
  • Keep examples concise: Focus on the key aspects being evaluated

3. Scale Selection

  • Continuous vs Discrete: Use continuous for nuanced scoring, discrete for clear categories
  • Appropriate range: 0-10 for detailed scoring, 1-5 for simple ratings
  • Consider your use case: Match scale to how scores will be used
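When scores come back from an LLM they may fall outside the scale or miss the discrete grid; normalizing them is straightforward. A minimal sketch (plain Python, not a library function):

```python
def snap_to_scale(raw, scale):
    """Clamp a raw score into the scale's range, rounding if the scale is discrete."""
    score = max(scale["min"], min(scale["max"], raw))
    if scale.get("type", "continuous") == "discrete":
        score = round(score)
    return score

snap_to_scale(4.6, {"min": 1, "max": 5, "type": "discrete"})      # 5
snap_to_scale(11.2, {"min": 0, "max": 10, "type": "continuous"})  # 10
```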

4. Version Management

  • Semantic versioning: Use major.minor.patch format
  • Document changes: Keep track of what changed between versions
  • Backward compatibility: Consider impact of changes on existing evaluations

5. Domain Specificity

  • Choose appropriate domain: Helps with prompt optimization
  • Domain-specific criteria: Tailor criteria to your specific domain
  • Use domain examples: Examples should be relevant to the domain

6. Testing and Iteration

  • Test with real data: Validate rubrics on actual model outputs
  • Iterate based on results: Refine criteria and weights based on performance
  • Get human feedback: Have domain experts review your rubrics

Schema Validation

The complete JSON schema is available at rubric_schema.json in the repository root. You can use it to validate rubrics in any JSON schema validator:

# Using ajv-cli
ajv validate -s rubric_schema.json -d my_rubric.json

# Using jsonschema (Python)
python -m jsonschema -i my_rubric.json rubric_schema.json
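The same validation works programmatically with the jsonschema package. The schema fragment below is a hypothetical minimal slice of `rubric_schema.json`, included only so the example is self-contained:

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# Hypothetical minimal slice of rubric_schema.json, for illustration only
schema = {
    "type": "object",
    "required": ["name", "version", "scale", "criteria"],
    "properties": {
        "version": {"type": "string", "pattern": r"^\d+\.\d+\.\d+$"},
    },
}

rubric = {"name": "demo", "version": "1.0", "scale": {}, "criteria": []}
try:
    validate(instance=rubric, schema=schema)
except ValidationError as e:
    print(e.message)  # reports that "1.0" fails the semver pattern
```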

Migration Guide

From v1.0 to v2.0

If you have rubrics from an older version, here are the key changes:

  1. Required fields: version is now required
  2. Weight validation: Weights must sum exactly to 1.0
  3. Domain enum: Limited to specific domains
  4. Example structure: Examples now require explanation field

Automated Migration

Use the CLI to help migrate old rubrics:

openrubricrl migrate old_rubric.json --to-version 2.0.0

For more information, see the API Documentation and Integration Guide.