This document provides a comprehensive reference for the OpenRubricRL rubric schema. Rubrics define how model outputs should be evaluated and are converted into LLM-based scoring prompts.
OpenRubricRL uses a structured JSON/YAML schema to define evaluation rubrics. Each rubric contains:
- Metadata: Name, version, description, and domain
- Scale: Scoring range and type
- Criteria: Individual evaluation dimensions with weights
- Examples: Sample inputs/outputs with scores (optional)
- Hybrid Metrics: Automated metrics to combine with LLM scoring (optional)
```json
{
  "name": "string",
  "version": "string (semver)",
  "description": "string (optional)",
  "domain": "enum (optional)",
  "scale": {
    "min": "number",
    "max": "number",
    "type": "enum"
  },
  "criteria": [
    {
      "name": "string",
      "description": "string",
      "weight": "number (0-1)",
      "examples": "object (optional)",
      "subcriteria": "array (optional)"
    }
  ],
  "hybrid_metrics": "array (optional)",
  "metadata": "object (optional)"
}
```

- `name` (string): Unique identifier for the rubric. Example: `"code_quality_basic"`
- `version` (string, pattern `^\d+\.\d+\.\d+$`): Version of the rubric in semantic versioning format. Example: `"1.0.0"`
- `description` (string): Human-readable description of what this rubric evaluates. Example: `"Basic code quality evaluation rubric for Python functions"`
- `domain` (string, one of `code`, `dialogue`, `creative_writing`, `reasoning`, `general`): Domain this rubric is designed for. Example: `"code"`
The `scale` object defines the scoring range and type.
```json
{
  "min": 0.0,
  "max": 10.0,
  "type": "continuous"
}
```

- `min` (number): Minimum possible score. Example: `0.0`
- `max` (number): Maximum possible score. Example: `10.0`
- `type` (string, `"continuous"` or `"discrete"`; default `"continuous"`): Whether scores may take any value in the range (continuous) or only integers (discrete).
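A quick way to sanity-check a raw score against a scale is a small helper like the one below. This is an illustrative sketch, not part of OpenRubricRL's API:

```python
def score_in_scale(score, scale):
    """Check a raw score against a rubric's scale definition.

    `scale` mirrors the schema's scale object: min, max, and type.
    """
    if not scale["min"] <= score <= scale["max"]:
        return False
    # Discrete scales only allow whole-number scores.
    if scale.get("type", "continuous") == "discrete":
        return float(score).is_integer()
    return True

continuous = {"min": 0.0, "max": 10.0, "type": "continuous"}
discrete = {"min": 1, "max": 5, "type": "discrete"}

print(score_in_scale(7.5, continuous))  # True
print(score_in_scale(3.5, discrete))    # False: not an integer
```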
The `criteria` field is an array of evaluation criteria and must contain at least one criterion.
```json
{
  "name": "correctness",
  "description": "Does the code solve the problem correctly?",
  "weight": 0.4,
  "examples": {
    "excellent": [...],
    "good": [...],
    "poor": [...]
  },
  "subcriteria": [...]
}
```

- `name` (string): Unique name for this criterion. Example: `"correctness"`
- `description` (string): Detailed description of what this criterion evaluates. Example: `"Does the code solve the problem correctly and handle edge cases?"`
- `weight` (number, 0.0 to 1.0): Relative importance of this criterion. All criterion weights must sum to 1.0. Example: `0.4`
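Because the sum-to-1.0 rule is a common validation failure, it is worth checking before loading. A minimal sketch (with a float tolerance, since sums like 0.4 + 0.3 + 0.3 rarely land exactly on 1.0; the helper name is illustrative):

```python
import math

def check_weights(criteria):
    """Raise if criterion weights don't sum to 1.0 (within float tolerance)."""
    total = sum(c["weight"] for c in criteria)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Criterion weights must sum to 1.0, got {total}")

check_weights([{"weight": 0.4}, {"weight": 0.3}, {"weight": 0.3}])  # passes
```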
The `examples` field is an object containing example inputs/outputs with scores and explanations.
```json
{
  "excellent": [
    {
      "input": "Write a function to reverse a string",
      "output": "def reverse_string(s):\n    return s[::-1]",
      "score": 9.5,
      "explanation": "Correct and efficient implementation"
    }
  ],
  "good": [...],
  "poor": [...]
}
```

Example object structure:

- `input` (string): The task or prompt given
- `output` (string): The model's response
- `score` (number): Score for this example
- `explanation` (string): Why this score was given
The `subcriteria` field is an array of sub-criteria for more granular evaluation.
```json
[
  {
    "name": "naming",
    "description": "Variable and function names are descriptive"
  },
  {
    "name": "structure",
    "description": "Code is well-organized and follows conventions"
  }
]
```

The `hybrid_metrics` field is an array of automated metrics to combine with LLM scoring.
```json
[
  {
    "name": "bleu_score",
    "type": "bleu",
    "weight": 0.2,
    "config": {
      "reference_key": "expected_output"
    }
  }
]
```

- `name` (string): Name of the metric
- `type` (string, one of `bleu`, `rouge`, `accuracy`, `perplexity`, `custom`): Type of automated metric
- `weight` (number, 0.0 to 1.0): Weight of this metric in the final score
- `config` (object): Configuration specific to the metric type
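To show how the metric `weight` values might interact with the LLM score, here is a hedged sketch of one reasonable blending scheme. `combine_scores` is an illustrative helper, not OpenRubricRL's actual implementation:

```python
def combine_scores(llm_score, metric_scores, hybrid_metrics):
    """Blend an LLM rubric score with automated metric scores by weight.

    The LLM score takes whatever weight the metrics leave over, so a
    single metric at weight 0.2 leaves 0.8 for the LLM judge.
    """
    metric_weight = sum(m["weight"] for m in hybrid_metrics)
    combined = llm_score * (1.0 - metric_weight)
    for m in hybrid_metrics:
        combined += metric_scores[m["name"]] * m["weight"]
    return combined

metrics = [{"name": "bleu_score", "type": "bleu", "weight": 0.2}]
combined = combine_scores(8.0, {"bleu_score": 6.0}, metrics)  # ~7.6
```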
The optional `metadata` object holds additional information about the rubric.
```json
{
  "author": "OpenRubricRL Team",
  "created_at": "2024-01-01T00:00:00Z",
  "updated_at": "2024-01-15T10:30:00Z",
  "tags": ["python", "code-quality"],
  "license": "MIT"
}
```

A complete rubric in JSON:

```json
{
  "name": "code_quality_basic",
  "version": "1.0.0",
  "description": "Basic code quality evaluation for Python functions",
  "domain": "code",
  "scale": {
    "min": 0.0,
    "max": 10.0,
    "type": "continuous"
  },
  "criteria": [
    {
      "name": "correctness",
      "description": "Does the code solve the problem correctly?",
      "weight": 0.5,
      "examples": {
        "excellent": [
          {
            "input": "Write a function to find the maximum of two numbers",
            "output": "def max_two(a, b):\n    return a if a > b else b",
            "score": 9.0,
            "explanation": "Correct, concise, and readable"
          }
        ]
      }
    },
    {
      "name": "style",
      "description": "Does the code follow Python style conventions?",
      "weight": 0.3,
      "subcriteria": [
        {
          "name": "naming",
          "description": "Uses descriptive variable names"
        },
        {
          "name": "formatting",
          "description": "Follows PEP 8 formatting"
        }
      ]
    },
    {
      "name": "efficiency",
      "description": "Is the solution efficient?",
      "weight": 0.2
    }
  ],
  "metadata": {
    "author": "OpenRubricRL Team",
    "tags": ["python", "code-quality"],
    "license": "MIT"
  }
}
```

Rubrics can also be written in YAML:

```yaml
name: dialogue_quality
version: 2.1.0
description: Evaluates quality of conversational AI responses
domain: dialogue
scale:
  min: 1
  max: 5
  type: discrete
criteria:
  - name: relevance
    description: How well does the response address the user's question?
    weight: 0.4
  - name: helpfulness
    description: Does the response provide useful information?
    weight: 0.3
  - name: tone
    description: Is the tone appropriate and friendly?
    weight: 0.2
  - name: safety
    description: Is the response safe and appropriate?
    weight: 0.1
```
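For a discrete rubric like the dialogue example above, per-criterion scores are typically blended by weight and snapped back to whole numbers. The sketch below is one reasonable assumption about how that could work, not the library's actual scoring code:

```python
def weighted_overall(criterion_scores, criteria, discrete=True):
    """Combine per-criterion scores into one overall score by weight."""
    total = sum(criterion_scores[c["name"]] * c["weight"] for c in criteria)
    # Discrete scales report whole numbers, so round the blend.
    return round(total) if discrete else total

criteria = [
    {"name": "relevance", "weight": 0.4},
    {"name": "helpfulness", "weight": 0.3},
    {"name": "tone", "weight": 0.2},
    {"name": "safety", "weight": 0.1},
]
scores = {"relevance": 4, "helpfulness": 5, "tone": 3, "safety": 5}
overall = weighted_overall(scores, criteria)  # 4.2 rounds to 4
```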
An advanced example with hybrid metrics:

```json
{
  "name": "creative_writing_advanced",
  "version": "1.2.0",
  "description": "Advanced creative writing evaluation with automated metrics",
  "domain": "creative_writing",
  "scale": {
    "min": 0.0,
    "max": 100.0,
    "type": "continuous"
  },
  "criteria": [
    {
      "name": "creativity",
      "description": "Originality and creative expression",
      "weight": 0.4
    },
    {
      "name": "coherence",
      "description": "Logical flow and structure",
      "weight": 0.3
    },
    {
      "name": "engagement",
      "description": "How engaging and interesting is the writing?",
      "weight": 0.3
    }
  ],
  "hybrid_metrics": [
    {
      "name": "readability",
      "type": "custom",
      "weight": 0.1,
      "config": {
        "metric_function": "flesch_reading_ease"
      }
    }
  ]
}
```

OpenRubricRL automatically validates rubrics when loading:
```python
from openrubricrl.core.rubric import Rubric

# Load and validate from file
rubric = Rubric.from_file("my_rubric.json")

# Validate against the JSON schema
is_valid = rubric.validate_schema()
```

Common validation errors:

- Weight Sum Error: Criterion weights don't sum to 1.0
  `ValueError: Criterion weights must sum to 1.0, got 0.9`
- Invalid Version: Version doesn't follow semver format
  `ValidationError: version must match pattern ^\d+\.\d+\.\d+$`
- Invalid Domain: Domain not in allowed values
  `ValidationError: domain must be one of: code, dialogue, creative_writing, reasoning, general`
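The version error above comes from a simple regex, so it is easy to check up front. A small sketch (the helper name is illustrative, not the library's API):

```python
import re

# Same pattern the schema documents for the version field.
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")

def check_version(version):
    """Pre-flight check mirroring the schema's semver pattern."""
    if not SEMVER.match(version):
        raise ValueError(f"version must match ^\\d+\\.\\d+\\.\\d+$, got {version!r}")

check_version("1.0.0")   # passes
# check_version("1.0")   # would raise ValueError
```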
Use the CLI to validate rubrics:

```shell
# Validate only
openrubricrl validate my_rubric.json --validate-only

# Validate and pretty-print
openrubricrl validate my_rubric.json --output-format yaml
```

- Use 3-7 criteria: Too few lacks nuance, too many becomes unwieldy
- Make criteria independent: Avoid overlap between criteria
- Weight by importance: Critical aspects should have higher weights
- Provide clear descriptions: Be specific about what each criterion evaluates
- Include diverse examples: Cover excellent, good, and poor cases
- Provide explanations: Help the LLM understand your reasoning
- Use realistic scenarios: Examples should reflect actual use cases
- Keep examples concise: Focus on the key aspects being evaluated
- Continuous vs Discrete: Use continuous for nuanced scoring, discrete for clear categories
- Appropriate range: 0-10 for detailed scoring, 1-5 for simple ratings
- Consider your use case: Match scale to how scores will be used
- Semantic versioning: Use major.minor.patch format
- Document changes: Keep track of what changed between versions
- Backward compatibility: Consider impact of changes on existing evaluations
- Choose appropriate domain: Helps with prompt optimization
- Domain-specific criteria: Tailor criteria to your specific domain
- Use domain examples: Examples should be relevant to the domain
- Test with real data: Validate rubrics on actual model outputs
- Iterate based on results: Refine criteria and weights based on performance
- Get human feedback: Have domain experts review your rubrics
The complete JSON schema is available at `rubric_schema.json` in the repository root. You can use it to validate rubrics with any JSON Schema validator:

```shell
# Using ajv-cli
ajv validate -s rubric_schema.json -d my_rubric.json

# Using jsonschema (Python)
python -m jsonschema rubric_schema.json my_rubric.json
```

If you have rubrics from an older version, here are the key changes:
- Required fields: `version` is now required
- Weight validation: Weights must sum exactly to 1.0
- Domain enum: Limited to specific domains
- Example structure: Examples now require an `explanation` field
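Before reaching for the CLI, a hand-rolled pass can patch the most common gaps in old rubrics. The defaults below are illustrative assumptions, not what `openrubricrl migrate` actually writes:

```python
def patch_legacy_rubric(rubric):
    """Fill fields that newer schema versions require (illustrative defaults)."""
    rubric.setdefault("version", "0.0.1")  # `version` is now required
    for criterion in rubric.get("criteria", []):
        for examples in criterion.get("examples", {}).values():
            for example in examples:
                # Examples now require an `explanation` field.
                example.setdefault("explanation", "TODO: explain this score")
    return rubric

old = {"name": "legacy", "criteria": [{"name": "correctness", "weight": 1.0,
       "examples": {"good": [{"input": "q", "output": "a", "score": 7}]}}]}
patched = patch_legacy_rubric(old)
```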
Use the CLI to help migrate old rubrics:

```shell
openrubricrl migrate old_rubric.json --to-version 2.0.0
```

For more information, see the API Documentation and Integration Guide.