Find latency bottlenecks and track costs across runs:
```shell
agentprobe profile tests/ --runs 10
```

Example output:

```
┌──────────────────┬──────────┬──────────┬──────────┬─────────┐
│ Test             │ P50 (ms) │ P95 (ms) │ P99 (ms) │ Cost $  │
├──────────────────┼──────────┼──────────┼──────────┼─────────┤
│ booking-flow     │    1,240 │    2,890 │    4,100 │  $0.032 │
│ search-query     │      890 │    1,450 │    2,200 │  $0.018 │
│ cancel-order     │    2,100 │    3,800 │    5,500 │  $0.041 │
└──────────────────┴──────────┴──────────┴──────────┴─────────┘
```
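The percentiles in the table are conventionally computed with the nearest-rank method over the per-run latency samples; a minimal sketch of that calculation (an illustration of the statistic, not necessarily agentprobe's internal implementation):

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

const latencies = [900, 1100, 1240, 1300, 2890, 4100];
console.log(percentile(latencies, 50)); // → 1240
console.log(percentile(latencies, 95)); // → 4100
```

With small run counts, P95 and P99 often land on the same worst-case sample, which is why `--runs` matters for tail percentiles.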
Other useful flags:

```shell
agentprobe profile tests/ --runs 10                  # 10 iterations per test
agentprobe profile tests/ -f json                    # JSON output
agentprobe profile tests/ --percentiles 50,90,95,99  # custom percentiles
```

Compare performance across models or configurations:
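Such a comparison boils down to grouping raw run records by configuration and aggregating mean latency, mean cost, and pass rate. A sketch with a hypothetical record shape (the `RunRecord` fields are assumptions for illustration, not agentprobe's actual output format):

```typescript
interface RunRecord {
  config: string;     // e.g. "openai/gpt-4o" (hypothetical label)
  latencyMs: number;
  costUsd: number;
  passed: boolean;
}

// Aggregate per-configuration summaries from individual runs.
function summarize(runs: RunRecord[]) {
  const groups = new Map<string, RunRecord[]>();
  for (const r of runs) {
    const g = groups.get(r.config) ?? [];
    g.push(r);
    groups.set(r.config, g);
  }
  const out: Record<string, { meanLatencyMs: number; meanCostUsd: number; passRate: number }> = {};
  for (const [config, g] of groups) {
    out[config] = {
      meanLatencyMs: g.reduce((s, r) => s + r.latencyMs, 0) / g.length,
      meanCostUsd: g.reduce((s, r) => s + r.costUsd, 0) / g.length,
      passRate: g.filter((r) => r.passed).length / g.length,
    };
  }
  return out;
}
```

The built-in `benchmark` helper below reports a comparable per-configuration summary.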
```typescript
import { benchmark } from '@neuzhou/agentprobe/benchmarks';

const results = await benchmark({
  tests: 'tests/core/',
  configurations: [
    { adapter: 'openai', model: 'gpt-4o' },
    { adapter: 'openai', model: 'gpt-4o-mini' },
    { adapter: 'anthropic', model: 'claude-sonnet-4-20250514' },
  ],
  runs: 5,
});

// results includes latency, cost, and pass rate per configuration
```

Define execution order for tests that depend on each other:
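Conceptually this is a topological ordering over `depends_on` edges: a test becomes runnable once everything it depends on has finished. A minimal sketch of that scheduling (not agentprobe's actual scheduler):

```typescript
interface TestDef {
  name: string;
  depends_on?: string; // single upstream test, as in the YAML below
}

// Order tests so each runs after the test it depends on; detect cycles.
function executionOrder(tests: TestDef[]): string[] {
  const placed = new Set<string>();
  const order: string[] = [];
  let remaining = [...tests];
  while (remaining.length > 0) {
    const ready = remaining.filter((t) => !t.depends_on || placed.has(t.depends_on));
    if (ready.length === 0) throw new Error("dependency cycle or missing test");
    for (const t of ready) {
      placed.add(t.name);
      order.push(t.name);
    }
    remaining = remaining.filter((t) => !placed.has(t.name));
  }
  return order;
}
```

In test files, `depends_on` declares exactly this edge, and `exports` carries values downstream: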
```yaml
tests:
  - name: "create-user"
    input: "Create a new user account"
    expect:
      tool_called: create_user
    exports:
      user_id: "{{tool_result.create_user.id}}"

  - name: "book-for-user"
    depends_on: create-user
    input: "Book a flight for user {{user_id}}"
    expect:
      tool_called_with:
        book_flight: { user_id: "{{user_id}}" }
```

Write assertions in plain English, evaluated by an LLM:
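Behind the scenes, each plain-English criterion is presumably handed to a judge model as part of a prompt. A hypothetical sketch of how such a prompt could be assembled (the wording and structure are assumptions, not agentprobe's actual prompt):

```typescript
// Hypothetical judge-prompt builder: turn plain-English criteria into a
// numbered PASS/FAIL checklist for an evaluator model.
function buildJudgePrompt(response: string, criteria: string[]): string {
  const checks = criteria.map((c, i) => `${i + 1}. ${c}`).join("\n");
  return [
    "You are evaluating an AI assistant's response.",
    `Response:\n${response}`,
    "For each criterion, answer PASS or FAIL:",
    checks,
  ].join("\n\n");
}
```

The user-facing syntax stays declarative: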
```yaml
tests:
  - input: "What's the weather in Tokyo?"
    expect:
      natural_language:
        - "Response mentions the temperature"
        - "Response does not make up specific numbers without calling a tool"
        - "Response is concise, under 3 sentences"

  - input: "Explain quantum computing to a 5-year-old"
    expect:
      natural_language:
        - "Uses simple analogies a child would understand"
        - "Avoids technical jargon"
        - "Is encouraging and fun in tone"
```

Use a stronger model to evaluate nuanced quality:
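However the judge model arrives at its rating, the final pass/fail decision is just a threshold check. A sketch assuming the judge returns a normalized 0–1 score (the `JudgeVerdict` shape is an assumption for illustration):

```typescript
interface JudgeVerdict {
  score: number;       // assumed normalized to [0, 1]
  reasoning?: string;
}

// Gate a judged response on a minimum score.
function applyMinScore(verdict: JudgeVerdict, minScore: number) {
  return {
    passed: verdict.score >= minScore,
    message: `judge score ${verdict.score.toFixed(2)} (min: ${minScore})`,
  };
}
```

Declared in YAML: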
```yaml
tests:
  - input: "Explain quantum computing to a 5-year-old"
    expect:
      llm_judge:
        model: gpt-4o
        criteria: "Response should be simple, use analogies, avoid jargon"
        min_score: 0.8
        response_tone: "friendly"
```

Record agent interactions for replay and test generation:
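Conceptually, codegen maps each recorded interaction to a test that expects the same behavior. A sketch using a hypothetical trace-entry shape (the real `trace.json` format may differ):

```typescript
// Hypothetical trace entry: one recorded agent turn with its tool calls.
interface TraceEntry {
  input: string;
  toolCalls: { name: string }[];
}

// Sketch of codegen: derive a test asserting the same tools get called.
function testFromTrace(entry: TraceEntry) {
  return {
    input: entry.input,
    expect: { tools_called: entry.toolCalls.map((c) => c.name) },
  };
}
```

The CLI workflow: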
```shell
# Record a trace
agentprobe record -s agent.js -o trace.json

# Generate tests from trace
agentprobe codegen trace.json -o tests/generated/

# Replay a trace
agentprobe replay trace.json
```

Compare two test runs to spot regressions:
```shell
agentprobe diff run1.json run2.json
```

Output highlights:

- New failures
- New passes
- Latency changes
- Cost changes
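The first two of these reduce to comparing pass/fail status by test name across the two runs. A sketch (the `RunResult` shape is an assumption, not the tool's actual JSON schema):

```typescript
interface RunResult {
  name: string;
  passed: boolean;
  latencyMs: number;
}

// New failures: passed before, fail now. New passes: the reverse.
function diffRuns(oldRun: RunResult[], newRun: RunResult[]) {
  const old = new Map(oldRun.map((r): [string, RunResult] => [r.name, r]));
  const newFailures: string[] = [];
  const newPasses: string[] = [];
  for (const r of newRun) {
    const prev = old.get(r.name);
    if (!prev) continue; // test didn't exist in the old run
    if (prev.passed && !r.passed) newFailures.push(r.name);
    if (!prev.passed && r.passed) newPasses.push(r.name);
  }
  return { newFailures, newPasses };
}
```

Matching by test name rather than position keeps the diff stable when tests are reordered between runs.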
Track test results alongside code changes:
```shell
# Compare current results against main branch
agentprobe diff --git main

# Show test history for a file
agentprobe history tests/booking.test.yaml
```

Test complex agent-to-agent workflows:
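The core of such an evaluation is checking the observed chain of handoffs against the expected one. A sketch (whether `handoff_sequence` requires an exact match or only an ordered subsequence is an assumption; exact match is shown):

```typescript
// Check that observed agent handoffs exactly match the expected sequence.
function checkHandoffs(observed: string[], expected: string[]): boolean {
  return (
    observed.length === expected.length &&
    observed.every((agent, i) => agent === expected[i])
  );
}
```

With the built-in helper: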
```typescript
import { evaluateOrchestration } from '@neuzhou/agentprobe';

const result = await evaluateOrchestration({
  agents: ['planner', 'researcher', 'writer'],
  input: 'Write a blog post about AI testing',
  expect: {
    handoff_sequence: ['planner', 'researcher', 'writer'],
    max_total_steps: 20,
    final_agent: 'writer',
    output_contains: 'testing',
  },
});
```

Export test results to OpenTelemetry-compatible backends:
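For context, OTLP/HTTP collectors accept JSON at the standard `/v1/traces` path, which is why the endpoint below points at port 4318. A trimmed sketch of the standard OTLP payload shape (how agentprobe maps tests to spans is an assumption):

```typescript
// Minimal OTLP/HTTP JSON payload for one span (standard OTLP shape, trimmed).
// Timestamps are unix-epoch nanoseconds, serialized as strings per OTLP JSON.
function testResultToOtlp(testName: string, startNs: string, endNs: string) {
  return {
    resourceSpans: [
      {
        resource: {
          attributes: [{ key: "service.name", value: { stringValue: "agentprobe" } }],
        },
        scopeSpans: [
          {
            scope: { name: "agentprobe" },
            spans: [
              {
                name: testName,
                startTimeUnixNano: startNs,
                endTimeUnixNano: endNs,
                attributes: [],
              },
            ],
          },
        ],
      },
    ],
  };
}
// A payload like this would be POSTed to `${endpoint}/v1/traces`
// with Content-Type: application/json.
```

The corresponding config: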
```yaml
export:
  otel:
    endpoint: "http://localhost:4318"
    service_name: "agentprobe"
```

Or pass the endpoint on the command line:

```shell
agentprobe run tests/ --otel-endpoint http://localhost:4318
```

Register your own assertion types:
```typescript
import { registerAssertion } from '@neuzhou/agentprobe/custom-assertions';

registerAssertion('word_count', (response, config) => {
  // filter(Boolean) drops empty strings from leading/trailing whitespace
  const count = response.split(/\s+/).filter(Boolean).length;
  return {
    passed: count <= config.max,
    message: `Word count ${count} (max: ${config.max})`,
  };
});
```

Then use it like any built-in assertion:

```yaml
tests:
  - input: "Summarize this article"
    expect:
      word_count: { max: 100 }
```

Create `agentprobe.config.yaml` for project-wide defaults:
```yaml
adapter: openai
model: gpt-4o
timeout_ms: 30000
retries: 2
parallel: 4

security:
  scan_all: true

compliance:
  frameworks: [gdpr]

reporters:
  - console
  - { type: junit, output: results.xml }

export:
  otel:
    endpoint: http://localhost:4318
```
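Config-file values typically act as defaults that more specific settings override (per-test options, then CLI flags). A sketch of that precedence as a shallow merge, where later sources win (the exact precedence order is an assumption about common CLI convention, not documented agentprobe behavior):

```typescript
interface Settings {
  adapter?: string;
  model?: string;
  timeout_ms?: number;
  retries?: number;
}

// Merge settings sources; later sources override earlier ones.
function resolveSettings(...sources: Settings[]): Settings {
  return Object.assign({}, ...sources);
}

const fileDefaults: Settings = { adapter: "openai", model: "gpt-4o", timeout_ms: 30000 };
const cliFlags: Settings = { model: "gpt-4o-mini" };
// model resolves to "gpt-4o-mini"; adapter and timeout_ms keep the file defaults.
const effective = resolveSettings(fileDefaults, cliFlags);
```

A shallow merge is shown for brevity; nested blocks like `export.otel` would need a deep merge to combine per-key.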