Skip to content

Latest commit

 

History

History
758 lines (594 loc) · 18.7 KB

File metadata and controls

758 lines (594 loc) · 18.7 KB

AgentProbe API Reference

For getting started, installation, and overview, see the README.

Table of Contents


Core API

runSuite(suitePath: string, options?: RunOptions): Promise<SuiteResult>

Run a YAML test suite and return results.

import { runSuite } from 'agentprobe';

const results = await runSuite('tests/agent.yaml', {
  tags: ['smoke'],
  updateSnapshots: false,
  envFile: '.env.test',
});

console.log(`${results.passed}/${results.total} passed`);
process.exit(results.failed > 0 ? 1 : 0);

Options:

Field Type Description
tags string[] Filter tests by tags
group string Filter by group name
updateSnapshots boolean Update snapshot files
envFile string Load env vars from file

evaluate(trace: AgentTrace, expect: Expectations): AssertionResult[]

Evaluate a trace against expectations programmatically.

import { evaluate, loadTrace } from 'agentprobe';

const trace = loadTrace('traces/weather.json');
const results = evaluate(trace, {
  tool_called: 'web_search',
  output_contains: 'Tokyo',
  max_steps: 10,
  max_cost_usd: 0.05,
});

for (const r of results) {
  console.log(`${r.passed ? '✅' : '❌'} ${r.name}`);
}

loadTrace(filePath: string): AgentTrace

Load a trace JSON file.

report(result: SuiteResult, format: ReportFormat, theme?: string): string

Format results as console, json, markdown, html, or junit.


Assertions

All assertions available in YAML expect: blocks and via evaluate().

Basic Assertions

Assertion Type Description
tool_called string | string[] Tool(s) must have been called
tool_not_called string | string[] Tool(s) must NOT have been called
output_contains string | string[] Output must contain substring(s)
output_not_contains string | string[] Output must NOT contain substring(s)
output_matches string Output must match regex
max_steps number Maximum trace steps allowed
max_tokens number Maximum total tokens
max_duration_ms number Maximum execution time
max_cost_usd number Maximum estimated cost in USD
snapshot boolean Snapshot testing (like Jest)

Advanced Assertions

Assertion Type Description
tool_sequence string[] Tools must be called in this order
tool_args_match Record<string, any> Verify tool call arguments
chain ChainStep[] Sequential step validation
custom string Custom JS expression
judge JudgeSpec LLM-as-Judge evaluation
judge_rubric RubricCriterion[] Multi-criteria LLM judging

Composed Assertions

Assertion Type Description
all_of Expectations[] ALL must pass (AND)
any_of Expectations[] At least one must pass (OR)
none_of Expectations[] NONE must pass (NOR)
not Expectations Negate an expectation

YAML Examples

tests:
  # Basic: tool must be called, output checked
  - name: Weather lookup
    input: "What's the weather in Tokyo?"
    trace: traces/weather.json
    expect:
      tool_called: web_search
      output_contains: Tokyo
      max_steps: 10

  # Tool sequence validation
  - name: Research agent flow
    input: "Research quantum computing"
    trace: traces/research.json
    expect:
      tool_sequence: [web_search, summarize, write_file]

  # Composed assertions
  - name: Either search or browse
    input: "Find latest news"
    trace: traces/news.json
    expect:
      any_of:
        - tool_called: web_search
        - tool_called: browse_url

  # Chain assertion
  - name: Multi-step workflow
    input: "Analyze this data"
    trace: traces/analysis.json
    expect:
      chain:
        - tool_called: read_file
        - tool_called: analyze
          output_contains: "result"

  # Parameterized tests
  - name: "Weather in ${city}"
    input: "Weather in ${city}?"
    trace: traces/weather.json
    each:
      - city: Tokyo
      - city: London
      - city: "New York"
    expect:
      tool_called: web_search
      output_contains: "${city}"

  # With tags and dependencies
  - name: Setup data
    id: setup
    input: "Create test file"
    trace: traces/setup.json
    tags: [setup]
    expect:
      tool_called: write_file

  - name: Process data
    input: "Process the test file"
    trace: traces/process.json
    depends_on: setup
    tags: [integration]
    expect:
      tool_called: read_file

Mocking

class MockToolkit

Jest-style tool mocking for agent tests.

import { MockToolkit } from 'agentprobe';

const mocks = new MockToolkit();

// Basic mock
const searchMock = mocks.mock('web_search', (args) => ({
  results: [{ title: 'Test', url: 'https://example.com' }],
}));

// Return fixed value once
mocks.mockOnce('get_weather', { temp: 72, unit: 'F' });

// Return values in sequence
mocks.mockSequence('database_query', [
  { rows: [] },
  { rows: [{ id: 1, name: 'test' }] },
]);

// Mock to throw an error
mocks.mockError('failing_api', 'Service unavailable');

// Call a mock
const result = mocks.call('web_search', { query: 'test' });

// Inspect calls
console.log(searchMock.callCount);       // 1
console.log(searchMock.calls[0].args);   // { query: 'test' }

// Check if mock exists
mocks.has('web_search');  // true

// Reset all mocks
mocks.reset();

YAML integration:

tests:
  - name: With mocked tools
    input: "Search for cats"
    mocks:
      web_search:
        results: [{ title: "Cats", url: "https://cats.com" }]
    expect:
      tool_called: web_search
      output_contains: Cats

Fault Injection

class FaultInjector

Chaos engineering for AI agents — inject failures to test resilience.

import { FaultInjector } from 'agentprobe';

const injector = new FaultInjector({
  web_search: { type: 'error', message: 'API rate limited' },
  database:   { type: 'timeout', delay_ms: 30000 },
  weather:    { type: 'slow', delay_ms: 5000 },
  parser:     { type: 'corrupt' },
  flaky_api:  { type: 'error', message: 'Intermittent', probability: 0.3 },
});

// Check if fault applies
injector.shouldInject('web_search');  // true (probability=1.0)

// Wrap a tool call
try {
  const result = await injector.wrapToolCall('web_search', async () => {
    return fetch('https://api.search.com/search');
  });
} catch (err) {
  // FaultInjectionError: API rate limited
  console.log(err.toolName, err.faultConfig);
}

Fault types:

Type Description
error Throws FaultInjectionError with custom message
timeout Delays then throws timeout error
slow Adds delay but returns normally
corrupt Returns { corrupted: true, original: null }

YAML integration:

tests:
  - name: Handles API failure gracefully
    input: "Search for data"
    faults:
      web_search:
        type: error
        message: "503 Service Unavailable"
    expect:
      output_contains: "sorry"
      tool_not_called: exec

LLM-as-Judge

Use an LLM to evaluate output quality with caching.

Single Criterion

tests:
  - name: Helpful response
    input: "Explain quantum computing"
    trace: traces/quantum.json
    expect:
      judge:
        criteria: "Response is accurate, clear, and educational"
        model: gpt-4o-mini    # optional, default: gpt-4o-mini
        threshold: 0.7         # optional, default: 0.7

Multi-Criteria Rubric

tests:
  - name: Quality response
    input: "Write a summary of AI trends"
    trace: traces/summary.json
    expect:
      judge_rubric:
        - criterion: "Factually accurate"
          weight: 3
        - criterion: "Well-structured"
          weight: 2
        - criterion: "Concise"
          weight: 1

Programmatic API

import { judgeOutput, judgeWithRubric } from 'agentprobe/judge';

const result = await judgeOutput(output, {
  criteria: 'Response is helpful and accurate',
  model: 'gpt-4o-mini',
  threshold: 0.7,
});
// { passed: true, score: 0.85, reasoning: '...', model: 'gpt-4o-mini', cached: false }

const rubricResult = await judgeWithRubric(output, {
  rubric: [
    { criterion: 'Accuracy', weight: 3 },
    { criterion: 'Clarity', weight: 2 },
  ],
  threshold: 0.7,
});
// { passed: true, overallScore: 0.82, scores: [...], cached: false }

Results are cached in .agentprobe-cache/ to avoid repeated LLM calls.


Recorder

Record live agent executions as traces.

class Recorder

import { Recorder } from 'agentprobe';

const recorder = new Recorder({ recorded_at: new Date().toISOString() });

// Auto-patch OpenAI/Anthropic SDKs
recorder.patchOpenAI(require('openai'));
recorder.patchAnthropic(require('@anthropic-ai/sdk'));

// Run your agent
await myAgent.run('What is the weather?');

// Save the trace
recorder.save('traces/weather.json');

Trace Adapters

Auto-convert traces from other formats.

Built-in Adapters

Adapter Detects
openai OpenAI chat completions
anthropic Anthropic message format
langchain LangChain run traces
gemini Google Gemini responses
ollama Ollama completions
openai-compatible OpenAI-compatible APIs
openclaw OpenClaw native format
crewai CrewAI multi-agent task traces
autogen Microsoft AutoGen conversation traces
mcp Model Context Protocol tool call traces
generic Generic event logs

Auto-Detection

import { autoConvert, convertWith } from 'agentprobe/adapters';

// Auto-detect format
const trace = autoConvert(rawData);

// Explicit format
const trace2 = convertWith('openai', rawOpenAIData);

Writing a Custom Adapter

import type { TraceAdapter, AgentTrace } from 'agentprobe';

const myAdapter: TraceAdapter = {
  name: 'my-platform',
  detect: (input) => !!input?.my_platform_version,
  convert: (input): AgentTrace => ({
    id: input.id,
    timestamp: input.created_at,
    steps: input.events.map(e => ({
      type: e.kind === 'tool' ? 'tool_call' : 'llm_call',
      timestamp: e.timestamp,
      data: { tool_name: e.name, content: e.output },
      duration_ms: e.duration,
    })),
    metadata: { source: 'my-platform' },
  }),
};

Security Testing

generateSecurityTests(options?: { categories?: string[] }): TestCase[]

Generate built-in security test suites.

import { generateSecurityTests } from 'agentprobe';

const tests = generateSecurityTests({
  categories: ['injection', 'exfiltration', 'privilege', 'harmful'],
});

Categories:

  • injection — Prompt injection attempts
  • exfiltration — Data leak attempts
  • privilege — Privilege escalation
  • harmful — Harmful content generation

Plugin System

Configuration (.agentproberc.yml)

plugins:
  - agentprobe-plugin-cache
  - agentprobe-plugin-cost-tracker
  - ./my-local-plugin

adapters:
  default: openai

profiles:
  staging:
    model: gpt-4o-mini
    adapter: openai
    env:
      API_KEY: staging-key
    timeout_ms: 30000

  production:
    model: gpt-4o
    adapter: openai

ci:
  fail_on_regression: true

Built-in Plugins

Plugin Description
cache Response caching for faster reruns
cost-tracker Track API costs per test
coverage Tool coverage analysis
retry Auto-retry flaky tests

Reporters

Available Formats

Format Output
console Colorized terminal output
json Machine-readable JSON
markdown Markdown table
html Interactive HTML report with themes
junit JUnit XML for CI integration

HTML Themes

Theme Description
dark Dark mode dashboard
corporate Clean enterprise look
minimal Minimal styling
agentprobe run tests.yaml -f html --theme dark -o report.html

Types

Core Types

interface AgentTrace {
  id: string;
  timestamp: string;
  steps: TraceStep[];
  metadata: Record<string, any>;
}

type StepType = 'llm_call' | 'tool_call' | 'tool_result' | 'thought' | 'output';

interface TraceStep {
  type: StepType;
  timestamp: string;
  data: {
    model?: string;
    messages?: Message[];
    tool_name?: string;
    tool_args?: Record<string, any>;
    tool_result?: any;
    content?: string;
    tokens?: { input?: number; output?: number };
  };
  duration_ms?: number;
}

interface TestSuite {
  name: string;
  description?: string;
  config?: TestConfig;
  hooks?: SuiteHooks;
  tests: TestCase[];
}

interface SuiteResult {
  name: string;
  passed: number;
  failed: number;
  total: number;
  duration_ms: number;
  results: TestResult[];
}

interface TestResult {
  name: string;
  passed: boolean;
  assertions: AssertionResult[];
  duration_ms: number;
  trace?: AgentTrace;
  error?: string;
  tags?: string[];
  skipped?: boolean;
  attempts?: number;
}

interface AssertionResult {
  name: string;
  passed: boolean;
  expected?: any;
  actual?: any;
  message?: string;
}

type ReportFormat = 'console' | 'json' | 'markdown' | 'html' | 'junit';

CLI Reference

agentprobe <command> [options]

Core Commands

Command Description
run <suite...> Run test suite(s) from YAML files
record Record an agent execution trace
replay <trace> Replay and display a trace
init Interactive project scaffolding
validate <file> Validate suite YAML or trace JSON

Test Generation

Command Description
codegen <trace> Generate YAML tests from a trace
generate <description> Generate tests from natural language
generate-security Generate security test suite
generate-from-openapi <spec> Generate tests from OpenAPI spec
gen-from-docs <file> Generate tests from API docs

Trace Operations

Command Description
trace view <file> Visual trace inspection
trace timeline <file> Gantt-style timeline
trace diff <old> <new> Compare two traces
trace compare <a> <b> Side-by-side comparison
trace merge <files...> Merge multiple traces
trace anonymize <file> Remove sensitive data
trace validate <file> Validate trace format
trace export <file> Export to OTel/LangSmith/CSV
trace enrich <dir> Auto-enrich traces with metadata
trace otel <file> Export as OTLP JSON

Analysis

Command Description
stats <dir> Summary statistics from traces
profile <dir> Performance profiling
search <query> <dir> Search across traces
suggest <trace> AI-suggest tests to write
fingerprint <dir> Behavioral fingerprinting
safety-score <dir> Compute agent safety score
coverage-map <suite> Visualize capability coverage
behavior-profile <dir> Profile behavior patterns

Regression & Comparison

Command Description
baseline save <suite> Save baseline for regression detection
baseline compare <suite> Compare against baseline
regression add <suite> Add labeled regression snapshot
regression compare <a> <b> Compare two snapshots
diff <old> <new> Compare two run reports
agent-diff --v1 <dir> --v2 <dir> Compare agent versions
perf-check Detect performance regressions

Testing Utilities

Command Description
mutate <suite> Mutation testing
flaky <suite> Flaky test detection
ab-test A/B test two models
chaos <test> Chaos testing
matrix <suite> Multi-model/temperature matrix
simulate Generate synthetic traces
replay-verify <trace> Deterministic replay verification
contract <contract> <trace> Verify trace against contract

CI/CD & DevOps

Command Description
ci github-actions Generate GitHub Actions workflow
ci gitlab Generate GitLab CI pipeline
ci azure-pipelines Generate Azure Pipelines YAML
ci circleci Generate CircleCI config
badge Generate shields.io-style badge

Dashboards & Reports

Command Description
portal <dir> Static HTML test dashboard
studio Interactive test studio
health Check adapter connectivity
health-dashboard Health monitoring dashboard
governance Fleet governance dashboard
compliance-report Regulated industry compliance
compliance-audit <dir> Audit against GDPR/SOC2/HIPAA/PCI-DSS

run Options

Flag Description Default
-f, --format <fmt> Output format: console, json, markdown, html, junit console
-o, --output <path> Write results to file
-w, --watch Watch mode false
-u, --update-snapshots Update snapshot files false
-t, --tag <tags...> Filter by tags
-g, --group <name> Filter by group
--coverage Show tool coverage false
--tools <tools...> Declared tools for coverage
--compare-baseline Compare against baseline false
--env-file <path> Load .env file
--badge <path> Generate badge SVG
--profile <name> Use environment profile
--theme <name> HTML theme: dark, corporate, minimal
-r, --recursive Find YAML files recursively false

Examples

# Run tests
agentprobe run tests.yaml
agentprobe run tests/ -r --coverage --tools search browse write

# Watch mode
agentprobe run tests.yaml -w

# Generate from trace
agentprobe record -s agent.js -o trace.json
agentprobe codegen trace.json -o tests/generated.yaml

# Security testing
agentprobe generate-security -o tests/security.yaml
agentprobe run tests/security.yaml --strict

# Performance analysis
agentprobe profile traces/
agentprobe stats traces/ --detailed

# CI integration
agentprobe ci github-actions
agentprobe run tests.yaml -f junit -o results.xml --badge badge.svg

# Compare runs
agentprobe baseline save tests.yaml
agentprobe run tests.yaml --compare-baseline

# Interactive setup
agentprobe init
agentprobe build