A command-line tool to analyze KubeVirt CI failures using two complementary data sources and analysis approaches.
This tool provides two distinct analysis commands that use different data sources.
The merge command:
- Data Source: Uses pre-aggregated failure data from the kubevirt/ci-health JSON API
- Coverage: Analyzes failures across all merge-time jobs (main branch, release branches)
- Time Range: Limited to the data available in ci-health (typically recent failures)
- Performance: Fast - processes pre-computed aggregations
- Use Case: Quick overview of current CI health across all job types
The lane command:
- Data Source: Crawls live Prow web pages and fetches individual job artifacts from multiple sources (presubmit, batch, periodic)
- Coverage: Analyzes any specific job lane in real-time, including batch jobs
- Time Range: Flexible - can go back weeks/months with automatic pagination
- Performance: Slower - fetches and parses individual job data on-demand
- Use Case: Deep dive analysis of specific job lanes with historical data
- Job Types: Supports presubmit, batch, periodic, and postsubmit jobs with per-type failure statistics
go build
./healthcheck --help
Analyze recent job runs for a specific CI lane by crawling live Prow web pages and artifacts. Provides real-time data with flexible time ranges and automatic pagination.
# Analyze recent runs for a specific job
$ healthcheck lane pull-kubevirt-unit-test-arm64
# Limit to specific number of runs (default: 10)
$ healthcheck lane pull-kubevirt-e2e-k8s-1.32-sig-compute --limit 20
# Count test failures across runs
$ healthcheck lane pull-kubevirt-unit-test-arm64 --limit 5 -c
2 VirtualMachineInstance migration target DomainNotifyServerRestarts should establish a notify server pipe should be resilient to notify server restarts
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15455/pull-kubevirt-unit-test-arm64/1958202806657617920
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15447/pull-kubevirt-unit-test-arm64/1958193812496977920
1 Migration watcher Migration backoff should not be applied if it is not an evacuation with workload update annotation
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15388/pull-kubevirt-unit-test-arm64/1958193968416034816
# Show only test names
$ healthcheck lane pull-kubevirt-unit-test-arm64 --limit 3 -n
VirtualMachineInstance migration target DomainNotifyServerRestarts should establish a notify server pipe should be resilient to notify server restarts
Migration watcher Migration backoff should not be applied if it is not an evacuation with workload update annotation
# Show only failed job URLs
$ healthcheck lane pull-kubevirt-unit-test-arm64 --limit 3 -u
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15455/pull-kubevirt-unit-test-arm64/1958202806657617920
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15388/pull-kubevirt-unit-test-arm64/1958193968416034816
# Show failure details with context
$ healthcheck lane pull-kubevirt-unit-test-arm64 --limit 3 -f
VirtualMachineInstance migration target DomainNotifyServerRestarts should establish a notify server pipe should be resilient to notify server restarts
goroutine 1847 [running]:
testing.tRunner.func1.2({0x2b2e5a0, 0xc001638690})
/opt/hostedtoolcache/go/1.21.13/x64/lib/go/src/testing/testing.go:1631 +0x2ff
...
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15455/pull-kubevirt-unit-test-arm64/1958202806657617920
# Output structured JSON data for machine processing
$ healthcheck lane pull-kubevirt-unit-test-arm64 --limit 3 --output json
{
  "job_name": "pull-kubevirt-unit-test-arm64",
  "all_failures": [
    {
      "Name": "VirtualMachineInstance migration target DomainNotifyServerRestarts...",
      "URL": "https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/...",
      "Failure": "goroutine 1847 [running]:\ntesting.tRunner.func1.2..."
    }
  ]
}
# JSON output with count mode
$ healthcheck lane pull-kubevirt-unit-test-arm64 --limit 10 -c --output json
{
  "job_name": "pull-kubevirt-unit-test-arm64",
  "test_failures": {
    "Test Name 1": [
      {"Name": "Test Name 1", "URL": "...", "Failure": "..."},
      {"Name": "Test Name 1", "URL": "...", "Failure": "..."}
    ]
  }
}
The --since flag automatically paginates to find all results within the time period, ignoring --limit.
# Find all failures in the last hour
$ healthcheck lane pull-kubevirt-unit-test-arm64 --since 1h -c
1 VirtualMachineInstance migration target DomainNotifyServerRestarts should establish a notify server pipe should be resilient to notify server restarts
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15455/pull-kubevirt-unit-test-arm64/1958202806657617920
# Analyze longer time periods - automatically finds all results
$ healthcheck lane pull-kubevirt-unit-test-arm64 --since 2d --summary
Lane Summary: pull-kubevirt-unit-test-arm64
===========================================
Time Range:
First Run: 2025-08-18 19:03:13 UTC
Last Run: 2025-08-20 16:22:12 UTC
Duration: 1.9 days
Test Run Statistics:
Total Runs: 92
Successful: 62
Failed: 15
Unknown: 15
Failure Rate: 16.3%
Test Failure Statistics:
Total Failures: 78
Unique Tests: 70
Failure Categories:
migration : 8 (10.3%)
general : 3 (3.8%)
storage : 2 (2.6%)
Most Frequent Failures:
1. [migration] VirtualMachineInstance migration target DomainNotifyServe... (8 failures, 10.3%)
2. [general] VirtualMachineInstance watcher On valid VirtualMachineIns... (2 failures, 2.6%)
3. [storage] VirtualMachineInstance watcher On valid VirtualMachineIns... (1 failures, 1.3%)
Pattern Analysis:
🟢 Very low failure rate - stable
🔀 Diverse failure patterns - no clear dominant issue
# Time period examples
$ healthcheck lane pull-kubevirt-e2e-k8s-1.32-sig-compute --since 6h # Last 6 hours
$ healthcheck lane pull-kubevirt-unit-test-arm64 --since 3d # Last 3 days
$ healthcheck lane pull-kubevirt-e2e-k8s-1.31-sig-storage --since 1w # Last week
# Get comprehensive failure pattern analysis
$ healthcheck lane pull-kubevirt-unit-test-arm64 --limit 25 --summary
Lane Summary: pull-kubevirt-unit-test-arm64
===========================================
Time Range:
First Run: 2025-08-20 09:48:24 UTC
Last Run: 2025-08-20 16:22:12 UTC
Duration: 6.6 hours
Test Run Statistics:
Total Runs: 25
Successful: 16
Failed: 7
Running: 2
Failure Rate: 28.0%
Job Types:
presubmit : 20 (80.0%, 30.0% failure rate)
batch : 5 (20.0%, 20.0% failure rate)
Individual Runs:
1. ✓ SUCCESS [presubmit]
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/...
2. ⋯ PENDING [presubmit]
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/...
3. ✗ FAILURE [presubmit] - 2 failure(s)
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/...
... (truncated for brevity)
Failure Analysis:
Total Failures: 28
Unique Tests: 25
Infrastructure: 42.9% of all failures
Failure Categories:
migration : 4 (14.3%)
general : 1 (3.6%)
storage : 3 (10.7%)
Most Frequent Failures:
1. [migration] VirtualMachineInstance migration target DomainNotifyServe... (4 failures, 14.3%)
2. [general] VirtualMachineInstance watcher On valid VirtualMachineIns... (1 failures, 3.6%)
3. [storage] VirtualMachineInstance watcher Aggregating DataVolume con... (1 failures, 3.6%)
Pattern Analysis:
🟠 Low failure rate - normal fluctuation
🔀 Diverse failure patterns - no clear dominant issue
The lane command now supports filtering by job type to analyze specific CI categories:
# Filter by batch jobs only
$ healthcheck lane pull-kubevirt-e2e-k8s-1.34-sig-compute-arm64 --type batch --summary -s 7d
Lane Summary: pull-kubevirt-e2e-k8s-1.34-sig-compute-arm64
==========================================================
Test Run Statistics:
Total Runs: 19
Successful: 16
Failed: 1
Running: 2
Failure Rate: 5.3%
Job Types:
batch : 19 (100.0%, 5.3% failure rate)
# Filter by presubmit jobs only
$ healthcheck lane pull-kubevirt-e2e-k8s-1.34-sig-compute-arm64 --type presubmit --summary -s 7d
# Filter by periodic or postsubmit jobs
$ healthcheck lane periodic-kubevirt-e2e-k8s-1.32-sig-network --type periodic --summary
This enables comparison of failure rates between different job types and helps identify whether certain types (e.g., batch vs presubmit) have different failure characteristics.
Lane summaries now include per-job-type statistics showing both distribution and failure rates:
$ healthcheck lane pull-kubevirt-e2e-k8s-1.34-sig-compute-arm64 --summary -s 7d
...
Test Run Statistics:
Total Runs: 143
Successful: 116
Failed: 6
Aborted: 15
Running: 4
Unknown: 2
Failure Rate: 16.1%
Job Types:
presubmit : 122 (85.3%, 16.4% failure rate)
batch : 19 (13.3%, 5.3% failure rate)
This helps identify which job types are most stable and which need attention. Note: Pending/running jobs are now correctly excluded from failure statistics.
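The per-job-type figures in a lane summary can be reproduced from a list of runs. The following is only a sketch: the state names, the tuple layout, and the choice of "failed runs divided by all runs of that type" as the failure rate are assumptions picked to match the batch sample above (1 failure in 19 runs, 5.3%), not the tool's confirmed computation. Running jobs are counted as runs but never as failures.

```python
from collections import Counter


def job_type_stats(runs):
    """runs: list of (job_type, state) tuples (hypothetical layout).

    Returns {job_type: (run_count, share_percent, failure_percent)}.
    """
    totals = Counter(job_type for job_type, _ in runs)
    failed = Counter(job_type for job_type, state in runs if state == "failure")
    overall = len(runs)
    return {
        job_type: (
            count,
            round(100 * count / overall, 1),           # share of all runs
            round(100 * failed[job_type] / count, 1),  # assumed: failed / runs of this type
        )
        for job_type, count in totals.items()
    }


# Data shaped like the batch sample: 16 successful, 1 failed, 2 still running.
runs = [("batch", "success")] * 16 + [("batch", "failure")] + [("batch", "running")] * 2
print(job_type_stats(runs)["batch"])  # (19, 100.0, 5.3)
```

Feeding in runs of several types gives the distribution and the per-type rate side by side, which is the comparison the summary is meant to support.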
Analyze test failures across all merge-time jobs using pre-computed data from the ci-health project. Fast analysis of current CI health trends.
# Filter by job name or alias (now uses positional argument)
$ healthcheck merge compute # sig-compute jobs
$ healthcheck merge "sig-compute.*arm64" # ARM64 compute jobs (custom regex)
$ healthcheck merge network # sig-network jobs
$ healthcheck merge "1.6" # release-1.6 jobs
$ healthcheck merge main # main branch jobs
# Available job aliases:
# - main: main branch jobs
# - compute: sig-compute related jobs
# - network: sig-network jobs
# - storage: sig-storage jobs
# - 1.6, 1.5, 1.4: release branch jobs
# Count failures by test name
$ healthcheck merge compute -c
3 [sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates
https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15098/pull-kubevirt-e2e-k8s-1.32-sig-compute/1944655730044833792
https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15182/pull-kubevirt-e2e-k8s-1.31-sig-compute/1945105449749581824
2 [virtctl] [crit:medium][vendor:cnv-qe@redhat.com][level:component][sig-compute] usbredir Should work several times
https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15110/pull-kubevirt-e2e-k8s-1.32-sig-compute/1943363976574275584
# Show only test names for external processing
$ healthcheck merge compute -n | head -5
[sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates
[sig-compute] Infrastructure cluster profiler for pprof data aggregation when ClusterProfiler configuration is enabled it should allow subresource access
[virtctl] [crit:medium][vendor:cnv-qe@redhat.com][level:component][sig-compute] usbredir Should work several times
[sig-compute]VirtualMachinePool pool should scale to five, to six and then to zero replicas
[sig-compute] [rfe_id:1177][crit:medium] VirtualMachine with paused vmi [test_id:3229]should gracefully handle being started again
# Show only URLs for browser opening
$ healthcheck merge compute -u | head -3
https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15098/pull-kubevirt-e2e-k8s-1.32-sig-compute/1944655730044833792
https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15182/pull-kubevirt-e2e-k8s-1.31-sig-compute/1945105449749581824
https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15122/pull-kubevirt-e2e-k8s-1.33-sig-compute/1943094557549793280
# Show failure context and stack traces
$ healthcheck merge compute -c -f
3 [sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates
Failure tests/pool_test.go:701
Expected
<int>: 3
to equal
<int>: 4
tests/pool_test.go:760
https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15098/pull-kubevirt-e2e-k8s-1.32-sig-compute/1944655730044833792
# Output structured JSON data for machine processing
$ healthcheck merge compute --output json
{
  "failed_tests": {
    "[sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates": [
      {
        "Name": "[sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates",
        "URL": "https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/...",
        "Failure": {
          "Message": "",
          "Type": "Failure",
          "Value": "Failure tests/pool_test.go:701\nExpected..."
        }
      }
    ]
  },
  "lane_run_failures": {...}
}
# JSON output with count mode
$ healthcheck merge compute -c --output json
{
  "test_failure_counts": {
    "[sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates": 3,
    "[virtctl] usbredir Should work several times": 2
  },
  "failed_tests": {...}
}
# Group by lane run UUID for failure correlation
$ healthcheck merge compute --lane-run
Lane Run 1944655730044833792 (3 failures)
[sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates
https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15098/pull-kubevirt-e2e-k8s-1.32-sig-compute/1944655730044833792
# Highlight quarantined tests
$ healthcheck merge compute -c --quarantine
2 [QUARANTINED] [sig-compute] should include VMI infos for a running VM
https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15098/pull-kubevirt-e2e-k8s-1.32-sig-compute/1944655730044833792
# Time filtering (limited to available ci-health data - typically last ~48 hours)
$ healthcheck merge compute --since 2d # Filter by time period
Search for test failures across all CI jobs using the search.ci.kubevirt.io service. Useful for investigating recurring failures, understanding failure patterns, and finding related issues.
# Search for a specific test failure
$ healthcheck search "Operator should reconcile components"
# Show a concise summary with job breakdown
$ healthcheck search "migration" --summary
# Search within the last 7 days
$ healthcheck search "timeout" --max-age 168h
# Filter to only compute jobs
$ healthcheck search "VMI" --job ".*compute.*"
# Count matches per job
$ healthcheck search "network" -c
pull-kubevirt-e2e-k8s-1.32-sig-network: 5 matches
pull-kubevirt-e2e-k8s-1.34-sig-network: 3 matches
# Show only test names
$ healthcheck search "migration" -n
[sig-compute] VirtualMachineInstance migration target should ...
[sig-compute] VirtualMachineInstance migration should migrate ...
# Show only job URLs
$ healthcheck search "storage" -u
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/...
# Show failure context lines
$ healthcheck search "deadline exceeded" -f
# Output structured JSON data for machine processing
$ healthcheck search "compute" --output json
{
  "query": "compute",
  "time_range": "336h",
  "total_matches": 40,
  "total_jobs": 10,
  "job_stats": { ... },
  "job_results": [ ... ]
}
# Include overall statistics (total runs, failure rates)
$ healthcheck search "migration" --summary --stats
# Get the web URL for viewing results in browser
$ healthcheck search "operator" --summary -w
Web URL: https://search.ci.kubevirt.io/?search=operator&...
# Default is 14 days (336h)
$ healthcheck search "timeout"
# Search the last 24 hours
$ healthcheck search "panic" --since 24h
# Search the last 7 days
$ healthcheck search "disk" --max-age 168h
# Search the last 48 hours
$ healthcheck search "eviction" --since 2d
# Filter to a specific job family
$ healthcheck search "VMI" --job "periodic.*"
# Exclude ARM64 jobs
$ healthcheck search "disk" --exclude-job ".*arm64.*"
# Combine filters
$ healthcheck search "migration" --job ".*compute.*" --exclude-job ".*arm64.*"
# Quick overview of current failures across all jobs (ci-health data)
$ healthcheck merge compute -c | head -10
# Deep dive into a specific failing job with historical context (live Prow data)
$ healthcheck lane pull-kubevirt-e2e-k8s-1.32-sig-compute --since 24h --summary
# Open all failure URLs in browser tabs
$ healthcheck merge compute -u | sort | uniq | xargs google-chrome
# Compare failure rates over different time periods (live Prow data)
$ healthcheck lane pull-kubevirt-unit-test-arm64 --since 24h --summary
$ healthcheck lane pull-kubevirt-unit-test-arm64 --since 1w --summary
# Identify most frequent failures across all jobs (ci-health data)
$ healthcheck merge -n | sort | uniq -c | sort -rn | head -10
# Find all occurrences of a specific test failure
$ healthcheck merge -n | grep -i "migration"
# Get failure context for debugging
$ healthcheck merge compute -f | grep -A5 -B5 "timeout"
# Analyze quarantined tests
$ healthcheck merge --quarantine -c
# Export failure data as JSON for further processing
$ healthcheck merge compute --output json > compute_failures.json
# Export lane analysis as JSON for trending tools
$ healthcheck lane pull-kubevirt-unit-test-arm64 --since 7d --summary --output json > lane_trend.json
# Use JSON output with jq for advanced filtering
$ healthcheck merge -c --output json | jq '.test_failure_counts | to_entries[] | select(.value > 5)'
# Export specific failure URLs for automated issue creation
$ healthcheck merge storage -u --output json | jq -r '.urls[]'
# Get test names for automated quarantine decisions
$ healthcheck lane pull-kubevirt-e2e-k8s-1.32-sig-compute --since 3d -c --output json | jq -r '.test_failures | keys[]'
# Monitor overall health of different job categories (live Prow data with historical context)
$ healthcheck lane pull-kubevirt-e2e-k8s-1.32-sig-compute --since 1d --summary
$ healthcheck lane pull-kubevirt-e2e-k8s-1.32-sig-network --since 1d --summary
$ healthcheck lane pull-kubevirt-e2e-k8s-1.32-sig-storage --since 1d --summary
# Track specific job stability over time (weeks of historical data)
$ healthcheck lane pull-kubevirt-unit-test-arm64 --since 1w --summary
lane command flags:
- --limit, -l: Number of recent runs to analyze (ignored when --since is used)
- --since, -s: Fetch all results within time period (e.g., 24h, 2d, 1w) with automatic pagination
- --type, -t: Filter jobs by type (e.g., batch, presubmit, periodic, postsubmit)
- --count, -c: Count specific test failures
- --url, -u: Display only failed job URLs
- --name, -n: Display only failed test names
- --failures, -f: Print captured failure context
- --summary: Display concise summary with failure patterns and statistics (includes per-job-type failure rates)
- --output, -o: Output format - "text" (default) or "json" for structured data
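The lane command's count-mode JSON (--count together with --output json) maps each test name to a list of failure records, as in the sample output earlier. A small sketch of how that structure could be ranked in a script follows; the function name and the example.invalid URLs are hypothetical, and only the test_failures, Name, URL, and Failure keys come from the sample.

```python
import json


def rank_lane_failures(report_json):
    """Return (test name, failure count, unique URLs) tuples, most frequent first."""
    failures = json.loads(report_json).get("test_failures", {})
    ranked = [
        (name, len(entries), sorted({entry["URL"] for entry in entries}))
        for name, entries in failures.items()
    ]
    # Sort by descending count, then by name for a stable order.
    return sorted(ranked, key=lambda row: (-row[1], row[0]))


# Hypothetical report in the shape of the count-mode sample output.
sample = json.dumps({
    "job_name": "pull-kubevirt-unit-test-arm64",
    "test_failures": {
        "Test Name 2": [
            {"Name": "Test Name 2", "URL": "https://example.invalid/run/2", "Failure": "..."},
        ],
        "Test Name 1": [
            {"Name": "Test Name 1", "URL": "https://example.invalid/run/1", "Failure": "..."},
            {"Name": "Test Name 1", "URL": "https://example.invalid/run/2", "Failure": "..."},
        ],
    },
})

for name, count, urls in rank_lane_failures(sample):
    print(count, name, len(urls))
```

In practice the JSON string would come from the command's standard output rather than from an inline sample.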
merge command flags:
- [job-name-or-alias]: Required positional argument - job regex or alias (compute, network, storage, main, 1.6, 1.5, 1.4)
- --test, -t: Filter by test name regex
- --count, -c: Count specific test failures
- --url, -u: Display only failure URLs
- --name, -n: Display only test names
- --failures, -f: Print captured failure context
- --lane-run, -l: Group failures by lane run UUID
- --quarantine, -q: Highlight quarantined tests
- --since, -s: Filter results by time period (limited to available ci-health data ~48h)
- --summary: Display a concise summary of failures and patterns
- --output, -o: Output format - "text" (default) or "json" for structured data
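The merge command's count-mode JSON exposes a test_failure_counts object, which the jq example earlier filters with select(.value > 5). The same filter can be sketched in Python for scripts that need more than a one-liner; the function name, the threshold default, and the sample data are illustrative assumptions.

```python
import json


def frequent_failures(report_json, threshold=5):
    """Return (test name, count) pairs with count > threshold, most frequent first."""
    counts = json.loads(report_json).get("test_failure_counts", {})
    hits = [(name, count) for name, count in counts.items() if count > threshold]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Hypothetical counts in the shape of the count-mode sample output.
sample = json.dumps({"test_failure_counts": {"test A": 7, "test B": 2, "test C": 9}})
print(frequent_failures(sample))  # [('test C', 9), ('test A', 7)]
```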
search command flags:
- [pattern]: Required positional argument - regex pattern to search for in test names or failure messages
- --max-age: Time range to search (default: "336h" / 14 days; e.g., 24h, 168h)
- --since, -s: Time range using duration shorthand (e.g., 24h, 2d, 1w); takes precedence over --max-age
- --context: Number of context lines to show (default: 1)
- --type: Search type: junit, bug, bug+junit, build-log, all (default: "junit")
- --job, -j: Job name filter regex
- --exclude-job: Job names to exclude (regex)
- --max-matches: Maximum matches per file (default: 100, max: 500)
- --count, -c: Count matches per job
- --url, -u: Display only job URLs
- --name, -n: Display only test names
- --failures, -f: Display failure context
- --summary: Display a concise summary of results
- --output, -o: Output format - "text" (default) or "json" for structured data
- --web, -w: Show the web URL for viewing results in browser
- --stats: Include overall statistics (total runs, failure rates, etc.)
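The --since shorthand (24h, 2d, 1w) and the hour-based --max-age values (168h for 7 days, 336h for 14 days) describe the same time ranges. As a rough sketch of the conversion, assuming only the h, d, and w units seen in the examples are valid (how the tool itself parses durations is not documented here):

```python
import re

# Hours per supported unit; only units that appear in the examples are included.
_UNIT_HOURS = {"h": 1, "d": 24, "w": 168}


def to_max_age(since):
    """Convert shorthand such as '2d' or '1w' to an hour string such as '48h'."""
    match = re.fullmatch(r"(\d+)([hdw])", since.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {since!r}")
    amount, unit = match.groups()
    return f"{int(amount) * _UNIT_HOURS[unit]}h"


print(to_max_age("1w"))   # 168h
print(to_max_age("2d"))   # 48h
print(to_max_age("14d"))  # 336h
```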
Start a Model Context Protocol (MCP) server that exposes healthcheck functionality to Large Language Models for intelligent CI failure analysis. This enables AI-powered workflows for advanced pattern recognition and automated reporting.
# Start MCP server with stdio transport (default)
$ healthcheck mcp
# Enable debug mode to see available tools
$ healthcheck mcp --debug
Starting healthcheck MCP server...
Available tools:
- analyze_job_lane: Analyze job failures with patterns
- get_job_failures: Get detailed failure information
- analyze_merge_failures: Cross-job failure analysis
- search_failure_patterns: Find patterns across jobs
- compare_time_periods: Compare failure rates over time
- get_failure_source_context: Parse junit failures and generate GitHub URLs
- analyze_failure_trends: Analyze failure trends and patterns over time periods
- analyze_failure_correlation: Analyze failures across multiple jobs to identify systemic issues
- analyze_quarantine_intelligence: Provide intelligent analysis of quarantined tests and recommendations
- assess_failure_impact: Assess the impact and priority of test failures for triage
- generate_failure_report: Generate comprehensive failure analysis report for stakeholders
- fetch_job_run_logs: Fetch logs and artifacts for a specific job run
- search_ci_failures: Search for CI failures using search.ci.kubevirt.io
The MCP server provides 13 comprehensive tools for enterprise-grade LLM integration:
analyze_job_lane: Analyze recent job runs for a specific CI lane with failure patterns and statistics.
Parameters:
- job_name (required): Name of the CI job to analyze
- since (optional): Time period to analyze (default: "24h")
- include_details (optional): Include detailed failure information (default: true)
get_job_failures: Get detailed failure information for a specific job with stack traces.
Parameters:
- job_name (required): Name of the CI job
- limit (optional): Number of recent runs to analyze (default: 10, max: 100)
- include_stack_traces (optional): Include failure stack traces (default: false)
analyze_merge_failures: Analyze test failures across all merge-time jobs using ci-health data.
Parameters:
- job_filter (optional): Job filter regex or alias (default: ".*")
- test_filter (optional): Test name filter regex (default: ".*")
- include_quarantined (optional): Include quarantined test information (default: true)
search_failure_patterns: Search for specific failure patterns across jobs.
Parameters:
- pattern (required): Regex pattern to search for in test names or failure messages
- job_filter (optional): Job filter regex or alias (default: ".*")
- search_in (optional): Where to search - "test_names", "failure_messages", or "both" (default: "test_names")
compare_time_periods: Compare failure rates between two time periods for a job.
Parameters:
- job_name (required): Name of the CI job to analyze
- recent_period (optional): Recent time period (default: "24h")
- comparison_period (optional): Comparison time period (default: "7d")
get_failure_source_context: Parse JUnit failure output and generate GitHub URLs for source code context with enhanced parsing capabilities.
Enhanced Features:
- Smart format detection: Automatically handles both simple "file:line" format and complex "Type file:line" patterns
- Comprehensive error extraction: Extracts meaningful error messages using pattern matching for common error types
- Multi-file tracking: Captures multiple file references even within the same failure for complete debugging context
- Advanced stack trace parsing: Handles both detailed stack traces and simple file:line references
- GitHub URL generation: Provides actionable GitHub URLs that LLMs can fetch for source code analysis
Parameters:
- failure_text (required): JUnit failure text containing file paths and line numbers
- job_url (required): Job URL to extract repository and commit information
- include_stack_trace (optional): Include parsed stack trace information (default: true)
Supported Input Formats:
- Simple format: pkg/virt-controller/services/template_test.go:2689
- Complex format: Panic pkg/virt-controller/services/template_test.go:2689
- Multi-line errors with file references throughout the failure text
- Cross-file failures with complete context chain for debugging
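The two input formats listed above can be handled with a single pattern that treats the leading type word as optional. The sketch below is an illustration of that idea, not the tool's parser: the regular expression, the function name, the kubevirt/kubevirt repository default, and the abc1234 commit are all assumptions (the real tool derives repository and commit from job_url).

```python
import re

# Optional capitalized type word (e.g. "Panic"), then a Go file path and a line number.
_REF = re.compile(
    r"(?:\b(?P<kind>[A-Z][A-Za-z]*)\s+)?(?P<path>[\w./-]+\.go):(?P<line>\d+)"
)


def source_refs(failure_text, commit, repo="kubevirt/kubevirt"):
    """Return unique (kind, path, line, github_url) tuples in order of appearance."""
    refs = []
    for match in _REF.finditer(failure_text):
        url = (
            f"https://github.com/{repo}/blob/{commit}/"
            f"{match['path']}#L{match['line']}"
        )
        ref = (match["kind"], match["path"], int(match["line"]), url)
        if ref not in refs:
            refs.append(ref)
    return refs


# Failure text taken from the merge -f sample; "abc1234" is a placeholder commit.
text = "Failure tests/pool_test.go:701\nExpected\n<int>: 3\nto equal\n<int>: 4\ntests/pool_test.go:760"
for kind, path, line, url in source_refs(text, "abc1234"):
    print(kind, path, line, url)
```

Both references in the sample text are captured, which corresponds to the multi-file tracking described above: the first with the type word Failure, the second as a bare file:line pair.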
analyze_failure_trends: Analyze failure trends and patterns over time periods with advanced flakiness detection and pattern recognition.
Parameters:
- job_name (required): Name of the CI job to analyze
- trend_period (optional): Time period for trend analysis (default: "14d")
- include_flakiness (optional): Include flakiness analysis (default: true)
Advanced Capabilities:
- Trend direction analysis: Automatically detects improving, degrading, or stable patterns
- Flakiness detection: Identifies intermittent failures with 10-90% failure rate patterns
- Pattern frequency analysis: Tracks failure patterns over time with severity scoring
- Smart recommendations: Differentiates between infrastructure vs code change investigations
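The flakiness rule above (a 10-90% failure rate) and the trend direction labels can be sketched as two small functions. Only the 10% and 90% bounds and the improving/degrading/stable labels come from the description; the inclusive boundaries, the "consistently failing" label, and the 5-point tolerance are assumptions made for illustration.

```python
def classify_test(outcomes, low=0.10, high=0.90):
    """outcomes: list of booleans, True meaning the test failed in that run."""
    if not outcomes:
        return "no data"
    rate = sum(outcomes) / len(outcomes)
    if rate < low:
        return "stable"
    if rate > high:
        return "consistently failing"  # assumed label for rates above the band
    return "flaky"


def trend_direction(older_rate, recent_rate, tolerance=0.05):
    """Compare two failure rates (0.0-1.0); tolerance is an assumed dead band."""
    delta = recent_rate - older_rate
    if delta > tolerance:
        return "degrading"
    if delta < -tolerance:
        return "improving"
    return "stable"


print(classify_test([True, False, False, False]))  # flaky
print(trend_direction(0.10, 0.30))                 # degrading
```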
analyze_failure_correlation: Analyze failures across multiple jobs to identify systemic issues and environment-specific patterns.
Parameters:
- job_pattern (optional): Job pattern or alias to analyze (default: ".*")
- time_window (optional): Time window for correlation analysis (default: "24h")
- include_environment_analysis (optional): Include environment-specific failure analysis (default: true)
Enterprise Features:
- Cross-job correlation: Identifies patterns affecting multiple job types simultaneously
- Environment analysis: ARM64 vs x86, Kubernetes version-specific failures
- Resource issue detection: CPU, memory, disk-related failure patterns
- Systemic issue identification: Infrastructure vs application-level problems
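Environment analysis of this kind depends on knowing each job's architecture and Kubernetes version. One way to obtain them, sketched here under the assumption that they can be read from job names such as pull-kubevirt-e2e-k8s-1.34-sig-compute-arm64, is a pair of regular expressions. The function name, the returned keys, and treating every name without arm64 as x86 are illustrative simplifications, not the tool's documented behavior.

```python
import re

_K8S = re.compile(r"k8s-(\d+\.\d+)")
_SIG = re.compile(r"sig-([a-z]+)")


def job_environment(job_name):
    """Extract architecture, Kubernetes version, and SIG from a job name."""
    k8s = _K8S.search(job_name)
    sig = _SIG.search(job_name)
    return {
        # Assumption: any job without "arm64" in its name runs on x86.
        "arch": "arm64" if "arm64" in job_name else "x86",
        "k8s_version": k8s.group(1) if k8s else None,
        "sig": sig.group(1) if sig else None,
    }


print(job_environment("pull-kubevirt-e2e-k8s-1.34-sig-compute-arm64"))
print(job_environment("pull-kubevirt-unit-test-arm64"))
```

Grouping failures by these keys makes it possible to see whether a test fails only on one architecture or one Kubernetes version.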
analyze_quarantine_intelligence: Provide intelligent analysis of quarantined tests with effectiveness scoring and actionable recommendations.
Parameters:
- scope (optional): Analysis scope - "all", "job", or specific job name (default: "all")
- include_recommendations (optional): Include quarantine action recommendations (default: true)
Intelligence Features:
- Effectiveness scoring: Quantifies how well quarantine decisions are working
- Action recommendations: Remove/extend/investigate with detailed reasoning
- Status analysis: Active vs stale quarantine identification
- Impact assessment: How quarantine decisions affect overall CI health
assess_failure_impact: Assess the impact and priority of test failures for intelligent triage and resource allocation.
Parameters:
- failure_data (required): JSON failure data from lane or merge commands
- context (optional): Context - "pre-release", "development", "production" (default: "development")
- include_triage_recommendations (optional): Include triage priority recommendations (default: true)
Triage Intelligence:
- Context-aware prioritization: Different urgency for production vs development
- Business impact analysis: Critical path vs edge case failure identification
- Resource allocation: Senior engineer vs standard triage recommendations
- Priority assignment: Urgent/normal/low with detailed reasoning
generate_failure_report: Generate comprehensive failure analysis reports for stakeholders with executive summaries and actionable insights.
Parameters:
- scope (optional): Report scope - "daily", "weekly", "release", or specific job (default: "daily")
- format (optional): Report format - "summary", "detailed", "executive" (default: "summary")
- include_recommendations (optional): Include actionable recommendations (default: true)
Enterprise Reporting:
- Executive summaries: High-level CI health status for management
- Key metrics: Overall health, failure rates, critical issue counts
- Trend analysis: Direction and change percentages over time
- Actionable items: Prioritized next steps for development teams
fetch_job_run_logs: Fetch logs, artifacts, and test results for a specific Prow job run. Essential for deep-dive analysis of specific test failures.
Parameters:
- job_url (required): Prow job URL (presubmit, batch, or periodic format)
- include_build_log (optional): Parse and include build log summary (default: true)
- max_build_log_lines (optional): Maximum lines from end of build log to include (default: 50, max: 500)
Supported URL Formats:
- Presubmit: https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/<PR>/<job>/<build-id>
- Periodic: https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/logs/<job>/<build-id>
Capabilities:
- JUnit parsing: Automatically fetches and parses junit.functest.xml for test failures
- Build log analysis: Identifies errors, panics, timeouts, and infrastructure issues
- Artifact discovery: Lists available artifact files in the job's artifacts directory
- GCS path resolution: Automatically resolves Google Cloud Storage paths from Prow URLs
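GCS path resolution follows from the shape of the supported URLs: everything after /view/gs/ is a bucket name followed by an object prefix. A minimal sketch of that split is shown below; the function names are hypothetical and the tool's own resolution logic may differ. Leading slashes are stripped first, so the double-slash URLs printed by the merge command are handled as well.

```python
from urllib.parse import urlparse


def gcs_location(prow_url):
    """Split a Prow view URL into (bucket, object_prefix)."""
    parts = urlparse(prow_url).path.strip("/").split("/")
    if parts[:2] != ["view", "gs"] or len(parts) < 4:
        raise ValueError(f"not a Prow GCS view URL: {prow_url}")
    return parts[2], "/".join(parts[3:])


def gs_uri(prow_url):
    """Format the location as a gs:// URI."""
    bucket, prefix = gcs_location(prow_url)
    return f"gs://{bucket}/{prefix}"


url = (
    "https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/"
    "kubevirt_kubevirt/15455/pull-kubevirt-unit-test-arm64/1958202806657617920"
)
print(gs_uri(url))
```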
search_ci_failures: Search for CI failures across all jobs using the search.ci.kubevirt.io service. Complements search_failure_patterns (which uses ci-health data) with a broader cross-job search.
Parameters:
- query (required): Search pattern (regex) to find in test names or failure messages
- max_age (optional): Time range to search (default: "336h" / 14 days)
- type (optional): Search type - "junit", "bug", "bug+junit", "build-log", "all" (default: "junit")
- job_filter (optional): Job name filter regex
- max_matches (optional): Maximum matches per file (default: 100, max: 500)
The MCP server enables powerful AI-assisted workflows:
# Example prompts you can use with LLM clients:
# "Analyze recent failures in pull-kubevirt-e2e-k8s-1.32-sig-compute"
# "Compare this week's failure rate to last week for unit tests"
# "Find all migration-related failures across all jobs"
# "Generate a release health report for all SIG areas"
# "What are the most critical test failures right now?"
# "Search for timeout-related failures in network tests"
# Enhanced failure source context analysis:
# "Parse this junit failure and show me the GitHub source code where it failed"
# "Extract all file references from this test failure and generate GitHub URLs"
# "Analyze this multi-file failure and provide the complete debugging context"
# "Given this failure text, fetch the source code and explain what might be wrong"
# "Cross-reference this failure with the actual source code to suggest a fix"
# Advanced trend and correlation analysis (NEW):
# "Analyze failure trends for job X over the last 30 days and detect flaky tests"
# "Identify systemic issues affecting multiple jobs in the compute category"
# "Correlate failures across ARM64 and x86 jobs to find environment-specific problems"
# "Generate a comprehensive quarantine intelligence report with effectiveness scores"
# "Assess the business impact of current test failures and prioritize for triage"
# "Create an executive summary of CI health for the weekly engineering meeting"
# "Detect infrastructure vs application-level problems in recent failures"
# "Recommend which quarantined tests should be removed or extended based on effectiveness"
All MCP tools return structured JSON data optimized for LLM consumption, including:
- Health status: "critical", "unhealthy", "unstable", "acceptable", "healthy"
- Failure patterns: Categorized by compute, network, storage, migration, operator
- Statistics: Failure rates, run counts, unique test counts
- Trends: Improvement/regression detection, stability analysis, flakiness scoring
- Recommendations: Actionable next steps based on failure patterns and context
- Time analysis: Duration calculations, period comparisons, trend directions
- Potential causes: Inferred based on test names and failure patterns
- Enhanced source context: GitHub URLs, file paths, line numbers, error messages, and stack traces
- Multi-file debugging: Complete context chain with cross-file failure references
- Advanced analytics: Correlation analysis, quarantine intelligence, impact assessment
- Enterprise reporting: Executive summaries, key metrics, trend analysis, actionable items
- Context-aware prioritization: Business impact, triage recommendations, resource allocation
- Environment analysis: Architecture-specific failures, Kubernetes version patterns
- Flakiness detection: 10-90% failure rate patterns with frequency analysis
mcp command flags:
- --port, -p: Port to listen on (0 for stdio, default: 0)
- --host, -H: Host to bind to (default: "localhost")
- --stdio, -s: Use stdio transport (default: true)
- --debug, -d: Enable debug logging to see tool information
To use this MCP server with Claude CLI or Claude Desktop, you need to configure it as an MCP server in your settings:
Add to your ~/.config/claude-cli/mcp_servers.json:
{
  "kubevirt-healthcheck": {
    "command": "/path/to/healthcheck",
    "args": ["mcp"],
    "env": {}
  }
}
Then enable it with:
claude mcp install kubevirt-healthcheck
Add to your Claude Desktop MCP settings (typically ~/Library/Application Support/Claude/claude_desktop_config.json on macOS):
{
  "mcpServers": {
    "kubevirt-healthcheck": {
      "command": "/path/to/healthcheck",
      "args": ["mcp"]
    }
  }
}
Once configured, you can use natural language prompts with Claude:
- "Analyze recent failures in pull-kubevirt-e2e-k8s-1.32-sig-compute and tell me what's broken"
- "Compare failure rates between this week and last week for unit tests"
- "Generate a comprehensive CI health report for all KubeVirt job categories"
- "Find all timeout-related failures across network tests and suggest fixes"
- "What are the top 3 most critical test failures I should prioritize?"
- "Search for migration failures and group them by potential root cause"
The MCP server provides Claude with direct access to KubeVirt CI data, enabling sophisticated analysis, pattern recognition, and actionable recommendations that would be difficult to achieve with manual investigation.