lyarwood/healthcheck

healthcheck

A command line tool to analyze KubeVirt CI failures using two complementary data sources and analysis approaches.

Data Sources & Approaches

The two primary commands, merge and lane, use different data sources:

merge - CI-Health Aggregated Data

  • Data Source: Uses pre-aggregated failure data from kubevirt/ci-health JSON API
  • Coverage: Analyzes failures across all merge-time jobs (main branch, release branches)
  • Time Range: Limited to the data available in ci-health (typically the last ~48 hours)
  • Performance: Fast - processes pre-computed aggregations
  • Use Case: Quick overview of current CI health across all job types

lane - Live Prow Data Crawling

  • Data Source: Crawls live Prow web pages and fetches individual job artifacts from multiple sources (presubmit, batch, periodic)
  • Coverage: Analyzes any specific job lane in real-time, including batch jobs
  • Time Range: Flexible - can go back weeks/months with automatic pagination
  • Performance: Slower - fetches and parses individual job data on-demand
  • Use Case: Deep dive analysis of specific job lanes with historical data
  • Job Types: Supports presubmit, batch, periodic, and postsubmit jobs with per-type failure statistics

Installation

go build
./healthcheck --help

Lane Command - Live Prow Data Analysis

Analyze recent job runs for a specific CI lane by crawling live Prow web pages and artifacts. Provides real-time data with flexible time ranges and automatic pagination.

Basic Usage

# Analyze recent runs for a specific job
$ healthcheck lane pull-kubevirt-unit-test-arm64

# Limit to a specific number of runs (default: 10)
$ healthcheck lane pull-kubevirt-e2e-k8s-1.32-sig-compute --limit 20

Output Formats

# Count test failures across runs
$ healthcheck lane pull-kubevirt-unit-test-arm64 --limit 5 -c
2	VirtualMachineInstance migration target DomainNotifyServerRestarts should establish a notify server pipe should be resilient to notify server restarts

	https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15455/pull-kubevirt-unit-test-arm64/1958202806657617920

	https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15447/pull-kubevirt-unit-test-arm64/1958193812496977920

1	Migration watcher Migration backoff should not be applied if it is not an evacuation with workload update annotation

	https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15388/pull-kubevirt-unit-test-arm64/1958193968416034816

# Show only test names
$ healthcheck lane pull-kubevirt-unit-test-arm64 --limit 3 -n
VirtualMachineInstance migration target DomainNotifyServerRestarts should establish a notify server pipe should be resilient to notify server restarts
Migration watcher Migration backoff should not be applied if it is not an evacuation with workload update annotation

# Show only failed job URLs
$ healthcheck lane pull-kubevirt-unit-test-arm64 --limit 3 -u
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15455/pull-kubevirt-unit-test-arm64/1958202806657617920
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15388/pull-kubevirt-unit-test-arm64/1958193968416034816

# Show failure details with context
$ healthcheck lane pull-kubevirt-unit-test-arm64 --limit 3 -f
VirtualMachineInstance migration target DomainNotifyServerRestarts should establish a notify server pipe should be resilient to notify server restarts
goroutine 1847 [running]:
testing.tRunner.func1.2({0x2b2e5a0, 0xc001638690})
	/opt/hostedtoolcache/go/1.21.13/x64/lib/go/src/testing/testing.go:1631 +0x2ff
...

https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15455/pull-kubevirt-unit-test-arm64/1958202806657617920

# Output structured JSON data for machine processing
$ healthcheck lane pull-kubevirt-unit-test-arm64 --limit 3 --output json
{
  "job_name": "pull-kubevirt-unit-test-arm64",
  "all_failures": [
    {
      "Name": "VirtualMachineInstance migration target DomainNotifyServerRestarts...",
      "URL": "https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/...",
      "Failure": "goroutine 1847 [running]:\ntesting.tRunner.func1.2..."
    }
  ]
}

# JSON output with count mode
$ healthcheck lane pull-kubevirt-unit-test-arm64 --limit 10 -c --output json
{
  "job_name": "pull-kubevirt-unit-test-arm64",
  "test_failures": {
    "Test Name 1": [
      {"Name": "Test Name 1", "URL": "...", "Failure": "..."},
      {"Name": "Test Name 1", "URL": "...", "Failure": "..."}
    ]
  }
}
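
As a rough illustration of consuming the count-mode JSON, the Python sketch below ranks tests by how many entries they have. It assumes the output has the shape shown above (a test_failures object mapping each test name to a list of entries with Name, URL and Failure). The sample payload, its example.invalid URLs and the rank_failures helper are made up for this example and are not part of healthcheck.

```python
import json

# Hypothetical sample shaped like `healthcheck lane ... -c --output json`.
SAMPLE = """
{
  "job_name": "pull-kubevirt-unit-test-arm64",
  "test_failures": {
    "Test Name 1": [
      {"Name": "Test Name 1", "URL": "https://example.invalid/run/1", "Failure": "..."},
      {"Name": "Test Name 1", "URL": "https://example.invalid/run/2", "Failure": "..."}
    ],
    "Test Name 2": [
      {"Name": "Test Name 2", "URL": "https://example.invalid/run/1", "Failure": "..."}
    ]
  }
}
"""


def rank_failures(payload: dict) -> list[tuple[str, int, list[str]]]:
    """Return (test name, failure count, run URLs), most frequent first."""
    rows = []
    for name, entries in payload.get("test_failures", {}).items():
        urls = [entry["URL"] for entry in entries]
        rows.append((name, len(entries), urls))
    # Sort by count descending, then by name so ties are deterministic.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


if __name__ == "__main__":
    for name, count, urls in rank_failures(json.loads(SAMPLE)):
        print(f"{count}\t{name}")
        for url in urls:
            print(f"\t{url}")
```

To try it on real data, replace SAMPLE with the text captured from the command's JSON output.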

Time-Based Analysis (Automatic Pagination)

The --since flag automatically paginates to find ALL results within the time period, ignoring --limit.

# Find all failures in the last hour
$ healthcheck lane pull-kubevirt-unit-test-arm64 --since 1h -c
1	VirtualMachineInstance migration target DomainNotifyServerRestarts should establish a notify server pipe should be resilient to notify server restarts

	https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15455/pull-kubevirt-unit-test-arm64/1958202806657617920

# Analyze longer time periods - automatically finds all results
$ healthcheck lane pull-kubevirt-unit-test-arm64 --since 2d --summary
Lane Summary: pull-kubevirt-unit-test-arm64
===========================================

Time Range:
  First Run:  2025-08-18 19:03:13 UTC
  Last Run:   2025-08-20 16:22:12 UTC
  Duration:   1.9 days

Test Run Statistics:
  Total Runs:     92
  Successful:     62
  Failed:         15
  Unknown:        15
  Failure Rate:   16.3%

Test Failure Statistics:
  Total Failures: 78
  Unique Tests:   70

Failure Categories:
  migration : 8 (10.3%)
  general   : 3 (3.8%)
  storage   : 2 (2.6%)

Most Frequent Failures:
  1. [migration] VirtualMachineInstance migration target DomainNotifyServe... (8 failures, 10.3%)
  2. [general] VirtualMachineInstance watcher On valid VirtualMachineIns... (2 failures, 2.6%)
  3. [storage] VirtualMachineInstance watcher On valid VirtualMachineIns... (1 failures, 1.3%)

Pattern Analysis:
  🟢 Very low failure rate - stable
  🔀 Diverse failure patterns - no clear dominant issue

# Time period examples
$ healthcheck lane pull-kubevirt-e2e-k8s-1.32-sig-compute --since 6h    # Last 6 hours
$ healthcheck lane pull-kubevirt-unit-test-arm64 --since 3d             # Last 3 days  
$ healthcheck lane pull-kubevirt-e2e-k8s-1.31-sig-storage --since 1w    # Last week
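
The search examples in this document equate 7 days with 168h and 14 days with 336h. A minimal sketch of that conversion follows, assuming the shorthand is a whole number followed by h, d or w. The exact grammar healthcheck accepts is not specified here, and to_hours is a hypothetical helper.

```python
import re

# Hours per unit for the shorthand used in the examples (h, d, w).
_UNIT_HOURS = {"h": 1, "d": 24, "w": 24 * 7}


def to_hours(shorthand: str) -> int:
    """Convert a duration such as '6h', '2d' or '1w' to whole hours."""
    match = re.fullmatch(r"(\d+)([hdw])", shorthand.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {shorthand!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_HOURS[unit]


if __name__ == "__main__":
    for value in ("6h", "3d", "1w", "2w"):
        print(f"{value} = {to_hours(value)}h")
```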

Summary Analysis

# Get comprehensive failure pattern analysis
$ healthcheck lane pull-kubevirt-unit-test-arm64 --limit 25 --summary
Lane Summary: pull-kubevirt-unit-test-arm64
===========================================

Time Range:
  First Run:  2025-08-20 09:48:24 UTC
  Last Run:   2025-08-20 16:22:12 UTC
  Duration:   6.6 hours

Test Run Statistics:
  Total Runs:     25
  Successful:     16
  Failed:         7
  Running:        2
  Failure Rate:   28.0%
  Job Types:
    presubmit   : 20 (80.0%, 30.0% failure rate)
    batch       : 5 (20.0%, 20.0% failure rate)

Individual Runs:
  1. ✓ SUCCESS    [presubmit]
     https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/...
  2. ⋯ PENDING    [presubmit]
     https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/...
  3. ✗ FAILURE    [presubmit] - 2 failure(s)
     https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/...
  ... (truncated for brevity)

Failure Analysis:
  Total Failures: 28
  Unique Tests:   25
  Infrastructure: 42.9% of all failures

Failure Categories:
  migration : 4 (14.3%)
  general   : 1 (3.6%)
  storage   : 3 (10.7%)

Most Frequent Failures:
  1. [migration] VirtualMachineInstance migration target DomainNotifyServe... (4 failures, 14.3%)
  2. [general] VirtualMachineInstance watcher On valid VirtualMachineIns... (1 failures, 3.6%)
  3. [storage] VirtualMachineInstance watcher Aggregating DataVolume con... (1 failures, 3.6%)

Pattern Analysis:
  🟠 Low failure rate - normal fluctuation
  🔀 Diverse failure patterns - no clear dominant issue
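
The per-type percentages in the sample summary above can be cross-checked by hand. The sketch below assumes the failure rate is failed runs divided by total runs, which is consistent with this particular sample (7 / 25 = 28.0%). The per-type failed counts (6 and 1) are not printed by the tool; they are inferred here from the 30.0% and 20.0% rates.

```python
# (runs, failed) per job type. Run counts come from the sample; failed counts
# are inferred from the printed rates (30.0% of 20 = 6, 20.0% of 5 = 1).
RUNS = {"presubmit": (20, 6), "batch": (5, 1)}


def pct(part: int, whole: int) -> str:
    """Format part/whole as a percentage with one decimal place."""
    return f"{100 * part / whole:.1f}%"


def summarize(runs: dict[str, tuple[int, int]]) -> dict[str, str]:
    """Rebuild the overall and per-type lines of the summary."""
    total = sum(count for count, _ in runs.values())
    failed = sum(bad for _, bad in runs.values())
    lines = {"overall": pct(failed, total)}
    for job_type, (count, bad) in runs.items():
        # Share of all runs first, then the failure rate within this type.
        lines[job_type] = (
            f"{count} ({pct(count, total)}, {pct(bad, count)} failure rate)"
        )
    return lines


if __name__ == "__main__":
    for key, line in summarize(RUNS).items():
        print(f"{key:<10}: {line}")
```

The six presubmit and one batch failures add up to the 7 failed runs reported for the lane as a whole.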

Job Type Filtering

The lane command supports filtering by job type with --type, so a single job category can be analyzed on its own:

# Filter by batch jobs only
$ healthcheck lane pull-kubevirt-e2e-k8s-1.34-sig-compute-arm64 --type batch --summary -s 7d
Lane Summary: pull-kubevirt-e2e-k8s-1.34-sig-compute-arm64
==========================================================

Test Run Statistics:
  Total Runs:     19
  Successful:     16
  Failed:         1
  Running:        2
  Failure Rate:   5.3%
  Job Types:
    batch       : 19 (100.0%, 5.3% failure rate)

# Filter by presubmit jobs only
$ healthcheck lane pull-kubevirt-e2e-k8s-1.34-sig-compute-arm64 --type presubmit --summary -s 7d

# Filter by periodic or postsubmit jobs
$ healthcheck lane periodic-kubevirt-e2e-k8s-1.32-sig-network --type periodic --summary

This enables comparison of failure rates between different job types and helps identify if certain types (e.g., batch vs presubmit) have different failure characteristics.

Job Type Statistics

Lane summaries include per-job-type statistics showing both the distribution of runs and the failure rate of each type:

$ healthcheck lane pull-kubevirt-e2e-k8s-1.34-sig-compute-arm64 --summary -s 7d
...
Test Run Statistics:
  Total Runs:     143
  Successful:     116
  Failed:         6
  Aborted:        15
  Running:        4
  Unknown:        2
  Failure Rate:   16.1%
  Job Types:
    presubmit   : 122 (85.3%, 16.4% failure rate)
    batch       : 19 (13.3%, 5.3% failure rate)

This helps identify which job types are most stable and which need attention. Note that pending and running jobs are excluded from failure statistics.


Merge Command - CI-Health Aggregated Analysis

Analyze test failures across all merge-time jobs using pre-computed data from the ci-health project. Fast analysis of current CI health trends.

Job Filtering

# Filter by job name or alias (positional argument)
$ healthcheck merge compute                       # sig-compute jobs
$ healthcheck merge "sig-compute.*arm64"          # ARM64 compute jobs (custom regex)
$ healthcheck merge network                       # sig-network jobs
$ healthcheck merge "1.6"                         # release-1.6 jobs
$ healthcheck merge main                          # main branch jobs

# Available job aliases:
# - main: main branch jobs
# - compute: sig-compute related jobs  
# - network: sig-network jobs
# - storage: sig-storage jobs
# - 1.6, 1.5, 1.4: release branch jobs

Output Formats

# Count failures by test name
$ healthcheck merge compute -c
3	[sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates

	https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15098/pull-kubevirt-e2e-k8s-1.32-sig-compute/1944655730044833792

	https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15182/pull-kubevirt-e2e-k8s-1.31-sig-compute/1945105449749581824

2	[virtctl] [crit:medium][vendor:cnv-qe@redhat.com][level:component][sig-compute] usbredir Should work several times

	https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15110/pull-kubevirt-e2e-k8s-1.32-sig-compute/1943363976574275584

# Show only test names for external processing
$ healthcheck merge compute -n | head -5
[sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates
[sig-compute] Infrastructure cluster profiler for pprof data aggregation when ClusterProfiler configuration is enabled it should allow subresource access
[virtctl] [crit:medium][vendor:cnv-qe@redhat.com][level:component][sig-compute] usbredir Should work several times
[sig-compute]VirtualMachinePool pool should scale to five, to six and then to zero replicas
[sig-compute] [rfe_id:1177][crit:medium] VirtualMachine with paused vmi [test_id:3229]should gracefully handle being started again

# Show only URLs for browser opening
$ healthcheck merge compute -u | head -3
https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15098/pull-kubevirt-e2e-k8s-1.32-sig-compute/1944655730044833792
https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15182/pull-kubevirt-e2e-k8s-1.31-sig-compute/1945105449749581824
https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15122/pull-kubevirt-e2e-k8s-1.33-sig-compute/1943094557549793280

# Show failure context and stack traces
$ healthcheck merge compute -c -f
3	[sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates

	Failure tests/pool_test.go:701
	Expected
	    <int>: 3
	to equal
	    <int>: 4
	tests/pool_test.go:760

	https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15098/pull-kubevirt-e2e-k8s-1.32-sig-compute/1944655730044833792

# Output structured JSON data for machine processing
$ healthcheck merge compute --output json
{
  "failed_tests": {
    "[sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates": [
      {
        "Name": "[sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates",
        "URL": "https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/...",
        "Failure": {
          "Message": "",
          "Type": "Failure",
          "Value": "Failure tests/pool_test.go:701\nExpected..."
        }
      }
    ]
  },
  "lane_run_failures": {...}
}

# JSON output with count mode
$ healthcheck merge compute -c --output json
{
  "test_failure_counts": {
    "[sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates": 3,
    "[virtctl] usbredir Should work several times": 2
  },
  "failed_tests": {...}
}
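
The merge JSON differs from the lane JSON in one detail: in the sample above, each entry's Failure is an object with Message, Type and Value rather than a plain string. The small sketch below assumes that shape and returns the first line of the failure text for each failing test. The sample payload, its example.invalid URL and the first_lines helper are illustrative only.

```python
import json

# Hypothetical payload shaped like `healthcheck merge compute --output json`.
SAMPLE = """
{
  "failed_tests": {
    "[sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates": [
      {
        "Name": "[sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates",
        "URL": "https://example.invalid/run/1",
        "Failure": {
          "Message": "",
          "Type": "Failure",
          "Value": "Failure tests/pool_test.go:701\\nExpected\\n    <int>: 3"
        }
      }
    ]
  }
}
"""


def first_lines(payload: dict) -> dict[str, str]:
    """Map each failing test to the first line of its first failure."""
    result = {}
    for name, entries in payload.get("failed_tests", {}).items():
        if not entries:
            continue
        # Failure is an object here, so the text lives under "Value".
        value = entries[0]["Failure"]["Value"]
        result[name] = value.splitlines()[0] if value else ""
    return result


if __name__ == "__main__":
    for test, line in first_lines(json.loads(SAMPLE)).items():
        print(f"{line}\t{test}")
```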

Advanced Features

# Group by lane run UUID for failure correlation
$ healthcheck merge compute --lane-run
Lane Run 1944655730044833792 (3 failures)

	[sig-compute]VirtualMachinePool should respect maxUnavailable strategy during updates
	https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15098/pull-kubevirt-e2e-k8s-1.32-sig-compute/1944655730044833792

# Highlight quarantined tests
$ healthcheck merge compute -c --quarantine
2	[QUARANTINED] [sig-compute] should include VMI infos for a running VM

	https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/15098/pull-kubevirt-e2e-k8s-1.32-sig-compute/1944655730044833792

# Time filtering (limited to available ci-health data - typically last ~48 hours)
$ healthcheck merge compute --since 2d       # Filter by time period

Search Command - CI Failure Search

Search for test failures across all CI jobs using the search.ci.kubevirt.io service. Useful for investigating recurring failures, understanding failure patterns, and finding related issues.

Basic Usage

# Search for a specific test failure
$ healthcheck search "Operator should reconcile components"

# Show a concise summary with job breakdown
$ healthcheck search "migration" --summary

# Search within the last 7 days
$ healthcheck search "timeout" --max-age 168h

# Filter to only compute jobs
$ healthcheck search "VMI" --job ".*compute.*"

Output Formats

# Count matches per job
$ healthcheck search "network" -c
pull-kubevirt-e2e-k8s-1.32-sig-network: 5 matches
pull-kubevirt-e2e-k8s-1.34-sig-network: 3 matches

# Show only test names
$ healthcheck search "migration" -n
[sig-compute] VirtualMachineInstance migration target should ...
[sig-compute] VirtualMachineInstance migration should migrate ...

# Show only job URLs
$ healthcheck search "storage" -u
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/...

# Show failure context lines
$ healthcheck search "deadline exceeded" -f

# Output structured JSON data for machine processing
$ healthcheck search "compute" --output json
{
  "query": "compute",
  "time_range": "336h",
  "total_matches": 40,
  "total_jobs": 10,
  "job_stats": { ... },
  "job_results": [ ... ]
}

# Include overall statistics (total runs, failure rates)
$ healthcheck search "migration" --summary --stats

# Get the web URL for viewing results in browser
$ healthcheck search "operator" --summary -w
Web URL: https://search.ci.kubevirt.io/?search=operator&...

Time Range Control

# Default is 14 days (336h)
$ healthcheck search "timeout"

# Search the last 24 hours
$ healthcheck search "panic" --since 24h

# Search the last 7 days
$ healthcheck search "disk" --max-age 168h

# Search the last 48 hours
$ healthcheck search "eviction" --since 2d

Job Filtering

# Filter to a specific job family
$ healthcheck search "VMI" --job "periodic.*"

# Exclude ARM64 jobs
$ healthcheck search "disk" --exclude-job ".*arm64.*"

# Combine filters
$ healthcheck search "migration" --job ".*compute.*" --exclude-job ".*arm64.*"
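
How --job and --exclude-job combine can be pictured as an include regex followed by an exclude regex. The sketch below uses Python's re.search for both. Whether healthcheck or search.ci.kubevirt.io anchors the patterns, and which regex dialect applies, is an assumption here. The job names are taken from examples in this document, and filter_jobs is a hypothetical helper.

```python
import re

JOBS = [
    "pull-kubevirt-e2e-k8s-1.32-sig-compute",
    "pull-kubevirt-e2e-k8s-1.34-sig-compute-arm64",
    "pull-kubevirt-e2e-k8s-1.32-sig-network",
    "periodic-kubevirt-e2e-k8s-1.32-sig-network",
]


def filter_jobs(jobs, include=None, exclude=None):
    """Keep jobs matching `include` (if given) and not matching `exclude`."""
    selected = []
    for job in jobs:
        if include and not re.search(include, job):
            continue
        if exclude and re.search(exclude, job):
            continue
        selected.append(job)
    return selected


if __name__ == "__main__":
    # Mirrors: --job ".*compute.*" --exclude-job ".*arm64.*"
    print(filter_jobs(JOBS, include=".*compute.*", exclude=".*arm64.*"))
```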

Practical Workflows

Daily Failure Triage

# Quick overview of current failures across all jobs (ci-health data)
$ healthcheck merge compute -c | head -10

# Deep dive into a specific failing job with historical context (live Prow data)
$ healthcheck lane pull-kubevirt-e2e-k8s-1.32-sig-compute --since 24h --summary

# Open all failure URLs in browser tabs
$ healthcheck merge compute -u | sort | uniq | xargs google-chrome

Trend Analysis

# Compare failure rates over different time periods (live Prow data)
$ healthcheck lane pull-kubevirt-unit-test-arm64 --since 24h --summary
$ healthcheck lane pull-kubevirt-unit-test-arm64 --since 1w --summary

# Identify most frequent failures across all jobs (ci-health data)
$ healthcheck merge -n | sort | uniq -c | sort -rn | head -10

Debugging Specific Issues

# Find all occurrences of a specific test failure
$ healthcheck merge -n | grep -i "migration"

# Get failure context for debugging
$ healthcheck merge compute -f | grep -A5 -B5 "timeout"

# Analyze quarantined tests
$ healthcheck merge --quarantine -c

Machine Processing and Automation

# Export failure data as JSON for further processing
$ healthcheck merge compute --output json > compute_failures.json

# Export lane analysis as JSON for trending tools
$ healthcheck lane pull-kubevirt-unit-test-arm64 --since 7d --summary --output json > lane_trend.json

# Use JSON output with jq for advanced filtering
$ healthcheck merge -c --output json | jq '.test_failure_counts | to_entries[] | select(.value > 5)'

# Export specific failure URLs for automated issue creation
$ healthcheck merge storage -u --output json | jq -r '.urls[]'

# Get test names for automated quarantine decisions
$ healthcheck lane pull-kubevirt-e2e-k8s-1.32-sig-compute --since 3d -c --output json | jq -r '.test_failures | keys[]'

CI Health Monitoring

# Monitor overall health of different job categories (live Prow data with historical context)
$ healthcheck lane pull-kubevirt-e2e-k8s-1.32-sig-compute --since 1d --summary
$ healthcheck lane pull-kubevirt-e2e-k8s-1.32-sig-network --since 1d --summary  
$ healthcheck lane pull-kubevirt-e2e-k8s-1.32-sig-storage --since 1d --summary

# Track specific job stability over time (weeks of historical data)
$ healthcheck lane pull-kubevirt-unit-test-arm64 --since 1w --summary

Command Reference

Lane Command Flags (Live Prow Data)

  • --limit, -l: Number of recent runs to analyze (ignored when --since is used)
  • --since, -s: Fetch all results within time period (e.g., 24h, 2d, 1w) with automatic pagination
  • --type, -t: Filter jobs by type (e.g., batch, presubmit, periodic, postsubmit)
  • --count, -c: Count specific test failures
  • --url, -u: Display only failed job URLs
  • --name, -n: Display only failed test names
  • --failures, -f: Print captured failure context
  • --summary: Display concise summary with failure patterns and statistics (includes per-job-type failure rates)
  • --output, -o: Output format - "text" (default) or "json" for structured data

Merge Command Flags (CI-Health Data)

  • [job-name-or-alias]: Required positional argument - job regex or alias (compute, network, storage, main, 1.6, 1.5, 1.4)
  • --test, -t: Filter by test name regex
  • --count, -c: Count specific test failures
  • --url, -u: Display only failure URLs
  • --name, -n: Display only test names
  • --failures, -f: Print captured failure context
  • --lane-run, -l: Group failures by lane run UUID
  • --quarantine, -q: Highlight quarantined tests
  • --since, -s: Filter results by time period (limited to available ci-health data ~48h)
  • --summary: Display a concise summary of failures and patterns
  • --output, -o: Output format - "text" (default) or "json" for structured data

Search Command Flags (CI Search)

  • [pattern]: Required positional argument - regex pattern to search for in test names or failure messages
  • --max-age: Time range to search (default: "336h" / 14 days; e.g., 24h, 168h)
  • --since, -s: Time range using duration shorthand (e.g., 24h, 2d, 1w) — takes precedence over --max-age
  • --context: Number of context lines to show (default: 1)
  • --type: Search type: junit, bug, bug+junit, build-log, all (default: "junit")
  • --job, -j: Job name filter regex
  • --exclude-job: Job names to exclude (regex)
  • --max-matches: Maximum matches per file (default: 100, max: 500)
  • --count, -c: Count matches per job
  • --url, -u: Display only job URLs
  • --name, -n: Display only test names
  • --failures, -f: Display failure context
  • --summary: Display a concise summary of results
  • --output, -o: Output format - "text" (default) or "json" for structured data
  • --web, -w: Show the web URL for viewing results in browser
  • --stats: Include overall statistics (total runs, failure rates, etc.)

MCP Command - LLM-Assisted CI Analysis

Start a Model Context Protocol (MCP) server that exposes healthcheck functionality to Large Language Models for intelligent CI failure analysis. This enables AI-powered workflows for advanced pattern recognition and automated reporting.

Starting the MCP Server

# Start MCP server with stdio transport (default)
$ healthcheck mcp

# Enable debug mode to see available tools
$ healthcheck mcp --debug
Starting healthcheck MCP server...
Available tools:
- analyze_job_lane: Analyze job failures with patterns
- get_job_failures: Get detailed failure information
- analyze_merge_failures: Cross-job failure analysis
- search_failure_patterns: Find patterns across jobs
- compare_time_periods: Compare failure rates over time
- get_failure_source_context: Parse junit failures and generate GitHub URLs
- analyze_failure_trends: Analyze failure trends and patterns over time periods
- analyze_failure_correlation: Analyze failures across multiple jobs to identify systemic issues
- analyze_quarantine_intelligence: Provide intelligent analysis of quarantined tests and recommendations
- assess_failure_impact: Assess the impact and priority of test failures for triage
- generate_failure_report: Generate comprehensive failure analysis report for stakeholders
- fetch_job_run_logs: Fetch logs and artifacts for a specific job run
- search_ci_failures: Search for CI failures using search.ci.kubevirt.io

Available MCP Tools

The MCP server provides 13 tools:

1. analyze_job_lane

Analyze recent job runs for a specific CI lane with failure patterns and statistics.

Parameters:

  • job_name (required): Name of the CI job to analyze
  • since (optional): Time period to analyze (default: "24h")
  • include_details (optional): Include detailed failure information (default: true)
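
For clients that speak MCP directly, a call to this tool is a JSON-RPC 2.0 request with the method tools/call, as defined by the Model Context Protocol. The sketch below only builds the request message. It does not start the server or perform the initialize handshake that an MCP client must complete first. The build_tool_call helper and the request id are illustrative.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a single-line JSON message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps without indent emits no newlines, as the stdio transport
    # expects one message per line.
    return json.dumps(message)


if __name__ == "__main__":
    print(build_tool_call(
        1,
        "analyze_job_lane",
        {
            "job_name": "pull-kubevirt-unit-test-arm64",
            "since": "24h",
            "include_details": True,
        },
    ))
```

In practice an MCP client library would normally construct and send this message; the raw form is shown only to make the parameter mapping concrete.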

2. get_job_failures

Get detailed failure information for a specific job with stack traces.

Parameters:

  • job_name (required): Name of the CI job
  • limit (optional): Number of recent runs to analyze (default: 10, max: 100)
  • include_stack_traces (optional): Include failure stack traces (default: false)

3. analyze_merge_failures

Analyze test failures across all merge-time jobs using ci-health data.

Parameters:

  • job_filter (optional): Job filter regex or alias (default: ".*")
  • test_filter (optional): Test name filter regex (default: ".*")
  • include_quarantined (optional): Include quarantined test information (default: true)

4. search_failure_patterns

Search for specific failure patterns across jobs.

Parameters:

  • pattern (required): Regex pattern to search for in test names or failure messages
  • job_filter (optional): Job filter regex or alias (default: ".*")
  • search_in (optional): Where to search - "test_names", "failure_messages", or "both" (default: "test_names")

5. compare_time_periods

Compare failure rates between two time periods for a job.

Parameters:

  • job_name (required): Name of the CI job to analyze
  • recent_period (optional): Recent time period (default: "24h")
  • comparison_period (optional): Comparison time period (default: "7d")
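
The comparison itself amounts to two failure rates and their difference. The sketch below is one way to picture it, not the tool's documented algorithm: the 1-point tolerance and the compare helper are assumptions. The run counts are borrowed from the two lane summaries shown earlier in this document (25 runs with 7 failed, and 92 runs with 15 failed).

```python
from dataclasses import dataclass


@dataclass
class Period:
    label: str
    total_runs: int
    failed_runs: int

    @property
    def failure_rate(self) -> float:
        """Failure rate in percent; 0.0 when there were no runs."""
        if self.total_runs == 0:
            return 0.0
        return 100 * self.failed_runs / self.total_runs


def compare(recent: Period, baseline: Period, tolerance: float = 1.0):
    """Return (delta in percentage points, trend direction)."""
    delta = recent.failure_rate - baseline.failure_rate
    if delta > tolerance:
        direction = "degrading"
    elif delta < -tolerance:
        direction = "improving"
    else:
        direction = "stable"
    return round(delta, 1), direction


if __name__ == "__main__":
    recent = Period("recent", total_runs=25, failed_runs=7)
    baseline = Period("comparison", total_runs=92, failed_runs=15)
    print(compare(recent, baseline))
```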

6. get_failure_source_context

Parse JUnit failure output and generate GitHub URLs for source code context with enhanced parsing capabilities.

Enhanced Features:

  • Smart format detection: Automatically handles both simple "file:line" format and complex "Type file:line" patterns
  • Comprehensive error extraction: Extracts meaningful error messages using pattern matching for common error types
  • Multi-file tracking: Captures multiple file references even within the same failure for complete debugging context
  • Advanced stack trace parsing: Handles both detailed stack traces and simple file:line references
  • GitHub URL generation: Provides actionable GitHub URLs that LLMs can fetch for source code analysis

Parameters:

  • failure_text (required): JUnit failure text containing file paths and line numbers
  • job_url (required): Job URL to extract repository and commit information
  • include_stack_trace (optional): Include parsed stack trace information (default: true)

Supported Input Formats:

  • Simple format: pkg/virt-controller/services/template_test.go:2689
  • Complex format: Panic pkg/virt-controller/services/template_test.go:2689
  • Multi-line errors with file references throughout the failure text
  • Cross-file failures with complete context chain for debugging
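
To make the two input formats concrete, the sketch below extracts file:line references (with an optional leading type word such as Panic or Failure) and turns them into GitHub blob URLs. It is an illustration, not the tool's implementation: the regular expressions, the source_links helper and the default ref of "main" are assumptions. The real tool derives commit information from the job URL, which this sketch does not attempt, and only presubmit URLs containing an org_repo segment are handled.

```python
import re

# Optional capitalized type word on the same line, then a Go file and line.
_REF = re.compile(
    r"(?:(?P<kind>[A-Z][A-Za-z]*)[ \t]+)?(?P<path>[\w./-]+\.go):(?P<line>\d+)"
)
# Presubmit URLs carry the repository as <org>_<repo>.
_REPO = re.compile(r"/pr-logs/pull/(?P<org>[^/_]+)_(?P<repo>[^/]+)/")


def source_links(failure_text: str, job_url: str, ref: str = "main") -> list[dict]:
    """Return one entry per file:line reference found in the failure text."""
    repo = _REPO.search(job_url)
    if repo is None:
        raise ValueError("not a presubmit job URL")
    base = f"https://github.com/{repo['org']}/{repo['repo']}/blob/{ref}"
    links = []
    for m in _REF.finditer(failure_text):
        links.append({
            "type": m["kind"],  # None for the simple "file:line" format
            "file": m["path"],
            "line": int(m["line"]),
            "url": f"{base}/{m['path']}#L{m['line']}",
        })
    return links


if __name__ == "__main__":
    text = "Failure tests/pool_test.go:701\nExpected\n    <int>: 3\ntests/pool_test.go:760"
    url = ("https://prow.ci.kubevirt.io//view/gs/kubevirt-prow/pr-logs/pull/"
           "kubevirt_kubevirt/15098/pull-kubevirt-e2e-k8s-1.32-sig-compute/1944655730044833792")
    for link in source_links(text, url):
        print(link["type"], link["url"])
```

Both references in the sample failure are captured, which corresponds to the multi-file tracking behavior described above.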

7. analyze_failure_trends

Analyze failure trends and patterns over time periods with advanced flakiness detection and pattern recognition.

Parameters:

  • job_name (required): Name of the CI job to analyze
  • trend_period (optional): Time period for trend analysis (default: "14d")
  • include_flakiness (optional): Include flakiness analysis (default: true)

Advanced Capabilities:

  • Trend direction analysis: Automatically detects improving, degrading, or stable patterns
  • Flakiness detection: Identifies intermittent failures with 10-90% failure rate patterns
  • Pattern frequency analysis: Tracks failure patterns over time with severity scoring
  • Smart recommendations: Differentiates between infrastructure vs code change investigations
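
The 10-90% band mentioned above can be read as follows: a test that fails in some but not nearly all of its runs is flaky, one that fails almost always is consistently broken, and one that almost never fails is stable. The sketch below encodes that reading. Whether the bounds are inclusive, and the label names, are assumptions rather than the tool's documented behavior.

```python
def classify(failed: int, total: int) -> str:
    """Classify a test by the share of runs in which it failed."""
    if total <= 0:
        raise ValueError("total must be positive")
    rate = failed / total
    if rate < 0.10:
        return "stable"
    if rate <= 0.90:
        # Fails often enough to matter, but not every time.
        return "flaky"
    return "consistently failing"


if __name__ == "__main__":
    for failed, total in ((0, 20), (4, 25), (19, 20)):
        print(f"{failed}/{total}: {classify(failed, total)}")
```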

8. analyze_failure_correlation

Analyze failures across multiple jobs to identify systemic issues and environment-specific patterns.

Parameters:

  • job_pattern (optional): Job pattern or alias to analyze (default: ".*")
  • time_window (optional): Time window for correlation analysis (default: "24h")
  • include_environment_analysis (optional): Include environment-specific failure analysis (default: true)

Enterprise Features:

  • Cross-job correlation: Identifies patterns affecting multiple job types simultaneously
  • Environment analysis: ARM64 vs x86, Kubernetes version-specific failures
  • Resource issue detection: CPU, memory, disk-related failure patterns
  • Systemic issue identification: Infrastructure vs application-level problems

9. analyze_quarantine_intelligence

Provide intelligent analysis of quarantined tests with effectiveness scoring and actionable recommendations.

Parameters:

  • scope (optional): Analysis scope - "all", "job", or specific job name (default: "all")
  • include_recommendations (optional): Include quarantine action recommendations (default: true)

Intelligence Features:

  • Effectiveness scoring: Quantifies how well quarantine decisions are working
  • Action recommendations: Remove/extend/investigate with detailed reasoning
  • Status analysis: Active vs stale quarantine identification
  • Impact assessment: How quarantine decisions affect overall CI health

10. assess_failure_impact

Assess the impact and priority of test failures for intelligent triage and resource allocation.

Parameters:

  • failure_data (required): JSON failure data from lane or merge commands
  • context (optional): Context - "pre-release", "development", "production" (default: "development")
  • include_triage_recommendations (optional): Include triage priority recommendations (default: true)

Triage Intelligence:

  • Context-aware prioritization: Different urgency for production vs development
  • Business impact analysis: Critical path vs edge case failure identification
  • Resource allocation: Senior engineer vs standard triage recommendations
  • Priority assignment: Urgent/normal/low with detailed reasoning

11. generate_failure_report

Generate comprehensive failure analysis reports for stakeholders with executive summaries and actionable insights.

Parameters:

  • scope (optional): Report scope - "daily", "weekly", "release", or specific job (default: "daily")
  • format (optional): Report format - "summary", "detailed", "executive" (default: "summary")
  • include_recommendations (optional): Include actionable recommendations (default: true)

Enterprise Reporting:

  • Executive summaries: High-level CI health status for management
  • Key metrics: Overall health, failure rates, critical issue counts
  • Trend analysis: Direction and change percentages over time
  • Actionable items: Prioritized next steps for development teams

12. fetch_job_run_logs

Fetch logs, artifacts, and test results for a specific Prow job run. Essential for deep-dive analysis of specific test failures.

Parameters:

  • job_url (required): Prow job URL (presubmit, batch, or periodic format)
  • include_build_log (optional): Parse and include build log summary (default: true)
  • max_build_log_lines (optional): Maximum lines from end of build log to include (default: 50, max: 500)

Supported URL Formats:

  • Presubmit: https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/<PR>/<job>/<build-id>
  • Periodic: https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/logs/<job>/<build-id>

Capabilities:

  • JUnit parsing: Automatically fetches and parses junit.functest.xml for test failures
  • Build log analysis: Identifies errors, panics, timeouts, and infrastructure issues
  • Artifact discovery: Lists available artifact files in the job's artifacts directory
  • GCS path resolution: Automatically resolves Google Cloud Storage paths from Prow URLs
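
The two URL formats listed above can be taken apart with a pair of patterns. The sketch below also derives a gs:// path on the assumption that /view/gs/<bucket>/<path> maps to gs://<bucket>/<path>, which is the usual Prow convention but is not spelled out in this document. parse_prow_url is a hypothetical helper, and batch URLs are not handled because their format is not shown here.

```python
import re
from urllib.parse import urlparse

_PRESUBMIT = re.compile(
    r"^pr-logs/pull/(?P<org_repo>[^/]+)/(?P<pr>\d+)/(?P<job>[^/]+)/(?P<build_id>\d+)$"
)
_PERIODIC = re.compile(r"^logs/(?P<job>[^/]+)/(?P<build_id>\d+)$")
_MARKER = "/view/gs/"


def parse_prow_url(url: str) -> dict:
    """Split a Prow job URL into job type, bucket, GCS path and components."""
    path = urlparse(url).path
    start = path.find(_MARKER)
    if start == -1:
        raise ValueError(f"not a Prow view URL: {url}")
    # Everything after the marker is "<bucket>/<object path>".
    bucket, _, obj = path[start + len(_MARKER):].strip("/").partition("/")
    for kind, pattern in (("presubmit", _PRESUBMIT), ("periodic", _PERIODIC)):
        match = pattern.match(obj)
        if match:
            info = {"type": kind, "bucket": bucket, "gcs_path": f"gs://{bucket}/{obj}"}
            info.update(match.groupdict())
            return info
    raise ValueError(f"unrecognized job path: {obj}")


if __name__ == "__main__":
    print(parse_prow_url(
        "https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/"
        "kubevirt_kubevirt/15455/pull-kubevirt-unit-test-arm64/1958202806657617920"
    ))
```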

13. search_ci_failures

Search for CI failures across all jobs using the search.ci.kubevirt.io service. Complements search_failure_patterns (which uses ci-health data) with a broader cross-job search.

Parameters:

  • query (required): Search pattern (regex) to find in test names or failure messages
  • max_age (optional): Time range to search (default: "336h" / 14 days)
  • type (optional): Search type — "junit", "bug", "bug+junit", "build-log", "all" (default: "junit")
  • job_filter (optional): Job name filter regex
  • max_matches (optional): Maximum matches per file (default: 100, max: 500)

LLM Integration Examples

The MCP server enables powerful AI-assisted workflows:

# Example prompts you can use with LLM clients:
# "Analyze recent failures in pull-kubevirt-e2e-k8s-1.32-sig-compute"
# "Compare this week's failure rate to last week for unit tests"  
# "Find all migration-related failures across all jobs"
# "Generate a release health report for all SIG areas"
# "What are the most critical test failures right now?"
# "Search for timeout-related failures in network tests"

# Enhanced failure source context analysis:
# "Parse this junit failure and show me the GitHub source code where it failed"
# "Extract all file references from this test failure and generate GitHub URLs"
# "Analyze this multi-file failure and provide the complete debugging context"
# "Given this failure text, fetch the source code and explain what might be wrong"
# "Cross-reference this failure with the actual source code to suggest a fix"

# Advanced trend and correlation analysis (NEW):
# "Analyze failure trends for job X over the last 30 days and detect flaky tests"
# "Identify systemic issues affecting multiple jobs in the compute category"
# "Correlate failures across ARM64 and x86 jobs to find environment-specific problems"
# "Generate a comprehensive quarantine intelligence report with effectiveness scores"
# "Assess the business impact of current test failures and prioritize for triage"
# "Create an executive summary of CI health for the weekly engineering meeting"
# "Detect infrastructure vs application-level problems in recent failures"
# "Recommend which quarantined tests should be removed or extended based on effectiveness"

Data Format

All MCP tools return structured JSON data optimized for LLM consumption, including:

  • Health status: "critical", "unhealthy", "unstable", "acceptable", "healthy"
  • Failure patterns: Categorized by compute, network, storage, migration, operator
  • Statistics: Failure rates, run counts, unique test counts
  • Trends: Improvement/regression detection, stability analysis, flakiness scoring
  • Recommendations: Actionable next steps based on failure patterns and context
  • Time analysis: Duration calculations, period comparisons, trend directions
  • Potential causes: Inferred based on test names and failure patterns
  • Enhanced source context: GitHub URLs, file paths, line numbers, error messages, and stack traces
  • Multi-file debugging: Complete context chain with cross-file failure references
  • Advanced analytics: Correlation analysis, quarantine intelligence, impact assessment
  • Enterprise reporting: Executive summaries, key metrics, trend analysis, actionable items
  • Context-aware prioritization: Business impact, triage recommendations, resource allocation
  • Environment analysis: Architecture-specific failures, Kubernetes version patterns
  • Flakiness detection: 10-90% failure rate patterns with frequency analysis
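
To make the 10-90% flakiness band concrete, here is a small Go sketch of the kind of check it implies. The function name, the inclusive boundaries, and the treatment of zero runs are assumptions; the tool's actual scoring may be more involved.

```go
package main

import "fmt"

// isFlaky reports whether a test's failure rate falls within 10-90%.
// Assumptions: both boundaries are inclusive, and a test with no runs is
// never considered flaky. Integer arithmetic avoids float rounding issues.
func isFlaky(failures, runs int) bool {
	if runs <= 0 || failures < 0 || failures > runs {
		return false
	}
	return failures*10 >= runs && failures*10 <= runs*9
}

func main() {
	// Made-up sample counts: (failures, runs).
	samples := [][2]int{{0, 20}, {3, 20}, {20, 20}}
	for _, s := range samples {
		fmt.Printf("%d/%d failures -> flaky: %v\n", s[0], s[1], isFlaky(s[0], s[1]))
	}
}
```

A test that always fails or always passes falls outside the band, which separates consistently broken tests from intermittent ones.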

MCP Command Flags

  • --port, -p: Port to listen on (0 for stdio, default: 0)
  • --host, -H: Host to bind to (default: "localhost")
  • --stdio, -s: Use stdio transport (default: true)
  • --debug, -d: Enable debug logging to see tool information

Integration with Claude CLI/Desktop

To use this MCP server with Claude CLI or Claude Desktop, register it in the client's MCP settings:

Claude CLI Configuration

Add to your ~/.config/claude-cli/mcp_servers.json:

{
  "kubevirt-healthcheck": {
    "command": "/path/to/healthcheck",
    "args": ["mcp"],
    "env": {}
  }
}

Then enable it with:

claude mcp install kubevirt-healthcheck

Claude Desktop Configuration

Add to your Claude Desktop MCP settings (typically ~/Library/Application Support/Claude/claude_desktop_config.json on macOS):

{
  "mcpServers": {
    "kubevirt-healthcheck": {
      "command": "/path/to/healthcheck",
      "args": ["mcp"]
    }
  }
}

Usage with Claude

Once configured, you can use natural language prompts with Claude:

  • "Analyze recent failures in pull-kubevirt-e2e-k8s-1.32-sig-compute and tell me what's broken"
  • "Compare failure rates between this week and last week for unit tests"
  • "Generate a comprehensive CI health report for all KubeVirt job categories"
  • "Find all timeout-related failures across network tests and suggest fixes"
  • "What are the top 3 most critical test failures I should prioritize?"
  • "Search for migration failures and group them by potential root cause"

The MCP server gives Claude direct access to KubeVirt CI data, so it can analyze failures, recognize recurring patterns, and recommend next steps that would be slow to work out through manual investigation.
