Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions skills/flaky-test-judge/REPORT.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
# Flaky Test Judge delivery report

- Published `bbbbzzzzcc-afk/flaky-test-judge@sha-cd2fdb45ec9e` to the public Runx registry and confirmed the immutable package can be read back and installed into an empty directory.
- Opened upstream pull request [runxhq/runx#197](https://github.com/runxhq/runx/pull/197) with the skill definition, operator documentation, and exactly two harness cases.
- Exercised the published package against 20 supplied runs: 13 passes, six explicit timeout failures, and one assertion failure. The resulting pass rate is 65%.
- The reviewed disposition is `quarantine` with confidence `0.96`, a seven-day ceiling, an explicit pytest exclusion marker, and downstream handoff target `issue-to-pr`.
- The empty-history boundary refuses to invent evidence. The harness leaves it at `needs_agent`, and the independent fixture requires a `missing-evidence` stop, no quarantine object, and no dispatch target.
- The skill is evidence-only. It does not edit repositories, change CI, disable a test, open an issue, create a pull request, or merge code.
- The dogfood run emitted receipt `runx:receipt:c4bb972e74c43278b503eba0cc076b00264582addfce9a72eeadac8674f92550`.
- Independent verification of that receipt returned `valid: true` in production mode with no findings.
- Raw machine-readable evidence is in [`evidence.json`](./evidence.json), and the verification matrix is in [`verification.json`](./verification.json).
135 changes: 135 additions & 0 deletions skills/flaky-test-judge/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,135 @@
---
name: flaky-test-judge
description: Judge supplied test-run history and emit a bounded flaky-test disposition without mutating a repository.
runx:
category: ops
---

# Flaky Test Judge

Decide whether one test should be quarantined temporarily, treated as
environmental noise, fixed as a real bug, or escalated for human review.

The skill reads only the supplied run history, test metadata, and release
policy. It emits a `runx.flaky.test_triage.v1` handoff packet. It never edits a
test, changes CI configuration, opens an issue, or starts a pull request.

## Inputs

- `test_run_history`: ordered `runs` containing `status`, `duration`, and
`logs`, plus the declared `sample_size`, `window_start`, and `window_end`.
- `test_metadata`: `test_path`, `suite`, and optional `tags`.
- `release_policy`: `flake_threshold_pct`, `min_sample_size`, and
`max_quarantine_days`.

## Evidence calculation

1. Count only supplied runs.
2. Verify `sample_size` equals the number of supplied runs.
3. Compute pass rate as `passing runs / total runs * 100`.
4. Count failure modes only when their evidence is visible in supplied logs.
5. Cite run indexes or exact log fragments for every failure-mode claim.

Do not infer hidden retries, infrastructure incidents, or product defects.

## Disposition rules

- **Stop:** no run history, a sample-size mismatch, or fewer runs than
`min_sample_size`. Use a `missing-evidence` reason and emit no quarantine
packet.
- **Keep enabled:** pass rate is at or above `flake_threshold_pct`. Refuse to
quarantine.
- **Human review:** evidence is near the threshold, failure modes conflict, or
logs do not distinguish environmental noise from a real break.
- **Fix now:** supplied evidence consistently identifies a reproducible product
or test defect rather than intermittent environmental behavior.
- **Ignore environmental noise:** failures are clearly external and policy
allows no repository action.
- **Quarantine:** pass rate is below the policy threshold, the sample is
sufficient, intermittent failure evidence is explicit, and a bounded
temporary exclusion is justified.

Confidence must reflect the supplied evidence. A classification cannot exceed
the strength of its cited logs.

## Quarantine packet

When quarantine is justified, include:

```yaml
schema: runx.flaky.test_triage.v1
disposition:
decision: quarantine
confidence: 0.96
reason: 65% pass rate over 20 runs; six of seven failures are explicit timeouts.
quarantine:
test_path: tests/integration/test_checkout.py::test_retries_expired_session
duration_days: 7
exclusion_marker: '@pytest.mark.skip(reason="quarantined: tracked flaky timeout")'
fix_template:
thread_title: Fix flaky checkout retry timeout
thread_body: >
Temporarily exclude the named test, preserve the cited run evidence, and
remove the exclusion when the timeout cause is fixed.
target_repo: example/project
base: main
escalation:
required: false
lane: human_review
reason: ""
dispatch_target: issue-to-pr
```

`duration_days` must be positive and must never exceed
`release_policy.max_quarantine_days`. The packet names `issue-to-pr` only as a
downstream dispatch target. A separate governed run must map the fix template
to `thread_title`, `thread_body`, `target_repo`, and `base`.

The downstream run drafts the change. A human merge gate is the only path to a
live test disable.

## Refusal and escalation

- Refuse quarantine when the pass rate is at or above the threshold.
- Stop when no runs are supplied or the sample is below policy minimum.
- Escalate near-threshold or conflicting evidence.
- Never invent a failure mode absent from the logs.
- Never exceed the quarantine duration ceiling.
- Never mutate a repository or consume the handoff as an effect.
- Emit no mint, `AttenuationRequest`, data-store operation, or
`operational_proposal.v1`.

## Evidence

Record:

- disposition decision, confidence, and reason;
- pass rate, run count, and sample window;
- failure-mode counts and cited run evidence;
- quarantine duration and exclusion marker when present;
- refusal or escalation reason;
- dispatch target;
- harness case names;
- sealed receipt id.

## Local verification

```bash
runx --version
runx harness ./skills/flaky-test-judge
```

The inline harness contains exactly two cases:

- `quarantine_justified`: 13 passes and 7 failures over 20 runs, including 6
explicit timeouts, produce a bounded quarantine packet.
- `missing_run_history`: no runs produce a sealed missing-evidence stop with no
quarantine packet.

After publishing:

```bash
runx add <owner>/flaky-test-judge@0.1.0
runx skill <owner>/flaky-test-judge@0.1.0 --json
runx verify --receipt <receipt.json> --json
```
174 changes: 174 additions & 0 deletions skills/flaky-test-judge/X.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,174 @@
skill: flaky-test-judge
version: "0.1.0"

catalog:
kind: skill
audience: operator
visibility: public
role: canonical

harness:
cases:
- name: quarantine_justified
inputs:
test_run_history:
sample_size: 20
window_start: "2026-06-23T00:00:00Z"
window_end: "2026-06-30T00:00:00Z"
runs:
- { status: passed, duration: 1.1, logs: "passed" }
- { status: passed, duration: 1.0, logs: "passed" }
- { status: failed, duration: 30.0, logs: "TimeoutError: checkout response exceeded 30s" }
- { status: passed, duration: 1.2, logs: "passed" }
- { status: passed, duration: 1.1, logs: "passed" }
- { status: failed, duration: 30.0, logs: "TimeoutError: checkout response exceeded 30s" }
- { status: passed, duration: 1.0, logs: "passed" }
- { status: passed, duration: 1.3, logs: "passed" }
- { status: failed, duration: 30.0, logs: "TimeoutError: checkout response exceeded 30s" }
- { status: passed, duration: 1.1, logs: "passed" }
- { status: passed, duration: 1.0, logs: "passed" }
- { status: failed, duration: 30.0, logs: "TimeoutError: checkout response exceeded 30s" }
- { status: passed, duration: 1.2, logs: "passed" }
- { status: passed, duration: 1.1, logs: "passed" }
- { status: failed, duration: 30.0, logs: "TimeoutError: checkout response exceeded 30s" }
- { status: passed, duration: 1.0, logs: "passed" }
- { status: passed, duration: 1.3, logs: "passed" }
- { status: failed, duration: 30.0, logs: "TimeoutError: checkout response exceeded 30s" }
- { status: passed, duration: 1.1, logs: "passed" }
- { status: failed, duration: 2.4, logs: "AssertionError: retry count was 2, expected 3" }
test_metadata:
test_path: tests/integration/test_checkout.py::test_retries_expired_session
suite: integration
tags:
- release-blocking
- network
target_repo: example/project
base: main
release_policy:
flake_threshold_pct: 70
min_sample_size: 20
max_quarantine_days: 7
caller:
answers:
agent_task.flaky-test-judge.output:
schema: runx.flaky.test_triage.v1
disposition:
decision: quarantine
confidence: 0.96
reason: 65% pass rate over 20 supplied runs is below the 70% threshold; 6 of 7 failures contain an explicit timeout.
quarantine:
test_path: tests/integration/test_checkout.py::test_retries_expired_session
duration_days: 7
exclusion_marker: '@pytest.mark.skip(reason="quarantined: tracked flaky timeout")'
fix_template:
thread_title: Fix flaky checkout retry timeout
thread_body: Temporarily exclude the named test using the supplied marker, preserve the cited timeout evidence, fix the intermittent checkout timeout, and remove the exclusion before closing the issue.
target_repo: example/project
base: main
escalation:
required: false
lane: human_review
reason: ""
dispatch_target: issue-to-pr
metrics:
pass_rate_pct: 65
run_count: 20
passing_runs: 13
failing_runs: 7
window_start: "2026-06-23T00:00:00Z"
window_end: "2026-06-30T00:00:00Z"
failure_modes:
timeout: 6
assertion: 1
evidence_refs:
- runs[2,5,8,11,14,17].logs: TimeoutError
- runs[19].logs: AssertionError
- release_policy.flake_threshold_pct: 70
- release_policy.max_quarantine_days: 7
stop_conditions: []
expect:
status: sealed
receipt:
schema: runx.receipt.v1

- name: missing_run_history
inputs:
test_run_history:
sample_size: 0
window_start: "2026-06-30T00:00:00Z"
window_end: "2026-06-30T00:00:00Z"
runs: []
test_metadata:
test_path: tests/integration/test_checkout.py::test_retries_expired_session
suite: integration
tags:
- release-blocking
target_repo: example/project
base: main
release_policy:
flake_threshold_pct: 70
min_sample_size: 20
max_quarantine_days: 7
expect:
status: needs_agent

runners:
judge:
default: true
type: agent-task
agent: reviewer
task: flaky-test-judge
runx:
category: ops
post_run:
reflect: never
outputs:
schema: string
disposition: object
quarantine: object
escalation: object
dispatch_target: string
metrics: object
evidence_refs: array
stop_conditions: array
artifacts:
wrap_as: flaky_test_triage
packet: runx.flaky.test_triage.v1
instructions: >
Judge one test only from the supplied test_run_history, test_metadata,
and release_policy. Verify sample_size equals the number of runs. Compute
pass_rate_pct from supplied statuses and count failure modes only when
their evidence is visible in supplied logs. Cite run indexes or exact log
fragments. If no history is supplied, the sample-size declaration is
inconsistent, or the run count is below min_sample_size, emit a sealed
disposition with decision=stop, a reason beginning missing-evidence, no
quarantine object, escalation.required=true in lane human_review, no
dispatch target, and a matching stop condition. Refuse quarantine when
pass rate is at or above flake_threshold_pct. Escalate near-threshold or
conflicting evidence. Use decision=fix-now only for a reproducible defect
visible in the supplied evidence, and decision=ignore-environmental only
for explicitly external noise. Quarantine only when the sufficient sample
is below threshold and intermittent failure evidence is explicit. A
quarantine packet must name the exact test_path, a positive duration_days
no greater than max_quarantine_days, an exclusion_marker, and a
fix_template containing thread_title, thread_body, target_repo, and base.
Set dispatch_target=issue-to-pr only as a handoff name. This skill never
edits a repository, disables a test, opens an issue, starts a PR, or
consumes its packet as an effect. The downstream issue-to-pr run is
separate and a human merge gate is the only path to a live disable. Emit
schema=runx.flaky.test_triage.v1. Do not invent failures, hidden retries,
causes, authority, a mint, AttenuationRequest, data-store operation, or
operational_proposal.v1.
inputs:
test_run_history:
type: json
required: true
description: Ordered test runs with status, duration, logs, declared sample size, and sample window.
test_metadata:
type: json
required: true
description: Test path, suite, tags, target repository, and base branch.
release_policy:
type: json
required: true
description: Pass-rate threshold, minimum sample size, and maximum quarantine duration.
Loading