runxhq · bbbbzzzzcc-afk · Jul 1, 2026 · Jul 1, 2026 · Jul 1, 2026 · Jul 1, 2026
diff --git a/skills/flaky-test-judge/REPORT.md b/skills/flaky-test-judge/REPORT.md
@@ -0,0 +1,11 @@
+# Flaky Test Judge delivery report
+
+- Published `bbbbzzzzcc-afk/flaky-test-judge@sha-cd2fdb45ec9e` to the public Runx registry and confirmed the immutable package can be read back and installed into an empty directory.
+- Opened upstream pull request [runxhq/runx#197](https://github.com/runxhq/runx/pull/197) with the skill definition, operator documentation, and exactly two harness cases.
+- Exercised the published package against 20 supplied runs: 13 passes, six explicit timeout failures, and one assertion failure. The resulting pass rate is 65%.
+- The reviewed disposition is `quarantine` with confidence `0.96`, a seven-day ceiling, an explicit pytest exclusion marker, and downstream handoff target `issue-to-pr`.
+- The empty-history boundary refuses to invent evidence. The harness leaves it at `needs_agent`, and the independent fixture requires a `missing-evidence` stop, no quarantine object, and no dispatch target.
+- The skill is evidence-only. It does not edit repositories, change CI, disable a test, open an issue, create a pull request, or merge code.
+- The dogfood run emitted receipt `runx:receipt:c4bb972e74c43278b503eba0cc076b00264582addfce9a72eeadac8674f92550`.
+- Independent verification of that receipt returned `valid: true` in production mode with no findings.
+- Raw machine-readable evidence is in [`evidence.json`](./evidence.json), and the verification matrix is in [`verification.json`](./verification.json).
diff --git a/skills/flaky-test-judge/SKILL.md b/skills/flaky-test-judge/SKILL.md
@@ -0,0 +1,135 @@
+---
+name: flaky-test-judge
+description: Judge supplied test-run history and emit a bounded flaky-test disposition without mutating a repository.
+runx:
+  category: ops
+---
+
+# Flaky Test Judge
+
+Decide whether one test should be quarantined temporarily, treated as
+environmental noise, fixed as a real bug, or escalated for human review.
+
+The skill reads only the supplied run history, test metadata, and release
+policy. It emits a `runx.flaky.test_triage.v1` handoff packet. It never edits a
+test, changes CI configuration, opens an issue, or starts a pull request.
+
+## Inputs
+
+- `test_run_history`: ordered `runs` containing `status`, `duration`, and
+  `logs`, plus the declared `sample_size`, `window_start`, and `window_end`.
+- `test_metadata`: `test_path`, `suite`, and optional `tags`.
+- `release_policy`: `flake_threshold_pct`, `min_sample_size`, and
+  `max_quarantine_days`.
+
+## Evidence calculation
+
+1. Count only supplied runs.
+2. Verify `sample_size` equals the number of supplied runs.
+3. Compute pass rate as `passing runs / total runs * 100`.
+4. Count failure modes only when their evidence is visible in supplied logs.
+5. Cite run indexes or exact log fragments for every failure-mode claim.
+
+Do not infer hidden retries, infrastructure incidents, or product defects.
+
+## Disposition rules
+
+- **Stop:** no run history, a sample-size mismatch, or fewer runs than
+  `min_sample_size`. Use a `missing-evidence` reason and emit no quarantine
+  packet.
+- **Keep enabled:** pass rate is at or above `flake_threshold_pct`. Refuse to
+  quarantine.
+- **Human review:** evidence is near the threshold, failure modes conflict, or
+  logs do not distinguish environmental noise from a real break.
+- **Fix now:** supplied evidence consistently identifies a reproducible product
+  or test defect rather than intermittent environmental behavior.
+- **Ignore environmental noise:** failures are clearly external and policy
+  allows no repository action.
+- **Quarantine:** pass rate is below the policy threshold, the sample is
+  sufficient, intermittent failure evidence is explicit, and a bounded
+  temporary exclusion is justified.
+
+Confidence must reflect the supplied evidence. A classification cannot exceed
+the strength of its cited logs.
+
+## Quarantine packet
+
+When quarantine is justified, include:
+
+```yaml
+schema: runx.flaky.test_triage.v1
+disposition:
+  decision: quarantine
+  confidence: 0.96
+  reason: 65% pass rate over 20 runs; six of seven failures are explicit timeouts.
+quarantine:
+  test_path: tests/integration/test_checkout.py::test_retries_expired_session
+  duration_days: 7
+  exclusion_marker: '@pytest.mark.skip(reason="quarantined: tracked flaky timeout")'
+  fix_template:
+    thread_title: Fix flaky checkout retry timeout
+    thread_body: >
+      Temporarily exclude the named test, preserve the cited run evidence, and
+      remove the exclusion when the timeout cause is fixed.
+    target_repo: example/project
+    base: main
+escalation:
+  required: false
+  lane: human_review
+  reason: ""
+dispatch_target: issue-to-pr
+```
+
+`duration_days` must be positive and must never exceed
+`release_policy.max_quarantine_days`. The packet names `issue-to-pr` only as a
+downstream dispatch target. A separate governed run must map the fix template
+to `thread_title`, `thread_body`, `target_repo`, and `base`.
+
+The downstream run drafts the change. A human merge gate is the only path to a
+live test disable.
+
+## Refusal and escalation
+
+- Refuse quarantine when the pass rate is at or above the threshold.
+- Stop when no runs are supplied or the sample is below policy minimum.
+- Escalate near-threshold or conflicting evidence.
+- Never invent a failure mode absent from the logs.
+- Never exceed the quarantine duration ceiling.
+- Never mutate a repository or consume the handoff as an effect.
+- Emit no mint, `AttenuationRequest`, data-store operation, or
+  `operational_proposal.v1`.
+
+## Evidence
+
+Record:
+
+- disposition decision, confidence, and reason;
+- pass rate, run count, and sample window;
+- failure-mode counts and cited run evidence;
+- quarantine duration and exclusion marker when present;
+- refusal or escalation reason;
+- dispatch target;
+- harness case names;
+- sealed receipt id.
+
+## Local verification
+
+```bash
+runx --version
+runx harness ./skills/flaky-test-judge
+```
+
+The inline harness contains exactly two cases:
+
+- `quarantine_justified`: 13 passes and 7 failures over 20 runs, including 6
+  explicit timeouts, produce a bounded quarantine packet.
+- `missing_run_history`: no runs produce a sealed missing-evidence stop with no
+  quarantine packet.
+
+After publishing:
+
+```bash
+runx add <owner>/flaky-test-judge@0.1.0
+runx skill <owner>/flaky-test-judge@0.1.0 --json
+runx verify --receipt <receipt.json> --json
+```
diff --git a/skills/flaky-test-judge/X.yaml b/skills/flaky-test-judge/X.yaml
@@ -0,0 +1,174 @@
+skill: flaky-test-judge
+version: "0.1.0"
+
+catalog:
+  kind: skill
+  audience: operator
+  visibility: public
+  role: canonical
+
+harness:
+  cases:
+    - name: quarantine_justified
+      inputs:
+        test_run_history:
+          sample_size: 20
+          window_start: "2026-06-23T00:00:00Z"
+          window_end: "2026-06-30T00:00:00Z"
+          runs:
+            - { status: passed, duration: 1.1, logs: "passed" }
+            - { status: passed, duration: 1.0, logs: "passed" }
+            - { status: failed, duration: 30.0, logs: "TimeoutError: checkout response exceeded 30s" }
+            - { status: passed, duration: 1.2, logs: "passed" }
+            - { status: passed, duration: 1.1, logs: "passed" }
+            - { status: failed, duration: 30.0, logs: "TimeoutError: checkout response exceeded 30s" }
+            - { status: passed, duration: 1.0, logs: "passed" }
+            - { status: passed, duration: 1.3, logs: "passed" }
+            - { status: failed, duration: 30.0, logs: "TimeoutError: checkout response exceeded 30s" }
+            - { status: passed, duration: 1.1, logs: "passed" }
+            - { status: passed, duration: 1.0, logs: "passed" }
+            - { status: failed, duration: 30.0, logs: "TimeoutError: checkout response exceeded 30s" }
+            - { status: passed, duration: 1.2, logs: "passed" }
+            - { status: passed, duration: 1.1, logs: "passed" }
+            - { status: failed, duration: 30.0, logs: "TimeoutError: checkout response exceeded 30s" }
+            - { status: passed, duration: 1.0, logs: "passed" }
+            - { status: passed, duration: 1.3, logs: "passed" }
+            - { status: failed, duration: 30.0, logs: "TimeoutError: checkout response exceeded 30s" }
+            - { status: passed, duration: 1.1, logs: "passed" }
+            - { status: failed, duration: 2.4, logs: "AssertionError: retry count was 2, expected 3" }
+        test_metadata:
+          test_path: tests/integration/test_checkout.py::test_retries_expired_session
+          suite: integration
+          tags:
+            - release-blocking
+            - network
+          target_repo: example/project
+          base: main
+        release_policy:
+          flake_threshold_pct: 70
+          min_sample_size: 20
+          max_quarantine_days: 7
+      caller:
+        answers:
+          agent_task.flaky-test-judge.output:
+            schema: runx.flaky.test_triage.v1
+            disposition:
+              decision: quarantine
+              confidence: 0.96
+              reason: 65% pass rate over 20 supplied runs is below the 70% threshold; 6 of 7 failures contain an explicit timeout.
+            quarantine:
+              test_path: tests/integration/test_checkout.py::test_retries_expired_session
+              duration_days: 7
+              exclusion_marker: '@pytest.mark.skip(reason="quarantined: tracked flaky timeout")'
+              fix_template:
+                thread_title: Fix flaky checkout retry timeout
+                thread_body: Temporarily exclude the named test using the supplied marker, preserve the cited timeout evidence, fix the intermittent checkout timeout, and remove the exclusion before closing the issue.
+                target_repo: example/project
+                base: main
+            escalation:
+              required: false
+              lane: human_review
+              reason: ""
+            dispatch_target: issue-to-pr
+            metrics:
+              pass_rate_pct: 65
+              run_count: 20
+              passing_runs: 13
+              failing_runs: 7
+              window_start: "2026-06-23T00:00:00Z"
+              window_end: "2026-06-30T00:00:00Z"
+              failure_modes:
+                timeout: 6
+                assertion: 1
+            evidence_refs:
+              - runs[2,5,8,11,14,17].logs: TimeoutError
+              - runs[19].logs: AssertionError
+              - release_policy.flake_threshold_pct: 70
+              - release_policy.max_quarantine_days: 7
+            stop_conditions: []
+      expect:
+        status: sealed
+        receipt:
+          schema: runx.receipt.v1
+
+    - name: missing_run_history
+      inputs:
+        test_run_history:
+          sample_size: 0
+          window_start: "2026-06-30T00:00:00Z"
+          window_end: "2026-06-30T00:00:00Z"
+          runs: []
+        test_metadata:
+          test_path: tests/integration/test_checkout.py::test_retries_expired_session
+          suite: integration
+          tags:
+            - release-blocking
+          target_repo: example/project
+          base: main
+        release_policy:
+          flake_threshold_pct: 70
+          min_sample_size: 20
+          max_quarantine_days: 7
+      expect:
+        status: needs_agent
+
+runners:
+  judge:
+    default: true
+    type: agent-task
+    agent: reviewer
+    task: flaky-test-judge
+    runx:
+      category: ops
+      post_run:
+        reflect: never
+    outputs:
+      schema: string
+      disposition: object
+      quarantine: object
+      escalation: object
+      dispatch_target: string
+      metrics: object
+      evidence_refs: array
+      stop_conditions: array
+    artifacts:
+      wrap_as: flaky_test_triage
+      packet: runx.flaky.test_triage.v1
+    instructions: >
+      Judge one test only from the supplied test_run_history, test_metadata,
+      and release_policy. Verify sample_size equals the number of runs. Compute
+      pass_rate_pct from supplied statuses and count failure modes only when
+      their evidence is visible in supplied logs. Cite run indexes or exact log
+      fragments. If no history is supplied, the sample-size declaration is
+      inconsistent, or the run count is below min_sample_size, emit a sealed
+      disposition with decision=stop, a reason beginning missing-evidence, no
+      quarantine object, escalation.required=true in lane human_review, no
+      dispatch target, and a matching stop condition. Refuse quarantine when
+      pass rate is at or above flake_threshold_pct. Escalate near-threshold or
+      conflicting evidence. Use decision=fix-now only for a reproducible defect
+      visible in the supplied evidence, and decision=ignore-environmental only
+      for explicitly external noise. Quarantine only when the sufficient sample
+      is below threshold and intermittent failure evidence is explicit. A
+      quarantine packet must name the exact test_path, a positive duration_days
+      no greater than max_quarantine_days, an exclusion_marker, and a
+      fix_template containing thread_title, thread_body, target_repo, and base.
+      Set dispatch_target=issue-to-pr only as a handoff name. This skill never
+      edits a repository, disables a test, opens an issue, starts a PR, or
+      consumes its packet as an effect. The downstream issue-to-pr run is
+      separate and a human merge gate is the only path to a live disable. Emit
+      schema=runx.flaky.test_triage.v1. Do not invent failures, hidden retries,
+      causes, authority, a mint, AttenuationRequest, data-store operation, or
+      operational_proposal.v1.
+    inputs:
+      test_run_history:
+        type: json
+        required: true
+        description: Ordered test runs with status, duration, logs, declared sample size, and sample window.
+      test_metadata:
+        type: json
+        required: true
+        description: Test path, suite, tags, target repository, and base branch.
+      release_policy:
+        type: json
+        required: true
+        description: Pass-rate threshold, minimum sample size, and maximum quarantine duration.