SirBrenton/pitstop-scan

Pitstop Scan

Local-first reliability triage for AI + API execution.

Given execution receipts, Scan produces a ranked hazard pack showing exactly where reliability is leaking:

  • where latency breaches concentrate
  • where failures concentrate (429s, timeouts, auth, 5xx)
  • which guardrails to implement first

Output: a 1-page reliability snapshot + prioritized guardrail plan.

No dashboards. No calls. Just receipts → hazards → guardrails.

Start here: docs/START_HERE.md


Why you’d run this

If you’ve ever said:

  • “We can’t tell (quickly) which call is killing tail latency.”
  • “Retries are spiking / rate limits are killing us.”
  • “It usually works… but the tail is brutal.”

Pitstop Scan turns that into a ranked list of failure + breach signatures you can actually fix.


Quickstart (5 minutes)

1) Install (repo-local venv)

make deps

Option A — You already have receipts (JSONL)

Place your file at:

input/exhaust.jsonl

Then run:

make run

Open the report:

open output/report.md   # macOS

Option B — You only have a raw failure blob

If you don’t have receipts yet, you can start from a copied error/log blob:

python -m pitstop_scan.cli intake --in blob.txt --out output/intake-pack --run-scan

This will produce:

  • raw_scrubbed.txt — scrubbed copy of the input blob
  • artifact.json — normalized boundary classification summary
  • exhaust.jsonl — scan-compatible receipt
  • summary.md — human-readable provisional diagnosis
  • derived/ — scan outputs

Intake is intentionally a minimal v1:

  • heuristic signal extraction
  • conservative scrubbing
  • simple provisional classification

Then open the report:

open output/intake-pack/derived/report.md

No data yet? Run a demo

Run a synthetic demo (creates a tiny input/exhaust.jsonl):

make demo
open output/report.md

If make demo works, you’re ready — replace input/exhaust.jsonl with your real file and rerun make run.


What you get

Running the scan writes:

  • output/report.md — 1-page reliability snapshot + top fixes
  • output/hazards.csv — ranked hazards (highest leverage first)
  • output/signatures.csv — per-signature rollups
  • output/summary.json — machine totals (automation-friendly)
  • output/pitstop_pack_agg.zip — zip of the four derived outputs above

Breach = latency exceeds the per-attempt deadline (budget.deadline_ms) even if status is ok.
(If receipts use legacy budget_ms, Scan treats it as budget.deadline_ms.)
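
The breach rule above can be sketched in a few lines. This is a minimal illustration, not Scan's actual code: it assumes receipts are parsed into nested dicts (so `budget.deadline_ms` means `receipt["budget"]["deadline_ms"]`) and that legacy `budget_ms` sits at the top level of the receipt. The function names are hypothetical.

```python
from typing import Any, Optional


def deadline_ms(receipt: dict) -> Optional[float]:
    """Return the per-attempt deadline, falling back to legacy budget_ms."""
    budget: Any = receipt.get("budget")
    if isinstance(budget, dict) and budget.get("deadline_ms") is not None:
        return float(budget["deadline_ms"])
    # Assumption: the legacy field lives at the top level of the receipt.
    legacy = receipt.get("budget_ms")
    return float(legacy) if legacy is not None else None


def is_breach(receipt: dict) -> bool:
    """True when latency exceeds the deadline, regardless of outcome status."""
    deadline = deadline_ms(receipt)
    latency = (receipt.get("cost") or {}).get("latency_ms")
    if deadline is None or latency is None:
        return False  # not enough data to judge in this sketch
    return float(latency) > deadline
```

Note that a receipt with `outcome.status` of ok still counts as a breach when its latency is over the deadline; the status is deliberately not consulted.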

If you use intake, Scan writes an intermediate pack before generating derived outputs.


Proof (sample hazard pack)

See a sample output pack (derived aggregates only):


Receipts (drop-in invariants)

Small, runnable “truth artifacts”: invariant → policy → tests → verification.

These receipts are the guardrails the scan will often recommend.


Live 429 Classifier

POST a 429 status + headers, get back a classification (WAIT / CAP / STOP) + the first knob to adjust.

curl -s -X POST https://web-production-273d3.up.railway.app/classify \
  -H "Content-Type: application/json" \
  -d '{"status":429,"headers":{"retry-after":"30"},"provider":"anthropic"}'

Three cases:

Input              Classification   Meaning
retry-after: 30    WAIT             Honor header, retry after delay
retry-after: 600   STOP             Quota exhaustion; do not retry
no header          CAP              Reduce concurrency; do not retry immediately
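
For readers who want the same triage locally, the three cases above could be approximated as follows. This is a sketch under stated assumptions, not the hosted classifier's logic: the cutoff `STOP_THRESHOLD_S` is hypothetical (the examples only show that 30 seconds maps to WAIT and 600 seconds to STOP), and an unparseable `retry-after` value is treated like a missing header.

```python
# Hypothetical cutoff: the real boundary between WAIT and STOP is not
# documented; the examples only pin down 30 s -> WAIT and 600 s -> STOP.
STOP_THRESHOLD_S = 300


def classify_429(headers: dict) -> str:
    """Map the Retry-After header of a 429 response to WAIT / CAP / STOP."""
    lowered = {str(k).lower(): v for k, v in headers.items()}
    raw = lowered.get("retry-after")
    if raw is None:
        return "CAP"  # no header: reduce concurrency
    try:
        delay_s = float(raw)
    except (TypeError, ValueError):
        # Assumption: non-numeric values (e.g. the HTTP-date form)
        # are handled like a missing header in this sketch.
        return "CAP"
    return "STOP" if delay_s >= STOP_THRESHOLD_S else "WAIT"
```

Header names are lowercased before lookup because HTTP header names are case-insensitive.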

Send a real 429 blob (status + headers) to brentondwilliams@gmail.com and I'll run it through.


The Execution Contract (v1.0)

Pitstop Scan is a reference implementation of the Pitstop Execution Contract:

execute(intent, budget, policy) -> result, receipt

The contract defines the correctness rules for reliable execution:

  • Budget semantics: per-attempt deadlines, max elapsed, retry caps (attempts include fallbacks)
  • Classification taxonomy: 429 vs 402 vs timeout vs auth (retryable vs terminal)
  • Scope correctness: model vs provider vs credential (cooldown blast radius)
  • Audit-grade receipts: emitted on every attempt (including block/cooldown/preemption)
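
To make the budget semantics concrete, here is a toy loop in the spirit of the contract. It is an illustrative sketch only, and EXECUTION_CONTRACT.md remains the authority: the `Budget` dataclass, the `is_retryable` callable (standing in for `policy`), and the receipt keys are all hypothetical, the function returns a list of receipts (one per attempt) rather than a single receipt, and the per-attempt deadline is only recorded as a breach here, not enforced by preemption.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Budget:
    deadline_ms: float     # per-attempt deadline
    max_elapsed_ms: float  # cap on total wall-clock time
    retry_budget: int      # extra attempts allowed after the first


def execute(intent: Callable[[], Any], budget: Budget,
            is_retryable: Callable[[Exception], bool]) -> Tuple[Any, List[dict]]:
    """Run intent under the budget; emit one receipt per attempt."""
    receipts: List[dict] = []
    start = time.monotonic()
    for attempt in range(1 + budget.retry_budget):
        if (time.monotonic() - start) * 1000 >= budget.max_elapsed_ms:
            break  # total elapsed budget is spent
        t0 = time.monotonic()
        result, error = None, None
        try:
            result = intent()
        except Exception as exc:  # classified below via is_retryable
            error = exc
        latency_ms = (time.monotonic() - t0) * 1000
        receipts.append({
            "attempt": attempt + 1,
            "status": "ok" if error is None else "fail",
            "latency_ms": latency_ms,
            "breach": latency_ms > budget.deadline_ms,
        })
        if error is None:
            return result, receipts
        if not is_retryable(error):
            break  # terminal failure: do not burn more attempts
    return None, receipts
```

The point of the sketch is the ordering of checks: elapsed budget first, then the attempt, then a receipt for every attempt, then the retryable-versus-terminal decision.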

Read the spec:

EXECUTION_CONTRACT.md


Input contract (minimum viable)

Each JSONL line should include (aligned to the Execution Contract):

Required:

  • tool_id, operation, endpoint_norm
  • outcome.status and (if fail) outcome.error_class (optionally outcome.http_status)
  • cost.latency_ms
  • budget.deadline_ms (or legacy budget_ms)
  • execution_id, attempt_id

Recommended (improves ranking + fix guidance):

  • budget.max_elapsed_ms
  • budget.retry_budget
  • decision.action

That’s enough to rank hazards and generate the pack.
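
A quick pre-flight check for the required fields might look like the sketch below. It assumes the dotted names map to nested JSON objects and that a failed attempt carries `outcome.status` equal to `"fail"`; both are assumptions, as are the helper names and every value in the example line.

```python
import json

REQUIRED_PATHS = [
    "tool_id", "operation", "endpoint_norm",
    "outcome.status", "cost.latency_ms",
    "execution_id", "attempt_id",
]

# Hypothetical receipt line; all values are made up for illustration.
EXAMPLE_LINE = (
    '{"tool_id": "llm", "operation": "chat", "endpoint_norm": "POST /v1/messages", '
    '"outcome": {"status": "fail", "error_class": "rate_limit", "http_status": 429}, '
    '"cost": {"latency_ms": 840}, "budget": {"deadline_ms": 2000}, '
    '"execution_id": "ex-1", "attempt_id": "ex-1-a1"}'
)


def get_path(obj, path):
    """Follow a dotted path through nested dicts; None if any step is missing."""
    cur = obj
    for key in path.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur


def missing_fields(receipt: dict) -> list:
    """List required fields that are absent from one parsed receipt."""
    missing = [p for p in REQUIRED_PATHS if get_path(receipt, p) is None]
    # Either budget.deadline_ms or the legacy top-level budget_ms is accepted.
    if get_path(receipt, "budget.deadline_ms") is None and receipt.get("budget_ms") is None:
        missing.append("budget.deadline_ms")
    # error_class is only required when the attempt failed.
    if get_path(receipt, "outcome.status") == "fail" and get_path(receipt, "outcome.error_class") is None:
        missing.append("outcome.error_class")
    return missing
```

Running `missing_fields(json.loads(line))` over each line of input/exhaust.jsonl before `make run` is one way to catch incomplete receipts early.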

Notes on “loss” and cost framing

The report may include a priced loss model to help rank fixes. Treat it as tunable. The primary truth signals are breach rate and tail latency (p95/p99).
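
As a rough cross-check of those two signals, breach rate and tail percentiles can be computed directly from receipts. The sketch below uses the nearest-rank percentile method, which may differ from whatever method the report uses, and assumes nested `cost.latency_ms` and `budget.deadline_ms` fields on every receipt. The function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def tail_summary(receipts):
    """Breach rate plus p95/p99 latency for a list of parsed receipts."""
    if not receipts:
        raise ValueError("no receipts")
    latencies = [r["cost"]["latency_ms"] for r in receipts]
    breaches = sum(
        1 for r in receipts
        if r["cost"]["latency_ms"] > r["budget"]["deadline_ms"]
    )
    return {
        "breach_rate": breaches / len(receipts),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }
```

With small samples, p95 and p99 collapse onto the slowest few attempts, so these numbers are most informative once a signature has a reasonable number of receipts.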


Privacy / safety (hard boundary)

  • Local-only by default. No data leaves your machine.
  • Input should be operational receipts, not payloads.

Receipts MUST NOT include:

  • prompts, message content, tool payload bodies, response bodies
  • headers, tokens, API keys, cookies
  • raw URLs or query strings (use endpoint_norm)
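
One way to derive an `endpoint_norm` value from a raw URL before writing a receipt is sketched below. The output shape (`METHOD host/path` with `{id}` placeholders) is an assumption for illustration, since the exact format Scan expects is not specified here; the helper name and the ID heuristic are hypothetical as well.

```python
import re
from urllib.parse import urlsplit

# Heuristic: treat all-digit segments and long hex/UUID-like segments as IDs.
_ID_SEGMENT = re.compile(r"^(\d+|[0-9a-fA-F-]{16,})$")


def endpoint_norm(method: str, raw_url: str) -> str:
    """Drop query string, fragment, and credentials; mask ID-like path segments."""
    parts = urlsplit(raw_url)
    segments = [
        "{id}" if _ID_SEGMENT.match(seg) else seg
        for seg in parts.path.split("/") if seg
    ]
    # hostname (not netloc) is used so user:password@ prefixes never leak.
    host = parts.hostname or ""
    return f"{method.upper()} {host}/{'/'.join(segments)}"
```

For example, `endpoint_norm("get", "https://api.example.com/v1/users/12345/orders?token=abc")` yields `GET api.example.com/v1/users/{id}/orders`, with the token and the user ID both removed.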

Outputs are derived summaries only (no raw requests/responses).

Raw intake blobs may temporarily contain headers or error text; intake output is scrubbed before generating scan-ready receipts.


Want help applying the fix order?

If you'd like a second set of eyes on your hazard pack, send the derived pack only:

output/pitstop_pack_agg.zip

to brentondwilliams@gmail.com with:

  • Stack: (e.g. Python + OpenAI + Redis)
  • Workflow: (e.g. agent toolchain, ingestion pipeline)
  • Goal: (reduce 429s | reduce p99 | stabilize retries | fix failover)

You’ll receive:

  • a prioritized guardrail plan
  • concrete configuration / policy changes
  • a verification checklist to confirm the fix worked

No calls required. Just artifacts → fix order → verification.


Before / after delta (optional)

If you can share 50–200 redacted receipts (JSONL metadata only), I can re-run the scan after you ship the guardrails and return a before/after delta.

If you prefer to keep receipts local, the derived pack alone is enough to generate a fix order.

About

Local-first reliability snapshot for AI workflows — find retry burn, latency breaches, and what to cap next.
