Status update (2026-05-12 UTC): Refreshed against current GitHub issue state. #190, #198, #199, #215, and #216 are now closed, so the remaining open items should be prioritized from current code evidence and telemetry.
Tracks 28 optimization workstreams identified in the April 2026 AI agent audit + world-class best-practices research. Sources cited inline in each child issue.
Sequencing update (2026-04-17): telemetry + eval move to the front. Don't optimize prompts further without a way to measure. Build the scoreboard first.
TL;DR — expected impact
- Cost: 60-80% input-token reduction on active chat sessions once cache restructuring ships; another 60-80% on chitchat once intent routing ships.
- Latency: cache hits cut TTFT by up to 85%; hedged requests eliminate failover penalty during outages.
- Quality: a production agent eval (tool choice, arguments, end state, multiple trials) plus an A/B framework replaces ship-blind prompt edits with measurable iteration.
- Reliability: a circuit breaker absorbs the 99-99.5% uptime that LLM APIs deliver in practice; a budget cap protects BYOK users from runaway loops.
Phase 0 — build the scoreboard first
Ship nothing downstream until we can measure what it did.
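To make the scoreboard idea concrete, here is a minimal sketch of the four eval dimensions named in the TL;DR (tool choice, args, end-state, multi-trial). All type and function names (`EvalCase`, `Trial`, `scoreTrial`, `summarize`) are hypothetical and not taken from the codebase; the comparison only handles flat JSON-like objects.

```typescript
// Hypothetical shapes for illustration only; not the project's actual types.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface Trial {
  toolCalls: ToolCall[];
  endState: Record<string, unknown>;
}

interface EvalCase {
  expectedCalls: ToolCall[];
  expectedEndState: Record<string, unknown>;
}

interface TrialScore {
  toolChoice: boolean;
  args: boolean;
  endState: boolean;
}

// Key-order-insensitive comparison for flat JSON-like objects.
function sameFlatObject(a: Record<string, unknown>, b: Record<string, unknown>): boolean {
  const keysA = Object.keys(a).sort();
  const keysB = Object.keys(b).sort();
  if (keysA.length !== keysB.length) return false;
  return keysA.every((k, i) => k === keysB[i] && JSON.stringify(a[k]) === JSON.stringify(b[k]));
}

// Score one trial on the three per-trial dimensions.
function scoreTrial(evalCase: EvalCase, trial: Trial): TrialScore {
  const expected = evalCase.expectedCalls;
  const actual = trial.toolCalls;
  const toolChoice =
    expected.length === actual.length &&
    expected.every((call, i) => {
      const got = actual[i];
      return got !== undefined && got.name === call.name;
    });
  const args =
    toolChoice &&
    expected.every((call, i) => {
      const got = actual[i];
      return got !== undefined && sameFlatObject(call.args, got.args);
    });
  const endState = sameFlatObject(evalCase.expectedEndState, trial.endState);
  return { toolChoice, args, endState };
}

interface EvalSummary {
  trials: number;
  passRate: number; // fraction of trials where all three checks pass
  passedEvery: boolean; // true only if every trial passed, which penalizes flakiness
}

// Multi-trial aggregation: run the same case several times and summarize.
function summarize(evalCase: EvalCase, trials: Trial[]): EvalSummary {
  const passes = trials.filter((t) => {
    const s = scoreTrial(evalCase, t);
    return s.toolChoice && s.args && s.endState;
  }).length;
  return {
    trials: trials.length,
    passRate: trials.length === 0 ? 0 : passes / trials.length,
    passedEvery: trials.length > 0 && passes === trials.length,
  };
}
```

Reporting both `passRate` and `passedEvery` is a design choice in this sketch: an agent that succeeds on average but fails one run in five looks fine on the first number and fails the second.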
Phase 1 — architectural reshape
Biggest structural wins. Do these before touching prompts further.
Phase 2 — cache restructuring
Unblocked once #192 lands; #216 is already closed (see the status update above).
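As a rough illustration of what cache restructuring involves: provider prompt caches generally reuse only an exact leading prefix of a request, so anything volatile placed early invalidates everything after it. The sketch below compares two orderings of the same prompt segments. The segment names (`systemPrompt`, `toolSchemas`, `userProfile`, `timestamp`, `userMessage`) are assumptions for illustration, not the project's actual prompt structure.

```typescript
// Hypothetical prompt segments; the real prompt layout may differ.
interface PromptParts {
  systemPrompt: string; // static instructions
  toolSchemas: string; // static tool definitions, serialized deterministically
  userProfile: string; // changes rarely
  timestamp: string; // changes on every request
  userMessage: string; // changes on every request
}

// Cache-hostile layout: the timestamp comes first, so no two requests share a prefix.
function volatileFirst(p: PromptParts): string[] {
  return [p.timestamp, p.systemPrompt, p.toolSchemas, p.userProfile, p.userMessage];
}

// Cache-friendly layout: segments ordered from least to most volatile.
function stableFirst(p: PromptParts): string[] {
  return [p.systemPrompt, p.toolSchemas, p.userProfile, p.timestamp, p.userMessage];
}

// Number of leading segments two requests have in common,
// a rough proxy for how much a prefix cache could reuse.
function sharedPrefixSegments(a: string[], b: string[]): number {
  let n = 0;
  while (n < a.length && n < b.length && a[n] === b[n]) n++;
  return n;
}
```

For two consecutive turns that differ only in `timestamp` and `userMessage`, `stableFirst` shares three leading segments while `volatileFirst` shares none.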
Phase 3 — routing + tool surface
With the scoreboard live, now safe to tune.
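The chitchat saving in the TL;DR depends on intent routing: messages that need no tools skip the full tool surface and long system prompt. A minimal sketch follows, assuming a keyword heuristic; the hint list, the `Route` shape, and the function names are all hypothetical, and a real router might use a small classifier model instead.

```typescript
type Intent = "chitchat" | "task";

interface Route {
  intent: Intent;
  includeTools: boolean;
  systemPrompt: "minimal" | "full";
}

// Hypothetical keyword list, for illustration only.
const TASK_HINTS = ["create", "update", "delete", "schedule", "search", "find", "send"];

// Classify a message as a task if it contains any hint word, else as chitchat.
function classifyIntent(message: string): Intent {
  const text = message.toLowerCase();
  const hit = TASK_HINTS.some((w) => text.indexOf(w) !== -1);
  return hit ? "task" : "chitchat";
}

// Chitchat gets the cheap path: no tool schemas and a minimal system prompt.
function routeFor(message: string): Route {
  const intent = classifyIntent(message);
  return intent === "chitchat"
    ? { intent, includeTools: false, systemPrompt: "minimal" }
    : { intent, includeTools: true, systemPrompt: "full" };
}
```

This sketch defaults to the cheap path when no hint matches, which maximizes savings but risks dropping tools from a real task. Defaulting to the full path on uncertainty is the safer choice, and the scoreboard is what would show which default costs more.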
Phase 4 — reliability
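The TL;DR names a circuit breaker as the answer to imperfect LLM uptime. The sketch below shows the standard three-state pattern (closed, open, half-open) under stated assumptions: the class name, thresholds, and injectable clock are hypothetical and not the project's implementation.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Minimal circuit breaker: opens after N consecutive failures,
// then allows probe requests once a cooldown has elapsed.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  // `now` is injectable so the breaker can be tested without real waiting.
  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Whether a request to the primary provider may be attempted right now.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: let probe requests through
    }
    return this.state !== "open";
  }

  // A success resets the failure count and closes the breaker.
  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  // A failed probe, or reaching the threshold, (re)opens the breaker.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}
```

A caller would check `canRequest()` before each call to the primary provider and go straight to the fallback when it returns `false`, so an outage costs one cooldown interval rather than a timeout per request.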
Phase 5 — quality systems + memory
Parking lot — do after data
Key dependencies
References
- Anthropic: Effective context engineering, Writing effective tools, Advanced tool use, Demystifying evals for AI agents
- Case studies: ProjectDiscovery 59% cost reduction, "Don't Break the Cache" paper
- Agent-eval papers: τ-bench, AgentBench, Reflexion, Lost in the Middle
- Vercel AI SDK: Loop Control, Telemetry
- Reliability: Tian Pan: LLM API resilience, Portkey: circuit breakers
- Eval: Braintrust agent evaluation, PostHog LLM Analytics, OpenAI: Evaluate agent workflows