epic: AI agent optimization — 23 workstreams

**Status update (2026-05-12 UTC):** Refreshed against current GitHub issue state. #190, #198, #199, #215, and #216 are now closed, so the remaining open items should be prioritized from current code evidence and telemetry.

Tracks the optimization workstreams identified in the April 2026 AI agent audit and best-practices research; 29 child issues are listed below. Sources are cited inline in each child issue.

**Sequencing update (2026-04-17)**: telemetry + eval move to the front. Don't optimize prompts further without a way to measure. Build the scoreboard first.

## TL;DR — expected impact

- **Cost**: 60-80% input-token reduction on active chat sessions once cache restructuring ships; a further 60-80% reduction on chitchat turns once intent routing ships.
- **Latency**: cache hits cut TTFT by up to 85%; hedged requests eliminate failover penalty during outages.
- **Quality**: production agent-eval (tool choice, args, end-state, multi-trial) + A/B framework replaces ship-blind prompt edits with measurable iteration.
- **Reliability**: the circuit breaker absorbs the reality that LLM providers deliver only 99-99.5% uptime; the budget cap protects BYOK users from runaway loops.

## Phase 0 — build the scoreboard first

Ship nothing downstream until we can measure what it did.

- [x] #195 — observability(ai): experimental_telemetry + per-run summary row (TTFT, route, tool sequence, retries, cache hits, costs)
- [x] #201 — quality(ai): production agent-eval framework (5 dimensions: tool choice, args, transcript, end-state, multi-trial)
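To make the per-run summary row from #195 concrete, here is a minimal sketch of what such a row could look like and how it might be folded from raw run events. The `RunEvent` and `RunSummary` shapes and the `summarizeRun` helper are hypothetical names for illustration, not the schema that shipped.

```typescript
// Hypothetical event emitted while a single agent run executes.
interface RunEvent {
  kind: "first_token" | "tool_call" | "retry" | "finish";
  atMs: number;               // milliseconds since the request started
  toolName?: string;          // set on tool_call events
  inputTokens?: number;       // set on finish events
  cachedInputTokens?: number; // set on finish events
  costUsd?: number;           // set on finish events
}

// Hypothetical summary row: one per run, covering the fields named in #195.
interface RunSummary {
  route: string;
  ttftMs: number | null;
  toolSequence: string[];
  retries: number;
  cacheHitRatio: number; // cached input tokens / total input tokens
  costUsd: number;
}

function summarizeRun(route: string, events: RunEvent[]): RunSummary {
  const summary: RunSummary = {
    route,
    ttftMs: null,
    toolSequence: [],
    retries: 0,
    cacheHitRatio: 0,
    costUsd: 0,
  };
  let input = 0;
  let cached = 0;
  for (const e of events) {
    switch (e.kind) {
      case "first_token":
        // Only the first token timestamp counts as TTFT.
        if (summary.ttftMs === null) summary.ttftMs = e.atMs;
        break;
      case "tool_call":
        if (e.toolName) summary.toolSequence.push(e.toolName);
        break;
      case "retry":
        summary.retries += 1;
        break;
      case "finish":
        input += e.inputTokens ?? 0;
        cached += e.cachedInputTokens ?? 0;
        summary.costUsd += e.costUsd ?? 0;
        break;
    }
  }
  summary.cacheHitRatio = input > 0 ? cached / input : 0;
  return summary;
}
```

Keeping the row flat (one record per run) is a design assumption here; it makes the later phases easy to compare with a simple before/after query.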

## Phase 1 — architectural reshape

Biggest structural wins. Do these before touching prompts further.

- [x] #212 — perf(ai): persist coach_state as a table, not rebuilt per turn
- [x] #213 — perf(ai): remove exercise catalog from system prompt (tools-only)
- [x] #214 — perf(ai): indexed movement search (replace full-table scan)
- [x] #192 — perf(ai): move training snapshot out of system field (blocks #193)
- [x] #215 — perf(ai): provider-aware full-prompt token budget
- [x] #216 — chore(ai): explicit model policy + preview-vs-stable rule
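As an illustration of the idea behind #214 (look up movements through an index instead of scanning every row), here is a small in-memory inverted-index sketch. The real change presumably lives in the database layer; the `Movement` shape and the function names below are hypothetical.

```typescript
// Hypothetical movement record; the real table likely has more columns.
interface Movement {
  id: number;
  name: string;
}

// Lowercase and split on anything that is not a letter or digit.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Build token -> set of movement ids once, instead of scanning per query.
function buildIndex(movements: Movement[]): Map<string, Set<number>> {
  const index = new Map<string, Set<number>>();
  for (const m of movements) {
    for (const token of tokenize(m.name)) {
      let ids = index.get(token);
      if (!ids) {
        ids = new Set<number>();
        index.set(token, ids);
      }
      ids.add(m.id);
    }
  }
  return index;
}

// Return ids of movements whose names contain every query token.
function searchMovements(index: Map<string, Set<number>>, query: string): number[] {
  const tokens = tokenize(query);
  if (tokens.length === 0) return [];
  const sets = tokens.map((t) => index.get(t) ?? new Set<number>());
  const [first, ...rest] = sets;
  return Array.from(first)
    .filter((id) => rest.every((s) => s.has(id)))
    .sort((a, b) => a - b);
}
```

Each query touches only the posting sets for its own tokens, which is the property that matters once the catalog leaves the system prompt and search call volume rises.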

## Phase 2 — cache restructuring

Unblocked once #192 and #216 land.

- [x] #193 — perf(ai): stack 4 cache breakpoints
- [ ] #194 — perf(ai): enable Gemini explicit context caching
- [x] #187 — perf(ai): pin Anthropic effort parameter explicitly
- [x] #188 — perf(ai): discount cached tokens in DAILY_TOKEN_BUDGET
- [x] #198 — perf(ai): track usage.iterations[] for compaction-aware billing
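The cached-token discount in #188 can be sketched as a weighted token count. This is a hedged illustration only: the `TurnUsage` shape, the budget value, and the 10% weight for cached tokens are assumptions, and the sketch assumes `inputTokens` already includes the cached portion.

```typescript
// Hypothetical per-turn usage; assumes inputTokens includes cached tokens.
interface TurnUsage {
  inputTokens: number;
  cachedInputTokens: number;
  outputTokens: number;
}

const DAILY_TOKEN_BUDGET = 200_000; // illustrative value, not the real setting
const CACHED_WEIGHT_PERCENT = 10;   // assumption: a cached token counts as 10% of a fresh one

// Budget units consumed by one turn, with cached input discounted.
function budgetUnits(u: TurnUsage, cachedWeightPercent = CACHED_WEIGHT_PERCENT): number {
  const cached = Math.min(u.cachedInputTokens, u.inputTokens);
  const uncached = u.inputTokens - cached;
  // Integer math avoids floating-point drift; round the discounted part up.
  const discounted = Math.ceil((cached * cachedWeightPercent) / 100);
  return uncached + discounted + u.outputTokens;
}

// Units left for the day after the given turns.
function remainingBudget(turns: TurnUsage[], budget = DAILY_TOKEN_BUDGET): number {
  const spent = turns.reduce((sum, t) => sum + budgetUnits(t), 0);
  return Math.max(0, budget - spent);
}
```

Without the discount, a long session with a high cache hit rate would exhaust the daily budget far sooner than its actual cost justifies.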

## Phase 3 — routing + tool surface

With the scoreboard live, now safe to tune.

- [x] #190 — perf(ai): intent-based model routing (consumes #216 policy)
- [ ] #189 — perf(ai): tool description audit + input_examples
- [ ] #196 — perf(ai): Tool Search Tool + defer_loading (85% context reduction)
- [ ] #210 — perf(ai): prepareStep dynamic tool subset per step
- [ ] #204 — perf(ai): tool consolidation audit (strength/goals/injuries triplets)
- [ ] #205 — perf(ai): adaptive thinking via prepareStep (effort=high on program_week)
- [ ] #207 — perf(ai): conditional system-prompt sections
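A rough sketch of the intent-based routing in #190: classify the message, then look up a model tier in a policy table (the role #216 plays). The intents, keyword heuristics, and tier names here are hypothetical; a real router might use a small classifier model rather than regular expressions.

```typescript
type Intent = "chitchat" | "coaching" | "program_week";
type ModelTier = "small" | "standard" | "reasoning";

// Hypothetical policy table mapping each intent to a model tier.
const MODEL_POLICY: Record<Intent, ModelTier> = {
  chitchat: "small",
  coaching: "standard",
  program_week: "reasoning",
};

// Keyword heuristic for illustration only.
function classifyIntent(message: string): Intent {
  const text = message.toLowerCase();
  if (/\b(program|plan|mesocycle)\b/.test(text) && /\bweek\b/.test(text)) {
    return "program_week";
  }
  if (/\b(squats?|bench|deadlifts?|sets?|reps?|injur\w*|workouts?|goals?)\b/.test(text)) {
    return "coaching";
  }
  // Anything unrecognized is treated as chitchat and sent to the cheapest tier.
  return "chitchat";
}

function routeModel(message: string): ModelTier {
  return MODEL_POLICY[classifyIntent(message)];
}
```

Defaulting unknown messages to the cheapest tier is one possible choice; the telemetry from Phase 0 is what would show whether that default hurts answer quality.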

## Phase 4 — reliability

- [x] #191 — reliability(ai): StopCondition budget cap for BYOK users
- [x] #199 — reliability(ai): per-provider circuit breaker with error + cost triggers
- [ ] #200 — reliability(ai): hedged requests for latency-bound chat path
- [ ] #197 — perf(ai): enable Anthropic context management beta
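The per-provider circuit breaker in #199 trips on either errors or cost. The sketch below shows one way that could work with a sliding window and a cooldown. All names and thresholds are hypothetical, and it omits the half-open probing state a production breaker would normally have.

```typescript
interface BreakerOptions {
  maxErrors: number;   // errors tolerated inside the window
  maxCostUsd: number;  // spend tolerated inside the window
  windowMs: number;    // sliding window length
  cooldownMs: number;  // how long the breaker stays open
}

interface CallRecord {
  atMs: number;
  error: boolean;
  costUsd: number;
}

class ProviderBreaker {
  private readonly opts: BreakerOptions;
  private readonly now: () => number;
  private events: CallRecord[] = [];
  private openedAtMs: number | null = null;

  // The clock is injectable so the breaker can be tested deterministically.
  constructor(opts: BreakerOptions, now: () => number = () => Date.now()) {
    this.opts = opts;
    this.now = now;
  }

  // Record the outcome of one provider call and trip if a trigger fires.
  record(result: { error: boolean; costUsd: number }): void {
    const t = this.now();
    this.events.push({ atMs: t, error: result.error, costUsd: result.costUsd });
    this.events = this.events.filter((e) => t - e.atMs <= this.opts.windowMs);
    const errors = this.events.filter((e) => e.error).length;
    const cost = this.events.reduce((sum, e) => sum + e.costUsd, 0);
    if (errors >= this.opts.maxErrors || cost >= this.opts.maxCostUsd) {
      this.openedAtMs = t;
    }
  }

  // True when calls to this provider are currently allowed.
  allowRequest(): boolean {
    if (this.openedAtMs === null) return true;
    if (this.now() - this.openedAtMs >= this.opts.cooldownMs) {
      // Cooldown elapsed: close the breaker and start with a clean window.
      this.openedAtMs = null;
      this.events = [];
      return true;
    }
    return false;
  }
}

// Pick the first provider in preference order whose breaker is closed.
function pickProvider(order: string[], breakers: Map<string, ProviderBreaker>): string | null {
  for (const name of order) {
    const breaker = breakers.get(name);
    if (!breaker || breaker.allowRequest()) return name;
  }
  return null;
}
```

The cost trigger is what separates this from a plain error-rate breaker: a provider that responds successfully but burns money in a loop is taken out of rotation too.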

## Phase 5 — quality systems + memory

- [ ] #202 — feat(ai): long-term user memory layer (preferences pilot)
- [ ] #203 — quality(ai): prompt A/B testing via PostHog feature flags
- [ ] #208 — perf(ai): precompute workout performance/weekly volume off-path

## Parking lot — do after data

- [ ] #206 — observability(ai): measure searchOtherThreads hit rate (then tune or disable)
- [ ] #209 — perf(ai): move to 1-hour cache TTL after measuring session distribution

## Key dependencies

- #201 + #195 gate everything else (Phase 0 blocks Phase 1+)
- #193 blocked by #192 (snapshot must move before history can be cached)
- #198 pairs with #197 (iterations billing must land with context management)
- #203 depends on #201 + #195 (A/B needs eval + per-user telemetry)
- #209 depends on #195 (TTL choice needs session-length data)
- #214 pairs with #213 (catalog removal increases search_exercises call volume)
- #215 depends on #195 (to benchmark the 3 retrieval modes)

## References

- Anthropic: [Effective context engineering](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents), [Writing effective tools](https://www.anthropic.com/engineering/writing-tools-for-agents), [Advanced tool use](https://www.anthropic.com/engineering/advanced-tool-use), [Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)
- Case studies: [ProjectDiscovery 59% cost reduction](https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching), ["Don't Break the Cache" paper](https://arxiv.org/abs/2601.06007)
- Agent-eval papers: [τ-bench](https://arxiv.org/abs/2406.12045), [AgentBench](https://arxiv.org/abs/2308.03688), [Reflexion](https://arxiv.org/abs/2303.11366), [Lost in the Middle](https://arxiv.org/abs/2307.03172)
- Vercel AI SDK: [Loop Control](https://ai-sdk.dev/docs/agents/loop-control), [Telemetry](https://ai-sdk.dev/docs/ai-sdk-core/telemetry)
- Reliability: [Tian Pan: LLM API resilience](https://tianpan.co/blog/2026-03-11-llm-api-resilience-production), [Portkey: circuit breakers](https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/)
- Eval: [Braintrust agent evaluation](https://www.braintrust.dev/articles/ai-agent-evaluation-framework), [PostHog LLM Analytics](https://posthog.com/llm-analytics), [OpenAI: Evaluate agent workflows](https://developers.openai.com/api/docs/guides/agent-evals)

