One line in. A founder-grade, evidence-grounded startup package out.
An organization of AI subagents — CEO → Research → Product → Architecture → Execution → Presentation — that turns "Build an AI SaaS for resume screening" into market research, a cited PRD, a real OpenAPI spec, a sprint plan, and an investor pitch. The output is not code — it's the execution package a founder would pay an agency for.
Most "agent" demos fake tool diversity with a row of near-identical LLM wrappers. APS doesn't. The model genuinely chooses between 20+ distinct live sources — GitHub issues, Hacker News, Reddit, Stack Exchange, arXiv, package registries, Google Trends — to gather real, clickable evidence, then a chain of specialist subagents composes that evidence into a typed, schema-valid execution package. Every claim traces back to a source. Every run streams live. It reproduces on free tiers with your own keys.
And it doesn't stop at documents: APS scores the idea (0–10, grounded), debates itself to a Build/Pivot/Don't verdict, explains every feature back to its evidence, and creates a real GitHub repo with issues and milestones. It's an autonomous studio that reasons and acts — see Beyond documents below.
"Build an AI SaaS for resume screening"
│
▼
┌──────────────────────────┐
│ CEO / Orchestrator │ LangGraph · holds typed state only
└──────────────────────────┘
│
┌──────────────┬──────────────┬──────┴───────┬──────────────┬──────────────┐
▼ ▼ ▼ ▼ ▼
Research → Product → Architecture → Execution → Presentation
fan-out:3 personas data model backlog pitch
sub-research features OpenAPI 3.0 sprints demo script
20+10 tools MVP scope tech stack roadmap investor memo
(cited brief) (cited PRD) (TRD) infra cost judge brief
│
└─ each subagent: isolated context · its own scoped tools · returns ONE typed Pydantic object
flowchart TB
Idea([💡 one-line idea]) --> CEO
subgraph ORG[" "]
direction TB
CEO["🧠 CEO / Orchestrator<br/><i>LangGraph · typed state · SSE events</i>"]
CEO --> R["🔎 Research"]
R --> P["📋 Product"]
P --> A["🏗️ Architecture"]
A --> E["🗂️ Execution"]
E --> PR["🎤 Presentation"]
end
PR --> OUT([📦 Execution Package])
R -. scoped .-> RT["retrieval ×20 + analysis ×10"]
P -. scoped .-> PT["product ×6"]
A -. scoped .-> AT["architecture ×6"]
E -. scoped .-> ET["execution ×6"]
PR -. scoped .-> PRT["presentation ×4"]
CEO -. observed by .-> INFRA["infra (not tools): http retry · rate-limit · metrics · artifact store · render"]
classDef agent fill:#eef,stroke:#88a,stroke-width:1px;
classDef tools fill:#fff7e6,stroke:#d9a441;
class CEO,R,P,A,E,PR agent;
class RT,PT,AT,ET,PRT tools;
Per-agent scoping is the trick that keeps a 50-tool registry coherent: globally there are 52 model-callable tools, but no single agent's model ever sees more than ~20 (ADR-0005). Selection is real function-calling, never an
if intent == …dispatch table.
sequenceDiagram
autonumber
participant U as Client / CLI
participant API as FastAPI
participant CEO as Orchestrator
participant R as Research (fan-out)
participant DA as Downstream agents
U->>API: POST /runs { idea }
API-->>U: 202 { run_id, status: running }
API->>CEO: run_sync(idea) — background thread
CEO->>R: plan angles → N parallel sub-researchers (ThreadPool)
R->>R: each = isolated gather_evidence(focus) over scoped tools
R-->>CEO: ONE merged + compressed ResearchReturn (typed, deduped, cited)
CEO->>DA: Product → Architecture → Execution → Presentation
DA-->>CEO: PRD · TRD(OpenAPI) · ExecutionPlan · PitchPackage
CEO-->>API: lifecycle events (agent_start · artifact_ready · composition · run_complete)
U->>API: GET /runs/{id}/events (SSE — live timeline)
U->>API: GET /runs/{id}/artifacts/prd (typed JSON)
U->>API: GET /runs/{id}/artifacts/prd?format=md (human Markdown)
A single run is naturally 25–35+ tool calls (the research fan-out alone fires 3 sub-researchers) — the long-horizon proof, with only compact summaries ever entering the model's context.
Each arrow is a typed Pydantic object, not a re-prompt. The orchestrator holds only these structured returns — that is the context-management strategy.
flowchart LR
R["<b>ResearchReturn</b><br/>market_size · competitors[]<br/>pain_points[] · evidence[]"]
--> P["<b>PRD</b><br/>personas[] · features[]<br/>mvp_scope · requirements[]"]
P --> T["<b>TRD</b><br/>data_model · api_spec (OpenAPI)<br/>stack[] · scale"]
T --> E["<b>ExecutionPlan</b><br/>backlog[] · sprints[]<br/>roadmap · infra_cost"]
E --> Pi["<b>PitchPackage</b><br/>pitch · demo_script<br/>investor_memo"]
Research.pain_points → Product.prioritize_features → PRD.requirements → Architecture.design_api_contract → TRD → Execution.generate_backlog → sprints — and a composition event is emitted on the Research→PRD handoff so you can see the data flow in the live stream.
The execution package is the foundation, not the finish line. On top of every run, APS adds a reasoning + action layer — each a derived, deterministic, on-demand view (so the JSON-native pipeline, the frozen contract, and the 52-tool count are all untouched):
| Capability | What it does | Endpoint |
|---|---|---|
| 🎯 Startup Score | grades the idea 0–10 across Market Opportunity · Competitive Whitespace · Technical Feasibility · Monetization · Founder Velocity — each with an evidence-grounded rationale | GET /runs/{id}/score[?format=md] |
| ⚖️ Autonomous Debate | a Risk agent argues against building from the same evidence; a synthesizer returns a Build / Pivot / Don't verdict + confidence | GET /runs/{id}/debate[?format=md] |
| 🔍 Explain-Why | for every feature: the pain/competitor it came from, the evidence that grounds it, and a confidence score | GET /runs/{id}/explain[?format=md] |
| 📊 Architecture diagrams | the TRD → live Mermaid system flowchart + ER diagram | GET /runs/{id}/artifacts/trd?format=mermaid |
| 🚀 GitHub Launch Mode | turns the plan into a real GitHub repo — README, milestones, and issues — via the live API (preview-safe without a token) | POST /runs/{id}/launch/github |
flowchart LR
PKG["📦 Execution Package<br/>(Research → PRD → TRD → Plan → Pitch)"]
PKG --> SC["🎯 Startup Score"]
PKG --> DB["⚖️ Debate → verdict"]
PKG --> EX["🔍 Explain-Why"]
PKG --> AD["📊 Architecture diagrams"]
PKG --> GH["🚀 Real GitHub repo"]
"It doesn't just build — it has an opinion, with reasons, and it ships a real repo."
| Namespace | Count | What the model can do | Scoped to |
|---|---|---|---|
| retrieval | 20 | hit 20 distinct live sources (web · GitHub ×4 · HN ×2 · Reddit ×3 · StackExchange · ProductHunt · Trends · arXiv · Wikipedia · PyPI · npm · jobs · pricing · fetch) | Research |
| analysis | 10 | mine pains · cluster themes · sentiment · competitor matrix · market size · rank opportunities · trend signal · validate/dedupe evidence | Research (in compression) |
| product | 6 | personas · user stories · prioritize features · MVP scope · acceptance criteria · assemble PRD | Product |
| architecture | 6 | data model · OpenAPI 3.0 contract · tech stack · scale · components · assemble TRD | Architecture |
| execution | 6 | repo plan · backlog · effort · sprints · roadmap · infra cost | Execution |
| presentation | 4 | pitch outline · demo script · investor memo · judge brief | Presentation |
| = 52 |
Infra is deliberately not counted as tools —
http/retry/metrics/rate_limiter/artifact_store/render/loggingare platform capability. Reclassifying them out of the count (instead of inflating to 59) is itself the judgment Req-1 tests.
| # | Requirement | How APS meets it |
|---|---|---|
| 1 | 50+ model-driven tools | 52 tools across 6 namespaces; ~30 hit genuinely different live sources → the model chooses. Per-agent scoping keeps selection coherent. |
| 2 | Subagent orchestration | 5 specialists, each isolated context + scoped tools, each returning one typed object the CEO holds. |
| 3 | Long-horizon + context mgmt | 25–35+ tool calls/run; the loop holds structured Evidence, the model sees only compact summaries. |
| 4 | Production scaffolding | LangGraph · FastAPI + SSE · Pydantic · Structlog · Tenacity · Prometheus · Pytest · file artifact store · LLM rate-limit. |
| 5 | Real composition | Typed handoffs Research → PRD → TRD → Execution → Pitch; a composition event makes it visible. |
# 1. Environment
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]" # or: pip install -r requirements.txt
# 2. (optional) keys — runs fully without them, better with them
cp .env.example .env # GEMINI_API_KEY or NVIDIA_API_KEY, + APS_GITHUB_PAT, TAVILY_API_KEY …
# 3. Run the full studio from the CLI
python -m aps.orchestrator.run "Build an AI SaaS for resume screening"
python scripts/demo_run.py "a privacy-first habit tracker" # drops JSON + Markdown per artifact
# 4. Run the API (the frontend's backend)
uvicorn aps.api.main:app --reload
# POST /runs · GET /runs/{id} · GET /runs/{id}/events (SSE)
# GET /runs/{id}/artifacts/{name}[?format=md] · GET /metrics
# 5. Tests (offline, hermetic — no keys, no network)
pytest # 319 passing, 1 live test skipped without a keyEvery token-gated tool runs live with its key and otherwise returns [fixture]-stamped sample data (and logs tool_fixture_fallback) — so a reviewer can always run it, and can always tell live data from fixture. With zero keys, research still produces a genuine (simpler) brief via the no-key tools; only a fully empty result marks the run DEGRADED.
| Env var / install | Unlocks | Without it |
|---|---|---|
GEMINI_API_KEY (or GOOGLE_API_KEY) |
Gemini tool-selecting research (default provider) | keyless deterministic research, else stub (DEGRADED) |
NVIDIA_API_KEY + APS_MODEL_PROVIDER=nim |
NIM provider path | — |
APS_GITHUB_PAT |
live GitHub issues / repos / code search | [fixture] |
TAVILY_API_KEY |
live web_search |
[fixture] |
REDDIT_CLIENT_ID + REDDIT_CLIENT_SECRET |
OAuth Reddit (compliant quota) | public JSON (429-prone) → [fixture] |
pip install -e ".[trends]" (pytrends) |
live Google Trends in trends_interest |
[fixture] |
| no keys / no install | HN · StackExchange · Wikipedia · arXiv · jobs (no-key tools) | genuine keyless brief |
Free LLM tiers rate-limit hard (Gemini is 15 RPM) and a single key can 429 mid-demo. So APS
can run over a priority chain of providers with automatic failover — set
APS_PROVIDER_CHAIN and the keys you have:
APS_PROVIDER_CHAIN="groq,cerebras,gemini,nim,openrouter" # try in order, fail over on 429/5xx/timeout/auth- 18 providers supported by just setting a key (Gemini · NVIDIA NIM · Groq · Cerebras · SambaNova · Mistral · OpenRouter · GitHub Models · DeepSeek · Together · xAI · OpenAI · Anthropic) — most are OpenAI-compatible, so it's one client + a base URL.
- Self-hosted / local models too — Ollama, LM Studio, vLLM, LocalAI, llama.cpp: start the
server, set
APS_ENABLE_<NAME>=true(+ optionalAPS_<NAME>_BASE_URL/_MODEL), and add it to the chain (e.g.APS_PROVIDER_CHAIN=vllm,gemini,groq) — run fully offline with a cloud failover behind it. - Parallel diversification: the research fan-out runs each sub-researcher on a different provider → combine free quotas, ~3× throughput, and any single 429 is covered by that unit's own failover chain.
- Per-provider rate limits, key rotation (
GROQ_API_KEY_2…), all deterministic. - Smart & resilient: a capability router orders providers best-fit-first per task
(tool-tasks never get a no-tool model; load-spread by quota headroom), a circuit breaker
benches a 429'd provider for a cooldown so the chain routes around it, portable context
lets a provider switch mid-loop without losing the message history, and
/metricsexposesaps_llm_calls_total{provider}+aps_llm_failover_total{provider}. - Verify your keys:
python scripts/live_providers_smoke.py "<idea>"prints a provider × tool-calling matrix. Full design:multipleAPIplan.md. - Back-compat: unset
APS_PROVIDER_CHAIN→ the original singlegemini/nimpath, unchanged.
The pipeline is JSON-native and JSON-only-persisted. Any artifact flips to a polished Markdown document — persona tables, cited requirements, an endpoint table + the full OpenAPI — via a pure, request-time render. Nothing in the pipeline changes; the renderer is infra, not a tool.
GET /runs/{id}/artifacts/prd → typed JSON (default — machines & frontend)
GET /runs/{id}/artifacts/prd?format=md → text/markdown (humans & download)
| State | When | Meaning |
|---|---|---|
complete |
valid key → real LLM tool-selecting research, or no key → deterministic keyless research over no-key tools | real, idea-specific, grounded evidence |
degraded |
every live/keyless path returned nothing | ran end-to-end on the labeled stub fixture — never reported as complete |
failed |
a key is present but invalid (preflight 401) | fail fast, don't silently degrade |
A fixture-backed run is never reported complete; fixture evidence is [fixture]-stamped and dropped from real briefs. This honesty is the point.
Full pipeline wired — 52 tools, all 6 agents, real LangGraph orchestrator, FastAPI backend with SSE + on-demand Markdown, plus the reasoning/action layer (Startup Score · Autonomous Debate · Explain-Why · Architecture diagrams · real GitHub Launch). 319 tests green (offline/hermetic; 1 live test skipped without a key), whole tree ruff-clean. The live research tool-loop is verified against a real model (NIM nemotron); research fan-out (parallel sub-researchers → one merged brief) is live; Gemini's default path is schema-guarded (binds retrieval-only) with a key-gated live test. Eval scored live on g01 (e2e ✅, schema-valid PRD ✅, coverage 1.0); the 8-idea gold set runs offline in CI. A coherent demo run is committed at docs/demo_state.json.
Output stays credible on noisy ideas: a deterministic noise filter keeps page-chrome ("Log in · Get Started · Book a Demo"), greetings, and issue-template scaffolding out of the pains/features, and a denylist stops directories/social (LinkedIn, G2, Crozdesk) from posing as competitors — so the headline feature is a real pain, not a navbar.
Every backend differentiator is done. Only the React frontend/ remains (the deliberately-deferred area) — and the backend already exposes everything it needs: SSE events, persisted run state, and the /score · /debate · /explain · /launch · ?format=mermaid endpoints. Full hand-off: HANDOFF.md; win-roadmap: remaining.md.
| Path | What lives here |
|---|---|
src/aps/state/ |
Typed Pydantic state — the frozen contract everyone codes against |
src/aps/orchestrator/ |
CEO LangGraph, nodes, EventBus, run entrypoint |
src/aps/agents/research/ |
Research Agent: tool-loop · fan-out supervisor · keyless path · compression |
src/aps/agents/{product,architecture,execution,presentation}/ |
the four downstream agents |
src/aps/tools/{retrieval,analysis,product,architecture,execution,presentation}/ |
the 52 tools + auto-discovery registry.py |
src/aps/render/ |
typed artifact → Markdown + Mermaid (infra, on-demand) |
src/aps/scoring/ · debate/ · explain/ · launch/ |
the reasoning/action layer (Startup Score · Debate · Explain-Why · real GitHub Launch) — derived, not tools |
src/aps/infra/ |
http · retry · metrics · rate_limiter · llm rate-limit · artifact_store · logging |
src/aps/api/ |
FastAPI surface (runs · SSE · artifacts · ?format=md|mermaid · /score · /debate · /explain · /launch · /metrics) |
tests/ |
319 unit + integration + eval tests; gold set + scorers |
docs/ |
PRD · TRD · HLD · ADRs · EVALUATION · API_CONTRACT · MEMO · plan · PRODUCTION_PLAN |
frontend/ |
React UI (deferred) |
Read next: docs/MEMO.md (what's deep vs thin, and the decisions defended) · docs/PRODUCTION_PLAN.md (the path to a deployable product) · docs/TEAM_GUIDE.md.
One finished vertical beats ten agents at 40%.