Deliberative latency & observability: speculative overlap, structured outputs, and UI alignment #1

@fdidonato

Description

Summary

  • Reduce latency by overlapping risk estimation with speculative draft generation where safe.
  • Improve reliability and cost of JSON-emitting modules by adopting structured outputs (schema-constrained responses) where appropriate.
  • Keep the dashboard and markdown export accurate as execution becomes more parallel and logging more structured; the UI and persistence must stay aligned with runtime behavior.

Motivation

  • Latency: Sequential “risk → then generate” adds wall-clock time before deliberation can start. Overlapping compatible work can cut perceived latency without changing governance outcomes, provided routing still uses the final risk result and refuse paths discard unusable drafts.
  • Structured outputs: Critic, simulator, hindsight, and perspectives rely on JSON; parse failures trigger retries and extra LLM calls. Structured outputs reduce malformed JSON and retry churn.
  • Observability: Parallel execution and new logging paths must still persist llm_calls with correct run_id / request_id, and the request detail UI must visualize parallel phases truthfully (not as a misleading sequential pipeline). Markdown export must remain consistent with persisted data.

Scope

A. Speculative overlap (risk || draft generation)

Goal: Run risk estimation and a speculative first-pass policy.generate in parallel; reuse the draft only when routing allows it; discard on REFUSE (accepting wasted token cost in that branch).

Requirements (high level):

  • Orchestration runs overlap only when a config flag allows it (e.g. orchestrator-level toggle).
  • Final routing uses completed risk estimation; speculative draft is never a substitute for policy decisions.
  • Reuse rules respect existing safety constraints (e.g. no reuse where constrained generation or different system prompts apply).
  • Persistence: Any LLM call executed on a worker thread must propagate persistence context (contextvars) so llm_calls rows are not dropped (risk estimator and policy speculative call must both persist).

Non-goals: Changing final_action semantics, constitution evaluation, or refusal logic.
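
The overlap pattern above can be sketched as follows. This is a minimal illustration, not the project's actual code: `estimate_risk`, `generate_draft`, and the `run_id` context variable are hypothetical stand-ins, and the real persistence context may carry more than one variable. The key points it demonstrates are the config-gated overlap, per-task `contextvars.copy_context()` so worker-thread LLM calls keep their correlation IDs, routing on the completed risk result, and discarding the speculative draft on REFUSE.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

# Hypothetical persistence context; the real project may carry run_id/request_id.
run_id: contextvars.ContextVar[str] = contextvars.ContextVar("run_id")

def estimate_risk(prompt: str) -> str:
    # Hypothetical risk estimator; persists an llm_calls row using run_id.get().
    assert run_id.get()  # context must survive the thread hop
    return "ALLOW" if "safe" in prompt else "REFUSE"

def generate_draft(prompt: str) -> str:
    # Hypothetical speculative policy.generate; also persists via run_id.get().
    assert run_id.get()
    return f"draft for {prompt!r}"

def deliberate(prompt: str, overlap_enabled: bool = True) -> Optional[str]:
    if not overlap_enabled:
        # Sequential fallback: risk first, then generate.
        return generate_draft(prompt) if estimate_risk(prompt) == "ALLOW" else None
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Fresh copy_context() per submission: Context.run cannot be entered
        # concurrently, and each worker needs the caller's contextvars.
        risk_f = pool.submit(contextvars.copy_context().run, estimate_risk, prompt)
        draft_f = pool.submit(contextvars.copy_context().run, generate_draft, prompt)
        risk = risk_f.result()    # routing always waits for the final risk result
        draft = draft_f.result()
    if risk == "REFUSE":
        return None  # discard the speculative draft; wasted tokens are accepted
    return draft
```

Note that the speculative draft never influences routing: `risk_f.result()` is consulted before the draft is ever returned.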


B. Structured outputs (JSON modules)

Goal: Where modules require machine-parseable JSON, use API structured output / JSON schema (or project-standard equivalent) so outputs validate without fragile repair loops.

Candidate modules: critic, simulator, hindsight, perspectives (and any shared completion path in policy._complete() that must forward response_format).

Requirements (high level):

  • Define or reuse canonical schemas aligned with existing Pydantic / parsing expectations.
  • Single propagation path through the policy client so modules do not duplicate request-building logic.
  • Backward compatibility: Migration plan for stored rows / benchmarks if response shape changes (document any field additions).
  • Failure handling: Log and surface failures; avoid silent fallback that hides schema drift.

Non-goals: Replacing all free-text generation with JSON; changing module semantics beyond format guarantees.
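
A minimal sketch of the single propagation path: one `_complete`-style helper accepts a `response_format` and forwards it, while the module validates against a canonical schema and surfaces (rather than silently repairs) drift. Everything here is illustrative: `_complete` is a stub, `CRITIC_SCHEMA` and `CriticVerdict` are hypothetical, and the project would likely validate with its existing Pydantic models instead of the hand-rolled check shown.

```python
import json
from dataclasses import dataclass

@dataclass
class CriticVerdict:
    # Hypothetical canonical shape for critic output.
    score: float
    issues: list

# Hypothetical JSON schema forwarded as the structured-output constraint.
CRITIC_SCHEMA = {
    "type": "object",
    "properties": {"score": {"type": "number"}, "issues": {"type": "array"}},
    "required": ["score", "issues"],
}

def _complete(prompt: str, response_format: dict = None) -> str:
    # Stub for the shared policy completion path; a real client would forward
    # response_format to the provider's structured-output API here.
    return json.dumps({"score": 0.8, "issues": ["tone"]})

def validate_verdict(raw: str) -> CriticVerdict:
    data = json.loads(raw)
    if not isinstance(data.get("score"), (int, float)) or not isinstance(
        data.get("issues"), list
    ):
        # Surface schema drift loudly instead of falling back silently.
        raise ValueError("critic output failed schema validation")
    return CriticVerdict(score=float(data["score"]), issues=list(data["issues"]))

def run_critic(prompt: str) -> CriticVerdict:
    raw = _complete(prompt, response_format=CRITIC_SCHEMA)
    return validate_verdict(raw)
```

Because modules call `_complete` rather than building requests themselves, switching providers or schema dialects touches one place.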


C. UI & export alignment

Goal: Request detail and export remain trustworthy when (A) and (B) land.

Requirements (high level):

  • Persistence: All relevant calls appear in llm_calls with correct correlation IDs after parallel execution.
  • Flow / cycle visualization: Grouping of calls into “tiers” (parallel vs. sequential) should reflect wall-clock overlap, not only the static sequence_in_cycle, e.g. the speculative risk+generate overlap, and critic||sim||persp when applicable.
  • Labels: Connector copy between steps must not imply a sequential critic gate when modules actually ran in parallel.
  • Markdown export: export_request_markdown / DB-backed report must include risk and module activity consistent with the UI (same underlying rows).
  • Optional: Short note in module docs / env template for any new flags (structured output toggles, overlap toggles).

Implementation notes (for assignees)

  • Reuse existing contextvars.copy_context() patterns used elsewhere for parallel module execution when submitting work to a thread pool.
  • Centralize UI tier logic: merge adjacent tiers when time ranges overlap after static sequence grouping.
  • Coordinate policy layer (_complete / response_format) with runtime modules that parse JSON.
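
The tier-merging note can be sketched as below: group calls statically by sequence, then collapse adjacent tiers whose wall-clock ranges overlap. The `Call` shape and field names are hypothetical, not the real llm_calls schema; the point is the two-pass structure (static grouping first, time-based merge second).

```python
from dataclasses import dataclass

@dataclass
class Call:
    # Hypothetical projection of an llm_calls row for visualization.
    name: str
    sequence_in_cycle: int
    start: float
    end: float

def build_tiers(calls: list) -> list:
    # Pass 1: static grouping, one tier per sequence_in_cycle value.
    ordered = sorted(calls, key=lambda c: c.sequence_in_cycle)
    tiers = []
    for c in ordered:
        if tiers and tiers[-1][0].sequence_in_cycle == c.sequence_in_cycle:
            tiers[-1].append(c)
        else:
            tiers.append([c])
    # Pass 2: merge adjacent tiers whose time ranges overlap (ran in parallel).
    merged = []
    for tier in tiers:
        if merged:
            prev_end = max(c.end for c in merged[-1])
            cur_start = min(c.start for c in tier)
            if cur_start < prev_end:  # wall-clock overlap: same visual tier
                merged[-1].extend(tier)
                continue
        merged.append(tier)
    return merged
```

With this, a speculative risk+generate overlap renders as one parallel tier even though the two calls carry different sequence numbers, keeping connector labels honest.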

Related documentation

  • Architecture spec / orchestrator module doc for orchestrator flags and flows.
  • Persistence module doc for llm_calls and context.
