Deliberative latency & observability: speculative overlap, structured outputs, and UI alignment #1

@fdidonato

Description

Summary

  • Reduce latency by overlapping risk estimation with speculative draft generation where safe.
  • Improve reliability and cost of JSON-emitting modules by adopting structured outputs (schema-constrained responses) where appropriate.
  • Keep the dashboard and markdown export accurate as execution becomes more parallel and logging more structured; the UI and persistence must stay aligned with runtime behavior.

Motivation

  • Latency: Sequential “risk → then generate” adds wall-clock time before deliberation can start. Overlapping compatible work can cut perceived latency without changing governance outcomes, provided routing still uses the final risk result and refuse paths discard unusable drafts.
  • Structured outputs: Critic, simulator, hindsight, and perspectives rely on JSON; parse failures trigger retries and extra LLM calls. Structured outputs reduce malformed JSON and retry churn.
  • Observability: Parallel execution and new logging paths must still persist llm_calls with correct run_id / request_id, and the request detail UI must visualize parallel phases truthfully (not as a misleading sequential pipeline). Markdown export must remain consistent with persisted data.

Scope

A. Speculative overlap (risk || draft generation)

Goal: Run risk estimation and a speculative first-pass policy.generate in parallel; reuse the draft only when routing allows it; discard on REFUSE (accepting wasted token cost in that branch).

Requirements (high level):

  • Orchestration runs overlap only when a config flag allows it (e.g. orchestrator-level toggle).
  • Final routing uses completed risk estimation; speculative draft is never a substitute for policy decisions.
  • Reuse rules respect existing safety constraints (e.g. no reuse where constrained generation or different system prompts apply).
  • Persistence: Any LLM call executed on a worker thread must propagate persistence context (contextvars) so llm_calls rows are not dropped (risk estimator and policy speculative call must both persist).

Non-goals: Changing final_action semantics, constitution evaluation, or refusal logic.
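
The overlap pattern above can be sketched as follows. This is a minimal illustration, not the project's actual code: `estimate_risk`, `generate_draft`, and the `run_id` context variable are hypothetical stand-ins, and the real persistence context may carry more than one variable. The key points it demonstrates are the config-gated overlap, per-task `contextvars.copy_context()` so worker-thread LLM calls keep their correlation IDs, routing on the completed risk result, and discarding the speculative draft on REFUSE.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

# Hypothetical persistence context; the real project may carry run_id/request_id.
run_id: contextvars.ContextVar[str] = contextvars.ContextVar("run_id")

def estimate_risk(prompt: str) -> str:
    # Hypothetical risk estimator; persists an llm_calls row using run_id.get().
    assert run_id.get()  # context must survive the thread hop
    return "ALLOW" if "safe" in prompt else "REFUSE"

def generate_draft(prompt: str) -> str:
    # Hypothetical speculative policy.generate; also persists via run_id.get().
    assert run_id.get()
    return f"draft for {prompt!r}"

def deliberate(prompt: str, overlap_enabled: bool = True) -> Optional[str]:
    if not overlap_enabled:
        # Sequential fallback: risk first, then generate.
        return generate_draft(prompt) if estimate_risk(prompt) == "ALLOW" else None
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Fresh copy_context() per submission: Context.run cannot be entered
        # concurrently, and each worker needs the caller's contextvars.
        risk_f = pool.submit(contextvars.copy_context().run, estimate_risk, prompt)
        draft_f = pool.submit(contextvars.copy_context().run, generate_draft, prompt)
        risk = risk_f.result()    # routing always waits for the final risk result
        draft = draft_f.result()
    if risk == "REFUSE":
        return None  # discard the speculative draft; wasted tokens are accepted
    return draft
```

Note that the speculative draft never influences routing: `risk_f.result()` is consulted before the draft is ever returned.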


B. Structured outputs (JSON modules)

Goal: Where modules require machine-parseable JSON, use API structured output / JSON schema (or project-standard equivalent) so outputs validate without fragile repair loops.

Candidate modules: critic, simulator, hindsight, perspectives (and any shared completion path in policy._complete() that must forward response_format).

Requirements (high level):

  • Define or reuse canonical schemas aligned with existing Pydantic / parsing expectations.
  • Single propagation path through the policy client so modules do not duplicate request-building logic.
  • Backward compatibility: Migration plan for stored rows / benchmarks if response shape changes (document any field additions).
  • Failure handling: Log and surface failures; avoid silent fallback that hides schema drift.

Non-goals: Replacing all free-text generation with JSON; changing module semantics beyond format guarantees.
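
A minimal sketch of the single propagation path: one `_complete`-style helper accepts a `response_format` and forwards it, while the module validates against a canonical schema and surfaces (rather than silently repairs) drift. Everything here is illustrative: `_complete` is a stub, `CRITIC_SCHEMA` and `CriticVerdict` are hypothetical, and the project would likely validate with its existing Pydantic models instead of the hand-rolled check shown.

```python
import json
from dataclasses import dataclass

@dataclass
class CriticVerdict:
    # Hypothetical canonical shape for critic output.
    score: float
    issues: list

# Hypothetical JSON schema forwarded as the structured-output constraint.
CRITIC_SCHEMA = {
    "type": "object",
    "properties": {"score": {"type": "number"}, "issues": {"type": "array"}},
    "required": ["score", "issues"],
}

def _complete(prompt: str, response_format: dict = None) -> str:
    # Stub for the shared policy completion path; a real client would forward
    # response_format to the provider's structured-output API here.
    return json.dumps({"score": 0.8, "issues": ["tone"]})

def validate_verdict(raw: str) -> CriticVerdict:
    data = json.loads(raw)
    if not isinstance(data.get("score"), (int, float)) or not isinstance(
        data.get("issues"), list
    ):
        # Surface schema drift loudly instead of falling back silently.
        raise ValueError("critic output failed schema validation")
    return CriticVerdict(score=float(data["score"]), issues=list(data["issues"]))

def run_critic(prompt: str) -> CriticVerdict:
    raw = _complete(prompt, response_format=CRITIC_SCHEMA)
    return validate_verdict(raw)
```

Because modules call `_complete` rather than building requests themselves, switching providers or schema dialects touches one place.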


C. UI & export alignment

Goal: Request detail and export remain trustworthy when (A) and (B) land.

Requirements (high level):

  • Persistence: All relevant calls appear in llm_calls with correct correlation IDs after parallel execution.
  • Flow / cycle visualization: Grouping of calls into “tiers” (parallel vs. sequential) should reflect wall-clock overlap, not only the static sequence_in_cycle, e.g. the speculative risk+generate overlap, and critic||sim||persp when applicable.
  • Labels: Connector copy between steps must not imply a sequential critic gate when modules actually ran in parallel.
  • Markdown export: export_request_markdown / DB-backed report must include risk and module activity consistent with the UI (same underlying rows).
  • Optional: Short note in module docs / env template for any new flags (structured output toggles, overlap toggles).

Implementation notes (for assignees)

  • Reuse existing contextvars.copy_context() patterns used elsewhere for parallel module execution when submitting work to a thread pool.
  • Centralize UI tier logic: merge adjacent tiers when time ranges overlap after static sequence grouping.
  • Coordinate policy layer (_complete / response_format) with runtime modules that parse JSON.
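
The tier-merging note can be sketched as below: group calls statically by sequence, then collapse adjacent tiers whose wall-clock ranges overlap. The `Call` shape and field names are hypothetical, not the real llm_calls schema; the point is the two-pass structure (static grouping first, time-based merge second).

```python
from dataclasses import dataclass

@dataclass
class Call:
    # Hypothetical projection of an llm_calls row for visualization.
    name: str
    sequence_in_cycle: int
    start: float
    end: float

def build_tiers(calls: list) -> list:
    # Pass 1: static grouping, one tier per sequence_in_cycle value.
    ordered = sorted(calls, key=lambda c: c.sequence_in_cycle)
    tiers = []
    for c in ordered:
        if tiers and tiers[-1][0].sequence_in_cycle == c.sequence_in_cycle:
            tiers[-1].append(c)
        else:
            tiers.append([c])
    # Pass 2: merge adjacent tiers whose time ranges overlap (ran in parallel).
    merged = []
    for tier in tiers:
        if merged:
            prev_end = max(c.end for c in merged[-1])
            cur_start = min(c.start for c in tier)
            if cur_start < prev_end:  # wall-clock overlap: same visual tier
                merged[-1].extend(tier)
                continue
        merged.append(tier)
    return merged
```

With this, a speculative risk+generate overlap renders as one parallel tier even though the two calls carry different sequence numbers, keeping connector labels honest.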

Related documentation

  • Architecture spec / orchestrator module doc for orchestrator flags and flows.
  • Persistence module doc for llm_calls and context.
