Skip to content

Latest commit

 

History

History
101 lines (85 loc) · 10.4 KB

File metadata and controls

101 lines (85 loc) · 10.4 KB

Agent Worker Implementation Specification

This document belongs to the L1 kernel layer and describes how the Worker implements and complies with the L0 protocol.

1. Roles and Responsibilities

  • Subscribe to cmd.agent.{worker_target}.wakeup as the execution bell, determined by [worker].worker_targets in config.toml.
  • Pull tasks and receipts from state.agent_inbox.
  • Hydrate state/resource/cards, then execute the ReAct loop.
  • Write cards and state.*, and publish events and streaming output.
  • The semantic source of truth is the inbox record; the wakeup header is not the semantic source of truth.

2. Concurrency Model and State Constraints

  • Single instance, multi-coroutine: A fixed number of coroutines are started per process/instance, and tasks are processed concurrently from an internal queue.
  • Strict statelessness: Do not store any mutable business state related to agent/turn/step on self (for example, current agent, context, temporary results, cache, etc.).
  • State ownership: Business state must exist only in WorkItem, function local variables, or persistent storage.
  • Allowed self contents: Immutable config, connection pools/clients, queues, loggers, and other infrastructure objects.
  • Lock boundary: Locks may only be used for shared statistics/throttling/resource-pool management without isolation requirements; they must not be used to share business state with a lock as an isolation substitute.

3. Worker Target

  • worker_target is the unique routing field and comes from Profile/Roster.
  • Worker instances declare which targets to consume via [worker].worker_targets in config.toml (multiple targets allowed).
  • Common values: worker_generic (generic pool), ui_worker (UI entrypoint), sandbox (controlled execution environment).

3.1 Single-instance Target (svc1_) Naming Convention

  • Purpose: provide a deliverable, non-contention route for “stateful inbox-consuming services/orchestrators” (to avoid contention among multiple process instances for the same agent_id inbox).
  • Protocol constraint: worker_target must be a single segment token of a NATS subject; . / * / > / whitespace are forbidden. Recommended character set is [a-z0-9_-], all lowercase.
  • Naming format: svc1_<service>_<scope>
  • service: service/binary name, such as better_demo / ground_control.
  • scope: unique scope; recommended to include env + cluster or project_id (to avoid cross-environment/cross-project collisions), such as dev_proj_cgnd_demo_01.
  • Operational rule (must follow): Any svc1_* target in the same environment may only be subscribed to and consumed by one process instance (the corresponding cmd.agent.{target}.wakeup can only have one “real processor”).
  • Multi-replica/HA: Do not let multiple replicas share the same svc1_*. Instead, use per-replica independent agent_id + worker_target (upstream selects caller by instance), or introduce a leader election/DB lock mechanism for single-primary semantics before sharing.

4. Read/Write Boundaries

  • Read: state.*, resource.*, cards/boxes (through card-box-cg).
  • Write: state.agent_state_head/agent_steps (for only its own AgentTurn), cards/boxes; when necessary write Report to state.agent_inbox via ExecutionService.

5. Core Flow (including state and hydration)

  1. Receive cmd.agent.*.wakeupclaim inbox first (FOR UPDATE SKIP LOCKED), then state update with CAS gate.
  2. state_agent initializes agent_turn_id/turn_epoch at enqueue/claim in L0; agent_turn is not generated by Worker.
  3. When entering processing, Worker sets agent_turn to running (state transition with CAS protection), then hydrates in order: state.agent_state_headresource.*profile_box_id/context_box_id/output_box_id.
  4. Greedy Batch: fetch all executable inbox for the same Worker-correlated agents from claim_pending_inbox, and process them in chronological order.
  5. If tool calls exist: write tool.call card and publish cmd.tool.* / cmd.sys.pmo.internal.*, record tool_call_id within the step, and set activity=executing_tool (LLM step metadata).
  6. Each tool wait writes turn_waiting_tools to state and sets resume_deadline; suspend_timeout is governed by worker-specific config and tool-side timeout constraints. The current implementation does not persist expecting_correlation_id as the matching condition (common path is None); resume judgment is unified on turn_waiting_tools + state.agent_inbox(correlation=tool_call_id).
  7. When status=suspended, wake/resume no longer depends on single-correlation filtering; watchdog and resume paths reconcile turn_waiting_tools rows in waiting/received state and claim the matching due inbox rows (pending/deferred) before continuing.
  8. After tool return, or for normal LLM turns with no tool calls, continue to the next step under next_step(is_continuation=True) within the same agent_turn_id/turn_epoch (intra-turn continuation; resume may cause turn_epoch reordering).
  9. Terminal state: write task.deliverable, publish evt.agent.*.task, clear active_agent_turn_id, and return status to idle.

Constraint: Worker must not generate agent_turn_id/turn_epoch by itself; invalid/missing inbox should log warnings and stop related side effects.

Supplemental note (context incremental updates during batch processing):

  • Hydration is done once on wakeup to obtain an initial context snapshot.
  • The Worker ReAct is step-level continuation: a single turn may trigger multiple model calls in one processing path, and multiple Enqueue/report operations can sequence progress within the same agent_turn.
  • When processing each step, new input and output cards are incrementally appended to context, rather than fully rebuilding each time.
  • output_box_id cards written via sync writes can be replayed through subsequent wakeup hydration to recover after crash/restart.

6. Hydration Order (fixed)

  1. Read state.agent_state_head and perform gate checks.
  2. Read resource.project_agents / resource.tools / resource.profiles.
  3. Read profile_box_id / context_box_id / output_box_id.

Supplemental note (visibility boundary):

  • Worker only reads cards/boxes referenced by the above box_ids and does not automatically scan or read other boxes in the project.
  • Therefore, cross-agent context sharing must happen explicitly: upstream/PMO must perform context packing (or in future by tool), assembling the required cards into a new context_box_id.
  • Reference: 03_kernel_l1/pmo_orchestration.md
  • output_box_id is rolling memory: each turn merges cards from output_box_id into context, so outputs are read back.

6.1 Tool Specification Construction (tool_spec integration)

  • Entry file: services/agent_worker/tool_spec.py
    • build_tool_specs: converts rows in resource.tools to LLM tool specs (OpenAI-style); actual implementation is in infra/llm/tool_specs.py.
    • filter_allowed_tools: filters by profile allowed_tools allowlist and logs mismatch warnings.
  • Processing details:
    • options.args.defaults: inject defaults when LLM omits parameters.
    • options.args.fixed / options.envs: force inject and hide from LLM parameters.
  • After building the tool list, Worker appends the built-in submit_result (see services/agent_worker/builtin_tools.py).

7. Event Publishing

  • evt.agent.*.step: phase event (started/planning/executing/completed).
  • evt.agent.*.task: terminal event, must include deliverable_card_id.
  • evt.agent.*.chunk: streaming output (Core NATS).
  • LLM return usage / response_cost are written to state.agent_steps.metadata (llm_usage / llm_response_cost), which supports SQL aggregation.
  • The tool-call ID list returned by LLM is written to state.agent_steps.tool_call_ids (step-level index, replacing legacy state.agent_turns.tool_call_ids).

8. Idempotency and Error Handling

  • Any UPDATE state must include turn_epoch + active_agent_turn_id; zero affected rows means stop side effects.
  • Idempotency and convergence of tool_result/tool callback are jointly constrained by turn_waiting_tools + state.agent_inbox(correlation_id=tool_call_id); apply is single-winner on the waiting row, and later duplicates converge as idempotent consume rather than duplicate append.
  • If status or turn mismatch/box expired, resume is rerouted to drop/retry and emits observability signals instead of blind duplicate submission.
  • Worker watchdog rescans wait tables and claims due rows from state.agent_inbox: on timeout and replay conditions, it injects tool.result timeout report and drives wakeup back into the same turn flow.
  • error/parse failures use a failure fallback path: for example, when task.result_fields or tool schemas are invalid or parse errors occur, it logs warnings and continues an available degraded path; it is not treated as hard protocol rejection unless the model request cannot be built.
  • Worker crashes are reclaimed by PMO; watchdog and resume handle timeout and replay, and should not modify others' state across agents.

9. Explicit Completion (built-in submit_result)

  • The built-in workflow tool submit_result does not go through UTP.
  • Semantics: write tool.result + task.deliverable and end the Turn.
  • Constraint: only one invocation of this tool is allowed within the same turn.
  • If context_box_id contains task.result_fields, do minimal field validation; if parsing fails, log a warning and continue with field-missing degraded output (current implementation favors availability first).
  • task.result_fields structural requirement: content should be parsable as FieldsSchemaContent; if it does not conform, log a warning and apply backward-compatible degradation instead of hard rejection (unless a key field needed by the main chain is unavailable).
  • Profile field must_end_with (default empty): when a turn has no tool_calls:
    • If must_end_with is empty/missing: automatically write this turn’s assistant content as task.deliverable and end the Turn (empty content uses placeholder text).
    • If must_end_with is non-empty: write sys.must_end_with_required prompt and re-enqueue to continue (requiring one of the tools listed there in the next call).

Note: The mandatory check for must_end_with currently applies only when there are no tool_calls and the Agent attempts natural completion. If a tool called in the turn explicitly returns after_execution=terminate, the current Turn is directly terminated, allowing must_end_with to be bypassed.