Agent Worker Implementation Specification

This document belongs to the L1 kernel layer and describes how the Worker implements and complies with the L0 protocol.

1. Roles and Responsibilities

Subscribe to cmd.agent.{worker_target}.wakeup as the execution bell; the targets are determined by [worker].worker_targets in config.toml.
Pull tasks and receipts from state.agent_inbox.
Hydrate state/resource/cards, then execute the ReAct loop.
Write cards and state.*, and publish events and streaming output.
The inbox record is the semantic source of truth; the wakeup header is not.

2. Concurrency Model and State Constraints

Single instance, multi-coroutine: A fixed number of coroutines are started per process/instance, and tasks are processed concurrently from an internal queue.
Strict statelessness: Do not store any mutable business state related to agent/turn/step on self (for example, current agent, context, temporary results, cache, etc.).
State ownership: Business state must exist only in WorkItem, function local variables, or persistent storage.
Allowed self contents: Immutable config, connection pools/clients, queues, loggers, and other infrastructure objects.
Lock boundary: Locks may only guard shared statistics, throttling, and resource-pool management, none of which have isolation requirements; a lock must not be used as a substitute for isolation in order to share business state.

3. Worker Target

worker_target is the unique routing field and comes from Profile/Roster.
Worker instances declare which targets to consume via [worker].worker_targets in config.toml (multiple targets allowed).
Common values: worker_generic (generic pool), ui_worker (UI entrypoint), sandbox (controlled execution environment).

3.1 Single-instance Target (`svc1_`) Naming Convention

Purpose: provide a deliverable, non-contention route for “stateful inbox-consuming services/orchestrators” (to avoid contention among multiple process instances for the same agent_id inbox).
Protocol constraint: worker_target must be a single segment token of a NATS subject; . / * / > / whitespace are forbidden. Recommended character set is [a-z0-9_-], all lowercase.
Naming format: svc1_<service>_<scope>
service: service/binary name, such as better_demo / ground_control.
scope: unique scope; recommended to include env + cluster or project_id (to avoid cross-environment/cross-project collisions), such as dev_proj_cgnd_demo_01.
Operational rule (must follow): Any svc1_* target in the same environment may only be subscribed to and consumed by one process instance (the corresponding cmd.agent.{target}.wakeup can only have one “real processor”).
Multi-replica/HA: Do not let multiple replicas share the same svc1_*. Instead, use per-replica independent agent_id + worker_target (upstream selects caller by instance), or introduce a leader election/DB lock mechanism for single-primary semantics before sharing.
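
A small validation sketch for the naming convention follows. It enforces the recommended character set [a-z0-9_-], which is stricter than the protocol constraint itself; the function names are hypothetical.

```python
import re

# Recommended character set from the text: lowercase letters, digits, "_" and "-".
_TOKEN_RE = re.compile(r"[a-z0-9_-]+")


def is_valid_worker_target(target: str) -> bool:
    """True if target is a single subject token in the recommended character set."""
    return _TOKEN_RE.fullmatch(target) is not None


def make_svc1_target(service: str, scope: str) -> str:
    """Build svc1_<service>_<scope> and reject names that break the convention."""
    if not service or not scope:
        raise ValueError("service and scope must be non-empty")
    target = f"svc1_{service}_{scope}"
    if not is_valid_worker_target(target):
        raise ValueError(f"invalid worker_target: {target!r}")
    return target
```

For example, make_svc1_target("better_demo", "dev_proj_cgnd_demo_01") yields svc1_better_demo_dev_proj_cgnd_demo_01, while a scope containing "." or whitespace is rejected because it would no longer be a single subject token.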

4. Read/Write Boundaries

Read: state.*, resource.*, cards/boxes (through card-box-cg).
Write: state.agent_state_head/agent_steps (for only its own AgentTurn), cards/boxes; when necessary write Report to state.agent_inbox via ExecutionService.

5. Core Flow (including state and hydration)

Receive cmd.agent.*.wakeup → claim the inbox first (FOR UPDATE SKIP LOCKED), then update state behind a CAS gate.
state_agent initializes agent_turn_id/turn_epoch at enqueue/claim in L0; agent_turn is not generated by Worker.
When entering processing, Worker sets agent_turn to running (state transition with CAS protection), then hydrates in order: state.agent_state_head → resource.* → profile_box_id/context_box_id/output_box_id.
Greedy Batch: use claim_pending_inbox to fetch all executable inbox rows for the agents correlated with this Worker, and process them in chronological order.
If tool calls exist: write tool.call card and publish cmd.tool.* / cmd.sys.pmo.internal.*, record tool_call_id within the step, and set activity=executing_tool (LLM step metadata).
Each tool wait writes turn_waiting_tools to state and sets resume_deadline; suspend_timeout is governed by worker-specific config and tool-side timeout constraints. The current implementation does not persist expecting_correlation_id as the matching condition (common path is None); resume judgment is unified on turn_waiting_tools + state.agent_inbox(correlation=tool_call_id).
When status=suspended, wake/resume no longer depends on single-correlation filtering; watchdog and resume paths reconcile turn_waiting_tools rows in waiting/received state and claim the matching due inbox rows (pending/deferred) before continuing.
After tool return, or for normal LLM turns with no tool calls, continue to the next step under next_step(is_continuation=True) within the same agent_turn_id/turn_epoch (intra-turn continuation; resume may cause turn_epoch reordering).
Terminal state: write task.deliverable, publish evt.agent.*.task, clear active_agent_turn_id, and return status to idle.

Constraint: Worker must not generate agent_turn_id/turn_epoch by itself; invalid/missing inbox should log warnings and stop related side effects.
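
The CAS-protected transition to running can be sketched as a conditional UPDATE. This is an illustration only: it uses an in-memory SQLite database as a stand-in for the real store, and the table layout, column names, and the pre-running status value are assumptions. What it shows is the gate itself: the UPDATE matches on active_agent_turn_id and turn_epoch, and zero affected rows means the Worker stops.

```python
import sqlite3


def mark_turn_running(conn: sqlite3.Connection, agent_id: str,
                      agent_turn_id: str, turn_epoch: int) -> bool:
    """Return True if the CAS gate passed and the row was updated."""
    cur = conn.execute(
        """
        UPDATE agent_state_head
           SET status = 'running'
         WHERE agent_id = ?
           AND active_agent_turn_id = ?
           AND turn_epoch = ?
        """,
        (agent_id, agent_turn_id, turn_epoch),
    )
    conn.commit()
    # Zero affected rows means the gate failed: the caller stops side effects.
    return cur.rowcount == 1


def demo_connection() -> sqlite3.Connection:
    """Create a one-row table in an assumed layout for the demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE agent_state_head ("
        "agent_id TEXT PRIMARY KEY, status TEXT, "
        "active_agent_turn_id TEXT, turn_epoch INTEGER)"
    )
    # 'queued' is a hypothetical pre-running status used only in this sketch.
    conn.execute(
        "INSERT INTO agent_state_head VALUES ('agent-1', 'queued', 'turn-1', 3)"
    )
    conn.commit()
    return conn
```

Note that the Worker only reads agent_turn_id and turn_epoch from the claimed inbox record and passes them in; it never mints them.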

Supplemental note (context incremental updates during batch processing):

Hydration is done once on wakeup to obtain an initial context snapshot.
The Worker's ReAct loop uses step-level continuation: a single turn may trigger multiple model calls in one processing path, and multiple Enqueue/report operations can sequence progress within the same agent_turn.
When processing each step, new input and output cards are incrementally appended to context, rather than fully rebuilding each time.
output_box_id cards written via sync writes can be replayed through subsequent wakeup hydration to recover after crash/restart.
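
The hydrate-once, append-per-step behavior can be pictured with the short sketch below. The Card and TurnContext types are hypothetical simplifications; the real card model is not described here.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass(frozen=True)
class Card:
    card_id: str
    card_type: str
    content: str


@dataclass
class TurnContext:
    cards: List[Card] = field(default_factory=list)

    def append(self, new_cards: Iterable[Card]) -> None:
        # Each step only appends its new input/output cards.
        self.cards.extend(new_cards)


def hydrate_context(stored: Iterable[Card]) -> TurnContext:
    """Build the initial snapshot once, at wakeup."""
    return TurnContext(cards=list(stored))
```

In keeping with the statelessness rule, such a context object would be held in a local variable or on the WorkItem for the duration of processing, not on the Worker instance.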

6. Hydration Order (fixed)

Read state.agent_state_head and perform gate checks.
Read resource.project_agents / resource.tools / resource.profiles.
Read profile_box_id / context_box_id / output_box_id.
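
The fixed order above can be sketched as a single function that takes the three readers as parameters. Everything here is illustrative: the reader callables, the GateRejected exception, the concrete gate condition (matching active_agent_turn_id and turn_epoch), and the assumption that the box ids are available on the state head are not confirmed by this document.

```python
from typing import Any, Callable, Dict, Optional

RESOURCE_TABLES = ("project_agents", "tools", "profiles")
BOX_KEYS = ("profile_box_id", "context_box_id", "output_box_id")


class GateRejected(Exception):
    """Raised when the state head does not match the claimed turn."""


def hydrate(
    agent_turn_id: str,
    turn_epoch: int,
    read_state_head: Callable[[], Optional[Dict[str, Any]]],
    read_resource: Callable[[str], Any],
    read_box: Callable[[str], Any],
) -> Dict[str, Any]:
    # 1. State head first, so a stale wakeup is rejected before further reads.
    head = read_state_head()
    if (
        head is None
        or head["active_agent_turn_id"] != agent_turn_id
        or head["turn_epoch"] != turn_epoch
    ):
        raise GateRejected(agent_turn_id)
    # 2. Resource tables, in the order listed above.
    resources = {name: read_resource(name) for name in RESOURCE_TABLES}
    # 3. Boxes (assumed here to be referenced from the state head).
    boxes = {key: read_box(head[key]) for key in BOX_KEYS}
    return {"head": head, "resources": resources, "boxes": boxes}
```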

Supplemental note (visibility boundary):

Worker only reads cards/boxes referenced by the above box_ids and does not automatically scan or read other boxes in the project.
Therefore, cross-agent context sharing must happen explicitly: upstream/PMO must perform context packing (a tool may do this in the future), assembling the required cards into a new context_box_id.
Reference: 03_kernel_l1/pmo_orchestration.md
output_box_id is rolling memory: each turn merges cards from output_box_id into context, so outputs are read back.

6.1 Tool Specification Construction (tool_spec integration)

Entry file: services/agent_worker/tool_spec.py
- build_tool_specs: converts rows in resource.tools to LLM tool specs (OpenAI-style); actual implementation is in infra/llm/tool_specs.py.
- filter_allowed_tools: filters by profile allowed_tools allowlist and logs mismatch warnings.
Processing details:
- options.args.defaults: inject defaults when LLM omits parameters.
- options.args.fixed / options.envs: force inject and hide from LLM parameters.
After building the tool list, Worker appends the built-in submit_result (see services/agent_worker/builtin_tools.py).
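
The sketch below illustrates the processing details under stated assumptions. It is not the code in services/agent_worker/tool_spec.py or infra/llm/tool_specs.py: the function names are deliberately different, and the row layout (name, description, parameters as a JSON Schema object, options.args.defaults, options.args.fixed) is hypothetical. Handling of options.envs is omitted, and "mismatch" is interpreted here as an allowlist entry with no matching tool row.

```python
import copy
import logging
from typing import Any, Dict, Iterable, List

logger = logging.getLogger(__name__)


def sketch_tool_spec(row: Dict[str, Any]) -> Dict[str, Any]:
    """Build an OpenAI-style function spec, hiding fixed arguments from the LLM."""
    fixed = row.get("options", {}).get("args", {}).get("fixed", {})
    params = copy.deepcopy(row["parameters"])  # never mutate the stored row
    for name in fixed:
        params.get("properties", {}).pop(name, None)
    if "required" in params:
        params["required"] = [n for n in params["required"] if n not in fixed]
    return {
        "type": "function",
        "function": {
            "name": row["name"],
            "description": row.get("description", ""),
            "parameters": params,
        },
    }


def sketch_filter_allowed(rows: List[Dict[str, Any]],
                          allowed_tools: Iterable[str]) -> List[Dict[str, Any]]:
    """Keep only allowlisted rows; warn about allowlist names with no row."""
    allowed = set(allowed_tools)
    known = {row["name"] for row in rows}
    for name in sorted(allowed - known):
        logger.warning("allowed tool %r not found in tool rows", name)
    return [row for row in rows if row["name"] in allowed]


def sketch_resolve_args(row: Dict[str, Any], llm_args: Dict[str, Any]) -> Dict[str, Any]:
    """Merge call arguments; later entries win: defaults < LLM values < fixed."""
    args_opts = row.get("options", {}).get("args", {})
    return {**args_opts.get("defaults", {}), **llm_args, **args_opts.get("fixed", {})}
```

Applying fixed values last means that even if the model supplies a hidden parameter, the configured value is what reaches the tool.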

7. Event Publishing

evt.agent.*.step: phase event (started/planning/executing/completed).
evt.agent.*.task: terminal event, must include deliverable_card_id.
evt.agent.*.chunk: streaming output (Core NATS).
The usage and response_cost values returned by the LLM are written to state.agent_steps.metadata (llm_usage / llm_response_cost), which supports SQL aggregation.
The tool-call ID list returned by LLM is written to state.agent_steps.tool_call_ids (step-level index, replacing legacy state.agent_turns.tool_call_ids).
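
A minimal sketch of event construction follows, mainly to show the two checks implied above: step events carry one of the four phases, and task events are rejected without a deliverable_card_id. The helper names, the meaning of the wildcard token, and every payload field other than deliverable_card_id are assumptions.

```python
from typing import Any, Dict, Tuple

STEP_PHASES = ("started", "planning", "executing", "completed")


def step_event(token: str, phase: str, agent_turn_id: str) -> Tuple[str, Dict[str, Any]]:
    """Return (subject, payload) for a phase event."""
    if phase not in STEP_PHASES:
        raise ValueError(f"unknown phase: {phase!r}")
    return f"evt.agent.{token}.step", {"phase": phase, "agent_turn_id": agent_turn_id}


def task_event(token: str, agent_turn_id: str,
               deliverable_card_id: str) -> Tuple[str, Dict[str, Any]]:
    """Return (subject, payload) for the terminal event."""
    if not deliverable_card_id:
        raise ValueError("terminal event must include deliverable_card_id")
    payload = {"agent_turn_id": agent_turn_id, "deliverable_card_id": deliverable_card_id}
    return f"evt.agent.{token}.task", payload
```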

8. Idempotency and Error Handling

Any UPDATE state must include turn_epoch + active_agent_turn_id; zero affected rows means stop side effects.
Idempotency and convergence of tool_result/tool callback are jointly constrained by turn_waiting_tools + state.agent_inbox(correlation_id=tool_call_id); apply is single-winner on the waiting row, and later duplicates converge as idempotent consume rather than duplicate append.
If the status or turn does not match, or the box has expired, resume is rerouted to drop/retry and emits observability signals instead of blindly resubmitting.
Worker watchdog rescans wait tables and claims due rows from state.agent_inbox: on timeout and replay conditions, it injects tool.result timeout report and drives wakeup back into the same turn flow.
Error and parse failures take a fallback path: for example, when task.result_fields or tool schemas are invalid, or parse errors occur, the Worker logs warnings and continues on an available degraded path; this is not treated as a hard protocol rejection unless the model request cannot be built.
Worker crashes are reclaimed by PMO; watchdog and resume handle timeout and replay, and should not modify the state of other agents.
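
The single-winner apply on the waiting row can be sketched as a conditional UPDATE. As before, SQLite stands in for the real store and the table layout is an assumption; the waiting and received status values are the ones named in this document. The applied list is a stand-in for appending the tool.result card.

```python
import sqlite3
from typing import List, Tuple


def apply_tool_result(conn: sqlite3.Connection, tool_call_id: str,
                      result: str, applied: List[Tuple[str, str]]) -> bool:
    """Apply a tool result at most once; return True only for the winner."""
    cur = conn.execute(
        "UPDATE turn_waiting_tools SET status = 'received' "
        "WHERE tool_call_id = ? AND status = 'waiting'",
        (tool_call_id,),
    )
    conn.commit()
    if cur.rowcount == 1:
        # Only the winner appends; this list stands in for the card write.
        applied.append((tool_call_id, result))
        return True
    # Duplicate or unknown callback: consume idempotently, append nothing.
    return False


def demo_waiting_table() -> sqlite3.Connection:
    """Create a waiting row in an assumed layout for the demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE turn_waiting_tools (tool_call_id TEXT PRIMARY KEY, status TEXT)"
    )
    conn.execute("INSERT INTO turn_waiting_tools VALUES ('call-1', 'waiting')")
    conn.commit()
    return conn
```

A second delivery of the same tool_call_id finds no row in the waiting state, so it returns False without producing a duplicate append.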

9. Explicit Completion (built-in submit_result)

The built-in workflow tool submit_result does not go through UTP.
Semantics: write tool.result + task.deliverable and end the Turn.
Constraint: only one invocation of this tool is allowed within the same turn.
If context_box_id contains task.result_fields, do minimal field validation; if parsing fails, log a warning and continue with field-missing degraded output (current implementation favors availability first).
task.result_fields structural requirement: content should be parsable as FieldsSchemaContent; if it does not conform, log a warning and apply backward-compatible degradation instead of hard rejection (unless a key field needed by the main chain is unavailable).
Profile field must_end_with (default empty): when a turn has no tool_calls:
- If must_end_with is empty/missing: automatically write this turn’s assistant content as task.deliverable and end the Turn (empty content uses placeholder text).
- If must_end_with is non-empty: write sys.must_end_with_required prompt and re-enqueue to continue (requiring one of the tools listed there in the next call).

Note: The mandatory check for must_end_with currently applies only when there are no tool_calls and the Agent attempts natural completion. If a tool called in the turn explicitly returns after_execution=terminate, the current Turn is directly terminated, allowing must_end_with to be bypassed.
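
The end-of-step decision described in this section can be summarized as a pure function. This is a sketch under assumptions: the Outcome names, the function signature, and the placeholder wording are hypothetical, and tool_calls is simplified to a list of tool names.

```python
from enum import Enum
from typing import Optional, Sequence


class Outcome(Enum):
    SUBMITTED = "submitted"                    # submit_result was called
    TERMINATED_BY_TOOL = "terminated_by_tool"  # a tool returned after_execution=terminate
    CONTINUE = "continue"                      # other tool calls: go to the next step
    REQUIRE_TOOL = "require_tool"              # write sys.must_end_with_required, re-enqueue
    AUTO_DELIVER = "auto_deliver"              # assistant content becomes task.deliverable


PLACEHOLDER = "(empty response)"  # hypothetical placeholder text


def decide_turn_end(tool_calls: Sequence[str], must_end_with: Sequence[str],
                    tool_requested_terminate: bool) -> Outcome:
    if "submit_result" in tool_calls:
        return Outcome.SUBMITTED
    if tool_requested_terminate:
        # must_end_with is bypassed on this path, as the note above says.
        return Outcome.TERMINATED_BY_TOOL
    if tool_calls:
        return Outcome.CONTINUE
    # No tool calls: the Agent is attempting natural completion.
    if must_end_with:
        return Outcome.REQUIRE_TOOL
    return Outcome.AUTO_DELIVER


def deliverable_text(content: Optional[str]) -> str:
    """Use the assistant content, or placeholder text when it is empty."""
    return content or PLACEHOLDER
```

Written this way, the bypass is easy to see: the must_end_with check is only reached when there are no tool calls at all.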
