-
Notifications
You must be signed in to change notification settings - Fork 98
refactor: Separate task state building from persistence #508
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 3 commits
ebe5673
a21ec66
bae253a
b8215d2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,150 @@ | ||
| # ADR 0001: Task State Management Refactoring | ||
|
|
||
| ## Status | ||
|
|
||
| Accepted | ||
|
|
||
| ## Context | ||
|
|
||
| The original implementation of task state management had a significant architectural issue: | ||
|
|
||
| **Multiple Persistence Operations**: Task state changes were being persisted multiple times during event propagation. The `ResultAggregator` would save task state for each event processed, resulting in redundant writes to `TaskStore` for a single request. This created unnecessary I/O load and coupling between event processing and persistence. | ||
|
|
||
| ## Decision | ||
|
|
||
| We refactored the task state management to follow a two-phase approach with proper lifecycle management: | ||
|
|
||
| ### Separate State Building from Persistence | ||
|
|
||
| **Introduced `TaskStateProcessor` for In-Memory State Management**: | ||
| - Created per `DefaultRequestHandler` instance to maintain request handler's in-flight tasks | ||
| - Maintains task state in memory during event processing | ||
| - Provides methods to build task state from events without persisting | ||
| - Includes `removeTask()` method for explicit cleanup | ||
|
|
||
| **Modified Task Lifecycle**: | ||
| - Events are processed to build state in `TaskStateProcessor` without immediate persistence | ||
| - State is persisted **once** to `TaskStore` at appropriate lifecycle points (completion, cancellation, etc.) | ||
| - Tasks are explicitly removed from `TaskStateProcessor` after final persistence | ||
|
|
||
| ### Task Cleanup Strategy | ||
|
|
||
| Tasks are removed from the state processor when they reach their final state: | ||
|
|
||
| 1. **Blocking Message Sends**: After all events are processed and final state is persisted | ||
| 2. **Task Cancellations**: After the canceled task state is persisted | ||
| 3. **Non-blocking/Background Operations**: After background consumption completes and final state is persisted | ||
|
|
||
| ### Component Architecture | ||
|
|
||
| **TaskStateProcessor** (new component): | ||
| - Instance created per `DefaultRequestHandler` to manage its in-flight tasks | ||
| - Provides thread-safe access via `ConcurrentHashMap` | ||
| - Separates state building from persistence concerns | ||
| - Enables explicit lifecycle management with `removeTask()` | ||
|
|
||
| **DefaultRequestHandler**: | ||
| - Creates and manages its own `TaskStateProcessor` instance | ||
| - Ensures tasks are removed after final persistence | ||
| - Passes state processor to components that need it | ||
|
|
||
| **ResultAggregator**: | ||
| - Uses `TaskStateProcessor` to build state during event consumption | ||
| - No longer performs persistence during event processing | ||
| - Removes tasks after background consumption completes | ||
|
|
||
| **TaskManager**: | ||
| - Delegates state building to `TaskStateProcessor` | ||
| - Coordinates between state processor and persistent store | ||
| - Supports dynamic task ID assignment for new tasks | ||
|
|
||
| ## Consequences | ||
|
|
||
| ### Positive | ||
|
|
||
| 1. **Reduced I/O Operations**: Task state is persisted once per request lifecycle instead of multiple times during event propagation, significantly reducing database/storage load | ||
| 2. **No Memory Leaks**: Tasks are explicitly removed from in-memory state after completion, ensuring memory usage scales with concurrent tasks rather than total tasks processed | ||
| 3. **Better Test Isolation**: Each test creates its own state processor instance, providing natural isolation | ||
| 4. **Clear Separation of Concerns**: State building logic is separate from persistence logic, improving maintainability | ||
| 5. **Thread-Safe Design**: Uses concurrent data structures for safe access from multiple threads | ||
|
|
||
| ### Negative | ||
|
|
||
| 1. **Increased Complexity**: More components involved in task lifecycle management | ||
| 2. **Lifecycle Management Responsibility**: Must ensure cleanup is called at all task completion points | ||
| 3. **Constructor Changes**: All components creating `TaskManager` and `ResultAggregator` need updates to pass `TaskStateProcessor` | ||
|
|
||
| ### Test Impact | ||
|
|
||
| Test infrastructure was updated to create `TaskStateProcessor` instances: | ||
| - Test utilities updated to create and pass `TaskStateProcessor` instances | ||
| - Each test creates its own state processor for proper isolation | ||
| - Test helper methods updated to handle non-existent tasks gracefully | ||
|
|
||
| ## Impacts | ||
|
|
||
| ### Performance | ||
| - **Improved**: Significantly reduced database/storage operations | ||
|
|
||
| ### Memory | ||
| - **Bounded**: Memory usage scales with concurrent tasks, not total tasks processed | ||
| - **Predictable**: Tasks are removed from memory after completion | ||
|
|
||
| ### Reliability | ||
| - **Improved**: Test isolation ensures reproducible test results | ||
| - **Improved**: Clearer task lifecycle reduces potential for bugs | ||
|
|
||
| ## Outstanding Considerations | ||
|
|
||
| ### Streaming Task Lifecycle | ||
|
|
||
| For streaming responses where clients disconnect mid-stream, background consumption handles cleanup. Tasks remain in memory until background processing completes, creating a brief retention window. | ||
|
|
||
| **Impact**: Low - tasks are eventually cleaned up, retention is temporary | ||
|
|
||
| ### Error Handling Edge Cases | ||
|
|
||
| If catastrophic failures occur during event processing before final persistence, tasks might remain orphaned in `TaskStateProcessor`. | ||
|
|
||
| **Mitigation**: Most error paths persist task state (including error information), triggering cleanup | ||
|
|
||
| **Recommendation**: Consider adding periodic sweep of old tasks or timeout-based cleanup | ||
|
|
||
| ### Concurrent Access Patterns | ||
|
|
||
| The `TaskStateProcessor` ensures thread-safe access via concurrent data structures. Event ordering is maintained by the underlying `EventQueue` system. | ||
|
|
||
| **Impact**: None - existing event ordering guarantees are preserved | ||
|
|
||
| ## Future Enhancements | ||
|
|
||
| 1. **Observability**: Add metrics for in-flight task count to monitor system health | ||
| 2. **Cleanup Monitoring**: Add logging/metrics when tasks are removed for debugging | ||
| 3. **Timeout Cleanup**: Implement periodic sweep of tasks exceeding age threshold | ||
| 4. **Retention Policies**: Consider configurable retention for debugging (e.g., keep recent tasks for N minutes) | ||
|
|
||
| ## Verification | ||
|
|
||
| All tests passing with the refactoring: | ||
| - server-common: 223 tests | ||
| - QuarkusA2AJSONRPCTest: 42 tests | ||
| - QuarkusA2AGrpcTest: 42 tests | ||
|
|
||
| Recommended manual testing: | ||
| - Long-running tasks to verify no memory growth | ||
| - Streaming scenarios with client disconnects | ||
| - Error scenarios to verify cleanup | ||
| - Concurrent task processing | ||
|
|
||
| ## Files Changed | ||
|
|
||
| Core implementation: | ||
| - `server-common/src/main/java/io/a2a/server/tasks/TaskStateProcessor.java` (new) | ||
| - `server-common/src/main/java/io/a2a/server/requesthandlers/DefaultRequestHandler.java` | ||
| - `server-common/src/main/java/io/a2a/server/tasks/ResultAggregator.java` | ||
| - `server-common/src/main/java/io/a2a/server/tasks/TaskManager.java` | ||
|
|
||
| Test infrastructure: | ||
| - `server-common/src/test/java/io/a2a/server/tasks/TaskStateProcessorTest.java` (new) | ||
| - `tests/server-common/src/test/java/io/a2a/server/apps/common/AbstractA2AServerTest.java` | ||
| - All test files using `TaskManager` and `ResultAggregator` | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -28,6 +28,7 @@ | |
| import io.a2a.server.ServerCallContext; | ||
| import io.a2a.server.agentexecution.AgentExecutor; | ||
| import io.a2a.server.agentexecution.RequestContext; | ||
| import io.a2a.server.tasks.TaskStateProcessor; | ||
| import io.a2a.server.agentexecution.SimpleRequestContextBuilder; | ||
| import io.a2a.server.events.EnhancedRunnable; | ||
| import io.a2a.server.events.EventConsumer; | ||
|
|
@@ -103,6 +104,7 @@ public class DefaultRequestHandler implements RequestHandler { | |
|
|
||
| private final AgentExecutor agentExecutor; | ||
| private final TaskStore taskStore; | ||
| private final TaskStateProcessor stateProcessor; | ||
| private final QueueManager queueManager; | ||
| private final PushNotificationConfigStore pushConfigStore; | ||
| private final PushNotificationSender pushSender; | ||
|
|
@@ -119,6 +121,7 @@ public DefaultRequestHandler(AgentExecutor agentExecutor, TaskStore taskStore, | |
| PushNotificationSender pushSender, @Internal Executor executor) { | ||
| this.agentExecutor = agentExecutor; | ||
| this.taskStore = taskStore; | ||
| this.stateProcessor = new TaskStateProcessor(); | ||
| this.queueManager = queueManager; | ||
| this.pushConfigStore = pushConfigStore; | ||
| this.pushSender = pushSender; | ||
|
|
@@ -223,9 +226,10 @@ public Task onCancelTask(TaskIdParams params, ServerCallContext context) throws | |
| task.getId(), | ||
| task.getContextId(), | ||
| taskStore, | ||
| stateProcessor, | ||
| null); | ||
|
|
||
| ResultAggregator resultAggregator = new ResultAggregator(taskManager, null, executor); | ||
| ResultAggregator resultAggregator = new ResultAggregator(taskManager, null, executor, stateProcessor); | ||
|
|
||
| EventQueue queue = queueManager.tap(task.getId()); | ||
| if (queue == null) { | ||
|
|
@@ -249,13 +253,26 @@ public Task onCancelTask(TaskIdParams params, ServerCallContext context) throws | |
| throw new InternalError("Agent did not return valid response for cancel"); | ||
| } | ||
|
|
||
| // Persist the final task state (after all cancel events have been processed) | ||
| // This ensures state is saved ONCE before returning to client | ||
| Task finalTask = taskManager.getTask(); | ||
| if (finalTask != null) { | ||
| finalTask = taskManager.saveTask(finalTask); | ||
| } else { | ||
| finalTask = tempTask; | ||
| } | ||
jmesnil marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| // Verify task was actually canceled (not completed concurrently) | ||
| if (tempTask.getStatus().state() != TaskState.CANCELED) { | ||
| if (finalTask.getStatus().state() != TaskState.CANCELED) { | ||
| throw new TaskNotCancelableError( | ||
| "Task cannot be canceled - current state: " + tempTask.getStatus().state().asString()); | ||
| "Task cannot be canceled - current state: " + finalTask.getStatus().state().asString()); | ||
| } | ||
|
|
||
| return tempTask; | ||
| // Remove task from state processor after cancellation is complete | ||
| stateProcessor.removeTask(finalTask.getId()); | ||
| LOGGER.debug("Removed task {} from state processor after cancellation", finalTask.getId()); | ||
|
|
||
| return finalTask; | ||
| } | ||
|
|
||
| @Override | ||
|
|
@@ -267,7 +284,7 @@ public EventKind onMessageSend(MessageSendParams params, ServerCallContext conte | |
| LOGGER.debug("Request context taskId: {}", taskId); | ||
|
|
||
| EventQueue queue = queueManager.createOrTap(taskId); | ||
| ResultAggregator resultAggregator = new ResultAggregator(mss.taskManager, null, executor); | ||
| ResultAggregator resultAggregator = new ResultAggregator(mss.taskManager, null, executor, stateProcessor); | ||
|
|
||
| boolean blocking = true; // Default to blocking behavior | ||
| if (params.configuration() != null && Boolean.FALSE.equals(params.configuration().blocking())) { | ||
|
|
@@ -320,7 +337,7 @@ public EventKind onMessageSend(MessageSendParams params, ServerCallContext conte | |
| // 1. Wait for agent to finish enqueueing events | ||
| // 2. Close the queue to signal consumption can complete | ||
| // 3. Wait for consumption to finish processing events | ||
| // 4. Fetch final task state from TaskStore | ||
| // 4. Persist final task state ONCE to TaskStore | ||
|
|
||
| try { | ||
| // Step 1: Wait for agent to finish (with configurable timeout) | ||
|
|
@@ -360,15 +377,29 @@ public EventKind onMessageSend(MessageSendParams params, ServerCallContext conte | |
| throw new InternalError(msg); | ||
| } | ||
|
|
||
| // Step 4: Fetch the final task state from TaskStore (all events have been processed) | ||
| Task updatedTask = taskStore.get(taskId); | ||
| if (updatedTask != null) { | ||
| kind = updatedTask; | ||
| // Step 4: Persist the final task state (all events have been processed into currentTask) | ||
| // This ensures task state is saved ONCE before returning to client | ||
| Task finalTask = mss.taskManager.getTask(); | ||
| if (finalTask != null) { | ||
| finalTask = mss.taskManager.saveTask(finalTask); | ||
| kind = finalTask; | ||
| if (LOGGER.isDebugEnabled()) { | ||
| LOGGER.debug("Fetched final task for {} with state {} and {} artifacts", | ||
| taskId, updatedTask.getStatus().state(), | ||
| updatedTask.getArtifacts().size()); | ||
| LOGGER.debug("Persisted final task for {} with state {} and {} artifacts", | ||
| taskId, finalTask.getStatus().state(), | ||
| finalTask.getArtifacts().size()); | ||
| } | ||
| // Remove task from state processor after final persistence | ||
| stateProcessor.removeTask(taskId); | ||
| LOGGER.debug("Removed task {} from state processor after final persistence", taskId); | ||
| } | ||
| } else if (interruptedOrNonBlocking) { | ||
| // For non-blocking calls: persist the current state immediately | ||
| // Note: Do NOT remove from state processor here - background consumption may still be running | ||
| Task currentTask = mss.taskManager.getTask(); | ||
| if (currentTask != null) { | ||
| currentTask = mss.taskManager.saveTask(currentTask); | ||
| kind = currentTask; | ||
| LOGGER.debug("Persisted task state for non-blocking call: {}", taskId); | ||
| } | ||
| } | ||
| if (kind instanceof Task taskResult && !taskId.equals(taskResult.getId())) { | ||
|
|
@@ -401,7 +432,7 @@ public Flow.Publisher<StreamingEventKind> onMessageSendStream( | |
| AtomicReference<String> taskId = new AtomicReference<>(mss.requestContext.getTaskId()); | ||
| EventQueue queue = queueManager.createOrTap(taskId.get()); | ||
| LOGGER.debug("Created/tapped queue for task {}: {}", taskId.get(), queue); | ||
| ResultAggregator resultAggregator = new ResultAggregator(mss.taskManager, null, executor); | ||
| ResultAggregator resultAggregator = new ResultAggregator(mss.taskManager, null, executor, stateProcessor); | ||
|
|
||
| EnhancedRunnable producerRunnable = registerAndExecuteAgentAsync(taskId.get(), mss.requestContext, queue); | ||
|
|
||
|
|
@@ -419,6 +450,14 @@ public Flow.Publisher<StreamingEventKind> onMessageSendStream( | |
| Flow.Publisher<EventQueueItem> processed = | ||
| processor(createTubeConfig(), results, ((errorConsumer, item) -> { | ||
| Event event = item.getEvent(); | ||
|
|
||
| // For streaming: persist task state after each event before propagating | ||
| // This ensures state is saved BEFORE the event is sent to the client | ||
| Task currentTaskState = mss.taskManager.getTask(); | ||
| if (currentTaskState != null) { | ||
| mss.taskManager.saveTask(currentTaskState); | ||
| } | ||
|
Comment on lines
+454
to
+459
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This change introduces task state persistence on every event for streaming responses. While the comment explains the rationale is to save state before sending an event to the client, this seems to contradict the primary goal of this refactoring, which is to reduce I/O and persist the task state only once at the end of its lifecycle. The PR description states: "Tasks are now persisted once at appropriate lifecycle points instead of for each event". This change for streaming calls deviates from that principle. If this per-event persistence is indeed necessary for streaming to handle client disconnects and allow for resubscription, this trade-off should be explicitly documented in the Architectural Decision Record (ADR) to clarify why streaming responses are treated differently from blocking/non-blocking calls. The current ADR gives the impression that all task processing now follows the "persist once" model. |
||
|
|
||
| if (event instanceof Task createdTask) { | ||
| if (!Objects.equals(taskId.get(), createdTask.getId())) { | ||
| errorConsumer.accept(new InternalError("Task ID mismatch in agent response")); | ||
|
|
@@ -600,8 +639,8 @@ public Flow.Publisher<StreamingEventKind> onResubscribeToTask( | |
| throw new TaskNotFoundError(); | ||
| } | ||
|
|
||
| TaskManager taskManager = new TaskManager(task.getId(), task.getContextId(), taskStore, null); | ||
| ResultAggregator resultAggregator = new ResultAggregator(taskManager, null, executor); | ||
| TaskManager taskManager = new TaskManager(task.getId(), task.getContextId(), taskStore, stateProcessor, null); | ||
| ResultAggregator resultAggregator = new ResultAggregator(taskManager, null, executor, stateProcessor); | ||
| EventQueue queue = queueManager.tap(task.getId()); | ||
| LOGGER.debug("onResubscribeToTask - tapped queue: {}", queue != null ? System.identityHashCode(queue) : "null"); | ||
|
|
||
|
|
@@ -797,6 +836,7 @@ private MessageSendSetup initMessageSend(MessageSendParams params, ServerCallCon | |
| params.message().getTaskId(), | ||
| params.message().getContextId(), | ||
| taskStore, | ||
| stateProcessor, | ||
| params.message()); | ||
|
|
||
| Task task = taskManager.getTask(); | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.