Note: This document is for SDK developers and contributors. For user-facing documentation, see the README.
This document explains the internal architecture, threading model, and extension points to help contributors understand how the SDK works under the hood. Core design decisions and advanced concepts are further outlined in the Architecture Decision Records.
aws-durable-execution-sdk-java/
├── sdk/ # Core SDK - DurableHandler, DurableContext, operations
├── sdk-testing/ # Test utilities for local and cloud testing
├── sdk-integration-tests/ # Integration tests using LocalDurableTestRunner
└── examples/ # Real-world usage patterns as customers would implement them
| Module | Purpose | Key Classes |
|---|---|---|
| sdk | Core runtime - extend DurableHandler, use DurableContext for durable operations | DurableHandler, DurableContext, DurableExecutor, ExecutionManager |
| sdk-testing | Test utilities: LocalDurableTestRunner (in-memory, simulates re-invocations and time-skipping) and CloudDurableTestRunner (executes against deployed Lambda) | LocalDurableTestRunner, CloudDurableTestRunner, LocalMemoryExecutionClient, TestResult |
| sdk-integration-tests | Dogfooding tests - validates the SDK using its own test utilities. Separate module keeps dependencies acyclic: sdk → sdk-testing → sdk-integration-tests. | Test classes only |
| examples | Real-world usage patterns as customers would implement them, with local and cloud tests | Example handlers, CloudBasedIntegrationTest |
// Synchronous step (func receives a StepContext)
T step(String name, Class<T> type, Function<StepContext, T> func)
T step(String name, Class<T> type, Function<StepContext, T> func, StepConfig config)
T step(String name, TypeToken<T> type, Function<StepContext, T> func)
T step(String name, TypeToken<T> type, Function<StepContext, T> func, StepConfig config)
// Asynchronous step
DurableFuture<T> stepAsync(String name, Class<T> type, Function<StepContext, T> func)
DurableFuture<T> stepAsync(String name, Class<T> type, Function<StepContext, T> func, StepConfig config)
DurableFuture<T> stepAsync(String name, TypeToken<T> type, Function<StepContext, T> func)
DurableFuture<T> stepAsync(String name, TypeToken<T> type, Function<StepContext, T> func, StepConfig config)
// Wait
Void wait(String name, Duration duration)
// Asynchronous wait
DurableFuture<Void> waitAsync(String name, Duration duration)
// Invoke
T invoke(String name, String functionName, U payload, Class<T> resultType)
T invoke(String name, String functionName, U payload, TypeToken<T> resultType)
T invoke(String name, String functionName, U payload, Class<T> resultType, InvokeConfig config)
T invoke(String name, String functionName, U payload, TypeToken<T> resultType, InvokeConfig config)
DurableFuture<T> invokeAsync(String name, String functionName, U payload, Class<T> resultType)
DurableFuture<T> invokeAsync(String name, String functionName, U payload, Class<T> resultType, InvokeConfig config)
DurableFuture<T> invokeAsync(String name, String functionName, U payload, TypeToken<T> resultType)
DurableFuture<T> invokeAsync(String name, String functionName, U payload, TypeToken<T> resultType, InvokeConfig config)
// Callback
DurableCallbackFuture<T> createCallback(String name, Class<T> resultType)
DurableCallbackFuture<T> createCallback(String name, Class<T> resultType, CallbackConfig config)
DurableCallbackFuture<T> createCallback(String name, TypeToken<T> resultType)
DurableCallbackFuture<T> createCallback(String name, TypeToken<T> resultType, CallbackConfig config)
// Wait for callback (combines callback creation + submitter step)
T waitForCallback(String name, Class<T> resultType, BiConsumer<String, StepContext> func)
T waitForCallback(String name, TypeToken<T> resultType, BiConsumer<String, StepContext> func)
T waitForCallback(String name, Class<T> resultType, BiConsumer<String, StepContext> func, WaitForCallbackConfig config)
T waitForCallback(String name, TypeToken<T> resultType, BiConsumer<String, StepContext> func, WaitForCallbackConfig config)
DurableFuture<T> waitForCallbackAsync(String name, Class<T> resultType, BiConsumer<String, StepContext> func)
DurableFuture<T> waitForCallbackAsync(String name, TypeToken<T> resultType, BiConsumer<String, StepContext> func)
DurableFuture<T> waitForCallbackAsync(String name, Class<T> resultType, BiConsumer<String, StepContext> func, WaitForCallbackConfig config)
DurableFuture<T> waitForCallbackAsync(String name, TypeToken<T> resultType, BiConsumer<String, StepContext> func, WaitForCallbackConfig config)
// Child context
T runInChildContext(String name, Class<T> resultType, Function<DurableContext, T> func)
T runInChildContext(String name, TypeToken<T> resultType, Function<DurableContext, T> func)
T runInChildContext(String name, Class<T> resultType, Function<DurableContext, T> func, RunInChildContextConfig config)
T runInChildContext(String name, TypeToken<T> resultType, Function<DurableContext, T> func, RunInChildContextConfig config)
DurableFuture<T> runInChildContextAsync(String name, Class<T> resultType, Function<DurableContext, T> func)
DurableFuture<T> runInChildContextAsync(String name, TypeToken<T> resultType, Function<DurableContext, T> func)
DurableFuture<T> runInChildContextAsync(String name, Class<T> resultType, Function<DurableContext, T> func, RunInChildContextConfig config)
DurableFuture<T> runInChildContextAsync(String name, TypeToken<T> resultType, Function<DurableContext, T> func, RunInChildContextConfig config)
// Map
MapResult<O> map(String name, Collection<I> items, Class<O> resultType, MapFunction<I, O> function)
MapResult<O> map(String name, Collection<I> items, Class<O> resultType, MapFunction<I, O> function, MapConfig config)
MapResult<O> map(String name, Collection<I> items, TypeToken<O> resultType, MapFunction<I, O> function)
MapResult<O> map(String name, Collection<I> items, TypeToken<O> resultType, MapFunction<I, O> function, MapConfig config)
DurableFuture<MapResult<O>> mapAsync(String name, Collection<I> items, Class<O> resultType, MapFunction<I, O> function)
DurableFuture<MapResult<O>> mapAsync(String name, Collection<I> items, Class<O> resultType, MapFunction<I, O> function, MapConfig config)
DurableFuture<MapResult<O>> mapAsync(String name, Collection<I> items, TypeToken<O> resultType, MapFunction<I, O> function)
DurableFuture<MapResult<O>> mapAsync(String name, Collection<I> items, TypeToken<O> resultType, MapFunction<I, O> function, MapConfig config)
// Parallel
ParallelDurableFuture parallel(String name)
ParallelDurableFuture parallel(String name, ParallelConfig config)
// Wait for condition
T waitForCondition(String name, Class<T> resultType, BiFunction<T, StepContext, WaitForConditionResult<T>> checkFunc)
T waitForCondition(String name, Class<T> resultType, BiFunction<T, StepContext, WaitForConditionResult<T>> checkFunc, WaitForConditionConfig<T> config)
T waitForCondition(String name, TypeToken<T> resultType, BiFunction<T, StepContext, WaitForConditionResult<T>> checkFunc)
T waitForCondition(String name, TypeToken<T> resultType, BiFunction<T, StepContext, WaitForConditionResult<T>> checkFunc, WaitForConditionConfig<T> config)
DurableFuture<T> waitForConditionAsync(String name, Class<T> resultType, BiFunction<T, StepContext, WaitForConditionResult<T>> checkFunc)
DurableFuture<T> waitForConditionAsync(String name, Class<T> resultType, BiFunction<T, StepContext, WaitForConditionResult<T>> checkFunc, WaitForConditionConfig<T> config)
DurableFuture<T> waitForConditionAsync(String name, TypeToken<T> resultType, BiFunction<T, StepContext, WaitForConditionResult<T>> checkFunc)
DurableFuture<T> waitForConditionAsync(String name, TypeToken<T> resultType, BiFunction<T, StepContext, WaitForConditionResult<T>> checkFunc, WaitForConditionConfig<T> config)
// Lambda context access
Context getLambdaContext()

T get() // Blocks until complete, may suspend
static <T> List<T> allOf(DurableFuture<T>... futures) // Collect all results in order
static Object anyOf(DurableFuture<?>... futures) // Return first completed result

public class MyHandler extends DurableHandler<Input, Output> {
    @Override
    protected DurableConfig createConfiguration() {
        return DurableConfig.builder()
            .withLambdaClientBuilder(customLambdaClientBuilder)
            .withSerDes(new CustomSerDes())
            .withExecutorService(Executors.newFixedThreadPool(4))
            .build();
    }
}

| Option | Default |
|---|---|
| lambdaClientBuilder | Auto-created LambdaClient for current region, primed for performance (see DurableConfig.java) |
| serDes | JacksonSerDes |
| executorService | Executors.newCachedThreadPool() (for user-defined operations only) |
| loggerConfig | LoggerConfig.defaults() (suppress replay logs) |
| pollingStrategy | Exponential backoff: 1s base, 2x rate, FULL jitter, 10s max |
| checkpointDelay | Duration.ofSeconds(0) (checkpoint as soon as possible) |
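To make the default pollingStrategy row concrete, here is a minimal sketch of exponential backoff with full jitter using the numbers from the table (1s base, 2x rate, 10s max). The class and method names are hypothetical and this is not the SDK's PollingStrategies code; it also assumes "FULL jitter" means a uniform pick between zero and the capped delay.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration only; not part of the SDK.
class BackoffSketch {

    /** Delay for a 0-based attempt before jitter: base * rate^attempt, capped at max. */
    static Duration cappedDelay(int attempt, Duration base, double rate, Duration max) {
        double millis = base.toMillis() * Math.pow(rate, attempt);
        return Duration.ofMillis((long) Math.min(millis, max.toMillis()));
    }

    /** Full jitter (assumed meaning): uniform random delay in [0, cappedDelay]. */
    static Duration fullJitterDelay(int attempt, Duration base, double rate, Duration max) {
        long cap = cappedDelay(attempt, base, rate, max).toMillis();
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(cap + 1));
    }

    public static void main(String[] args) {
        Duration base = Duration.ofSeconds(1);
        Duration max = Duration.ofSeconds(10);
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println("attempt " + attempt
                    + " cap=" + cappedDelay(attempt, base, 2.0, max).toMillis() + "ms"
                    + " jittered=" + fullJitterDelay(attempt, base, 2.0, max).toMillis() + "ms");
        }
    }
}
```

With these inputs the cap grows 1s, 2s, 4s, 8s and then stays at 10s; the jittered value varies per call.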
The SDK uses two separate thread pools with distinct responsibilities:
User Executor (DurableConfig.executorService):
- Runs user-defined operations (the code passed to ctx.step() and ctx.stepAsync())
- Configurable via DurableConfig.builder().withExecutorService()
- Default: cached daemon thread pool
Internal Executor (InternalExecutor.INSTANCE):
- Runs SDK coordination tasks: checkpoint batching, polling for wait completion
- Dedicated cached thread pool with daemon threads named durable-sdk-internal-*
- Not configurable by users
Benefits of this separation:
| Benefit | Description |
|---|---|
| Isolation | User operations can't starve SDK internals, and vice versa |
| No shutdown management | Internal pool uses daemon threads; SDK coordination continues even if the user's executor is shut down |
| Efficient resource usage | Cached thread pool creates threads on demand and reuses idle threads (60s timeout) |
| Daemon threads | Internal threads won't prevent JVM shutdown |
| Single configuration point | Changing InternalExecutor.INSTANCE in one place affects all SDK coordination |
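For contributors unfamiliar with the pattern, the following sketch shows one way to build a cached pool of named daemon threads with only the JDK. It is an illustration under assumptions: the class and method names are hypothetical, and InternalExecutor may be implemented differently.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not the SDK's InternalExecutor.
class DaemonPoolSketch {

    /** Creates threads named prefix + counter and marks them as daemon threads. */
    static ThreadFactory daemonFactory(String prefix) {
        AtomicInteger counter = new AtomicInteger();
        return runnable -> {
            Thread thread = new Thread(runnable, prefix + counter.getAndIncrement());
            thread.setDaemon(true); // daemon threads do not block JVM shutdown
            return thread;
        };
    }

    public static void main(String[] args) throws Exception {
        // Cached pool: threads are created on demand and idle threads are reused.
        ExecutorService pool = Executors.newCachedThreadPool(daemonFactory("durable-sdk-internal-"));
        String name = pool.submit(() -> Thread.currentThread().getName()).get();
        System.out.println("task ran on " + name);
        pool.shutdown();
    }
}
```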
Example: Custom thread pool for user operations:
@Override
protected DurableConfig createConfiguration() {
    var executor = new ThreadPoolExecutor(
        4, 10,                           // core/max threads
        60L, TimeUnit.SECONDS,           // idle timeout
        new LinkedBlockingQueue<>(100),  // bounded queue
        new ThreadFactoryBuilder()
            .setNameFormat("order-processor-%d")
            .setDaemon(true)
            .build());
    return DurableConfig.builder()
        .withExecutorService(executor)
        .build();
}

context.step("name", Type.class, stepCtx -> doWork(),
    StepConfig.builder()
        .serDes(stepSpecificSerDes)
        .retryStrategy(RetryStrategies.exponentialBackoff(3, Duration.ofSeconds(1)))
        .semantics(AT_MOST_ONCE_PER_RETRY)
        .build());

┌─────────────────────────────────────────────────────────────────────────┐
│ Lambda Runtime │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ DurableHandler<I,O> │
│ - Entry point (RequestStreamHandler) │
│ - Extracts input type via reflection │
│ - Delegates to DurableExecutor │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ DurableExecutor │
│ - Creates ExecutionManager, DurableContext │
│ - Runs handler in executor │
│ - Waits for completion OR suspension │
│ - Returns SUCCESS/PENDING/FAILED │
└─────────────────────────────────────────────────────────────────────────┘
│
┌───────────────┴───────────────┐
▼ ▼
┌──────────────────────────────┐ ┌─────────────────────────────────┐
│ DurableContext │ │ ExecutionManager │
│ - User-facing API │ │ - State (ops, token) │
│ - step(), stepAsync() │ │ - Thread coordination │
│ - wait(), waitAsync() │ │ - Checkpoint batching │
│ - invoke(), invokeAsync() │ │ - Checkpoint response handling │
│ - createCallback() │ │ - Polling │
│ - waitForCallback() │ └─────────────────────────────────┘
│ - runInChildContext() │
│ - map(), mapAsync() │
│ - parallel() │
│ - waitForCondition() │
│ - Operation ID counter │
└──────────────────────────────┘
│ │
▼ ▼
┌──────────────────────────────┐ ┌──────────────────────────────┐
│ Operations │ │ CheckpointBatcher │
│ - StepOperation<T> │ │ - Queues requests │
│ - WaitOperation │ │ - Batches API calls (750KB) │
│ - InvokeOperation<T> │ │ │
│ - CallbackOperation<T> │ │ - Notifies via callback │
│ - WaitForConditionOperation │ └──────────────────────────────┘
│ - ConcurrencyOperation<T> │
│ - MapOperation<I,O> │
│ - ParallelOperation │
│ - ChildContextOperation<T> │
│ - execute() / get() │
└──────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ DurableExecutionClient │
│ - checkpoint() │
│ - getExecutionState() │
└──────────────────────────────┘
software.amazon.lambda.durable
├── DurableHandler<I,O> # Entry point
├── DurableExecutor # Lifecycle orchestration
├── DurableContext # User API (interface)
├── DurableFuture<T> # Async handle
├── DurableCallbackFuture<T> # Callback future with callbackId
├── ParallelDurableFuture # Parallel branch registration + AutoCloseable
├── StepContext # Context passed to step functions
├── TypeToken<T> # Generic type capture
│
├── config/
│ ├── StepConfig # Step configuration (retry, semantics, serDes)
│ ├── InvokeConfig # Invoke configuration (payload/result serDes, tenantId)
│ ├── CallbackConfig # Callback configuration (timeout, heartbeat, serDes)
│ ├── WaitForCallbackConfig # Composite callback + step config
│ ├── MapConfig # Map configuration (concurrency, completion, serDes)
│ ├── ParallelConfig # Parallel configuration (concurrency, completion)
│ ├── ParallelBranchConfig # Per-branch configuration
│ ├── RunInChildContextConfig # Child context configuration
│ ├── WaitForConditionConfig<T> # Polling configuration (wait strategy, serDes, initialState)
│ └── CompletionConfig # Completion criteria for map/parallel
│
├── context/
│ └── BaseContext # Base interface for DurableContext
│
├── execution/
│ ├── ExecutionManager # Central coordinator
│ ├── ExecutionMode # REPLAY or EXECUTION state
│ ├── CheckpointBatcher # Batching (package-private)
│ ├── CheckpointCallback # Callback interface
│ ├── SuspendExecutionException
│ └── ThreadType # CONTEXT, STEP
│
├── operation/
│ ├── BaseDurableOperation<T> # Common operation logic
│ ├── StepOperation<T> # Step logic
│ ├── InvokeOperation<T> # Invoke logic
│ ├── CallbackOperation<T> # Callback logic
│ ├── WaitOperation # Wait logic
│ ├── WaitForConditionOperation<T> # Polling condition logic
│ ├── ConcurrencyOperation<T> # Shared base for map/parallel
│ ├── MapOperation<I,O> # Map operation logic
│ ├── ParallelOperation # Parallel operation logic
│ └── ChildContextOperation<T> # Per-item child context execution
│
├── logging/
│ ├── DurableLogger # Context-aware logger wrapper (MDC-based)
│ └── LoggerConfig # Replay suppression config
│
├── retry/
│ ├── RetryStrategy # Interface
│ ├── RetryStrategies # Presets
│ ├── RetryDecision # shouldRetry + delay
│ ├── JitterStrategy # Jitter options
│ ├── PollingStrategy # Backend polling interface
│ ├── PollingStrategies # Backend polling presets
│ ├── WaitForConditionWaitStrategy # Polling delay interface
│ └── WaitStrategies # Polling strategy factory + Presets
│
├── client/
│ ├── DurableExecutionClient # Interface
│ └── LambdaDurableFunctionsClient # AWS SDK impl
│
├── model/
│ ├── DurableExecutionInput # Lambda input
│ ├── DurableExecutionOutput # Lambda output
│ ├── ExecutionStatus # SUCCEEDED/PENDING/FAILED
│ ├── MapResult<T> # Map operation result container
│ ├── MapResult.MapResultItem<T> # Per-item result (status, result, error)
│ ├── MapResult.MapError # Serializable error details
│ ├── ParallelResult # Parallel operation summary
│ ├── ConcurrencyCompletionStatus # ALL_COMPLETED/MIN_SUCCESSFUL_REACHED/FAILURE_TOLERANCE_EXCEEDED
│ └── WaitForConditionResult<T> # Check function return type (value + isDone)
│
├── serde/
│ ├── SerDes # Interface
│ ├── JacksonSerDes # Jackson impl
│ └── AwsSdkV2Module # SDK type support
│
└── exception/
├── DurableExecutionException
├── UnrecoverableDurableExecutionException
├── NonDeterministicExecutionException
├── IllegalDurableOperationException
├── DurableOperationException
├── StepException
├── StepFailedException
├── StepInterruptedException
├── InvokeException
├── InvokeFailedException
├── InvokeTimedOutException
├── InvokeStoppedException
├── CallbackException
├── CallbackFailedException
├── CallbackTimeoutException
├── CallbackSubmitterException
├── WaitForConditionFailedException
├── ChildContextFailedException
├── MapIterationFailedException
├── ParallelBranchFailedException
└── SerDesException
sequenceDiagram
participant UC as User Code
participant DC as DurableContext
participant SO as StepOperation
participant EM as ExecutionManager
participant Backend
UC->>DC: step("name", Type.class, stepCtx -> doWork())
DC->>SO: new StepOperation(...)
DC->>SO: execute()
SO->>EM: sendOperationUpdate(START)
EM->>Backend: checkpoint(START)
SO->>SO: func.apply(stepContext) [execute user code]
SO->>EM: sendOperationUpdate(SUCCEED)
EM->>Backend: checkpoint(SUCCEED)
DC->>SO: get()
SO-->>DC: result
DC-->>UC: result
sequenceDiagram
participant LR as Lambda Runtime
participant DE as DurableExecutor
participant UC as User Code
participant DC as DurableContext
participant SO as StepOperation
participant EM as ExecutionManager
Note over LR: Re-invocation with existing state
LR->>DE: execute(input with operations)
DE->>EM: new ExecutionManager(existingOps)
UC->>DC: step("step1", ...)
DC->>SO: execute()
SO->>EM: getOperation("1")
EM-->>SO: existing op (SUCCEEDED)
Note over SO: Skip execution
DC->>SO: get()
SO-->>DC: cached result
DC-->>UC: result
sequenceDiagram
participant UC as User Code
participant DC as DurableContext
participant WO as WaitOperation
participant EM as ExecutionManager
participant Backend
UC->>DC: wait(null, Duration.ofMinutes(5))
DC->>WO: execute()
WO->>EM: sendOperationUpdate(WAIT, duration)
EM->>Backend: checkpoint
DC->>WO: get()
WO->>EM: deregisterActiveThread("Root")
Note over EM: No active threads!
EM->>EM: executionExceptionFuture.completeExceptionally(SuspendExecutionException)
EM-->>WO: throw SuspendExecutionException
Note over UC: Execution suspended, returns PENDING
DurableExecutionException (base)
├── SerDesException # Serialization error
├── UnrecoverableDurableExecutionException # Execution cannot be recovered
│ ├── NonDeterministicExecutionException # Replay mismatch
│ └── IllegalDurableOperationException # Illegal operation detected
└── DurableOperationException # Operation-specific error
├── StepException # Step operation base
│ ├── StepFailedException # Step failed after all retries
│ └── StepInterruptedException # Step interrupted (AT_MOST_ONCE)
├── InvokeException # Invoke operation base
│ ├── InvokeFailedException # Invoked function returned error
│ ├── InvokeTimedOutException # Invoke exceeded timeout
│ └── InvokeStoppedException # Invoke stopped before completion
├── CallbackException # Callback operation base
│ ├── CallbackFailedException # External system sent error
│ ├── CallbackTimeoutException # Callback exceeded timeout
│ └── CallbackSubmitterException # Submitter step failed
├── WaitForConditionFailedException # Polling exceeded max attempts or failed
├── ChildContextFailedException # Child context failed (original exception not reconstructable)
├── MapIterationFailedException # Map iteration failed (original exception not reconstructable)
└── ParallelBranchFailedException # Parallel branch failed (original exception not reconstructable)
SuspendExecutionException # Internal: triggers suspension (not user-facing)
| Exception | Trigger | Recovery |
|---|---|---|
| StepFailedException | Step throws after exhausting retries | Catch in handler or let fail |
| StepInterruptedException | AT_MOST_ONCE step interrupted mid-execution | Treat as failure |
| InvokeFailedException | Invoked function returned an error | Catch in handler or let fail |
| InvokeTimedOutException | Invoke exceeded its timeout | Catch in handler or let fail |
| InvokeStoppedException | Invoke stopped before completion | Catch in handler or let fail |
| CallbackFailedException | External system sent an error response | Catch in handler or let fail |
| CallbackTimeoutException | Callback exceeded its timeout | Catch in handler or let fail |
| CallbackSubmitterException | Submitter step failed to submit callback | Catch in handler or let fail |
| WaitForConditionFailedException | waitForCondition exceeded max polling attempts or check function threw | Catch in handler or let fail |
| ChildContextFailedException | Child context failed and original exception not reconstructable | Catch in handler or let fail |
| MapIterationFailedException | Map iteration failed and original exception not reconstructable | Catch in handler or let fail |
| ParallelBranchFailedException | Parallel branch failed and original exception not reconstructable | Catch in handler or let fail |
| NonDeterministicExecutionException | Replay finds different operation than expected | Bug in handler (non-deterministic code) |
| IllegalDurableOperationException | Illegal operation detected | Bug in handler |
| SerDesException | Jackson fails to serialize/deserialize | Fix data model or custom SerDes |
ExecutionManager tracks whether we're replaying completed operations or executing new ones via ExecutionMode:
- REPLAY: Starts in this mode if operations.size() > 1 (has checkpointed operations beyond the initial EXECUTION op)
- EXECUTION: Transitions when getOperationAndUpdateReplayState() encounters:
  - An operation ID not in the checkpoint log (new operation)
  - An operation that is NOT in a terminal state (needs to continue executing)
Operations already in a terminal state (SUCCEEDED, FAILED, CANCELLED, TIMED_OUT, STOPPED) keep the execution in REPLAY mode, since the SDK only returns their cached results.
This is a one-way transition (REPLAY → EXECUTION, never back). DurableLogger checks isReplaying() to suppress duplicate logs during replay.
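The transition rules above can be sketched as a small state holder. This is a simplified illustration, not the ExecutionManager implementation: the class name, the use of plain strings for statuses, and the map-based checkpoint log are assumptions, and the map is assumed to exclude the initial EXECUTION op.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the SDK's ExecutionManager.
class ReplayStateSketch {
    enum Mode { REPLAY, EXECUTION }

    // Terminal statuses as listed in this document.
    static final Set<String> TERMINAL =
            Set.of("SUCCEEDED", "FAILED", "CANCELLED", "TIMED_OUT", "STOPPED");

    private final Map<String, String> checkpointed; // operation ID -> status
    private Mode mode;

    ReplayStateSketch(Map<String, String> checkpointed) {
        this.checkpointed = checkpointed;
        // Assumption: the map excludes the initial EXECUTION op, so "non-empty"
        // plays the role of operations.size() > 1.
        this.mode = checkpointed.isEmpty() ? Mode.EXECUTION : Mode.REPLAY;
    }

    /** Returns the checkpointed status (or null) and leaves REPLAY when replay cannot continue. */
    String lookup(String operationId) {
        String status = checkpointed.get(operationId);
        if (status == null || !TERMINAL.contains(status)) {
            mode = Mode.EXECUTION; // one-way: nothing ever sets REPLAY again
        }
        return status;
    }

    boolean isReplaying() {
        return mode == Mode.REPLAY;
    }

    public static void main(String[] args) {
        var state = new ReplayStateSketch(Map.of("1", "SUCCEEDED", "2", "STARTED"));
        state.lookup("1");
        System.out.println("after op 1, replaying=" + state.isReplaying());
        state.lookup("2");
        System.out.println("after op 2, replaying=" + state.isReplaying());
    }
}
```

A logger wrapper could consult isReplaying() before emitting a line, which is the role the document assigns to DurableLogger.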
DurableLogger uses SLF4J's MDC (Mapped Diagnostic Context) to enrich log entries with execution metadata. MDC is thread-local by design, so context is set once per thread rather than per log call for performance.
MDC Keys:
| Key | Set When | Description |
|---|---|---|
| durableExecutionArn | Logger construction | Execution ARN |
| requestId | Logger construction | Lambda request ID |
| operationId | Step start | Current operation ID |
| operationName | Step start | Step name |
| attempt | Step start | Retry attempt number |
Context Flow:
- DurableLogger constructor sets execution-level MDC (ARN, requestId) on the handler thread
- StepOperation.executeStepLogic() calls durableLogger.setOperationContext() before user code runs
- User code logs via context.getLogger() - MDC values automatically included
- clearOperationContext() called in finally block after step completes
Log Pattern Example (Log4j2):
<PatternLayout pattern="%d %-5level %logger - %msg%notEmpty{ | arn=%X{durableExecutionArn}}%notEmpty{ id=%X{operationId}}%notEmpty{ op=%X{operationName}}%notEmpty{ attempt=%X{attempt}}%n"/>

Output:
12:34:56 INFO c.a.l.d.DurableContext - Processing order | arn=arn:aws:lambda:us-east-1:123:function:test
12:34:56 DEBUG c.a.l.d.DurableContext - Validating items | arn=arn:aws:lambda:us-east-1:123:function:test id=1 op=validate attempt=0
If result > 6MB Lambda limit:
- Checkpoint result to backend
- Return empty response
- Backend stores and returns result
Multiple concurrent operations may checkpoint simultaneously. CheckpointBatcher batches these into single API calls to reduce latency and stay within the 750KB request limit.
The checkpointDelay configuration option (default: 0) controls how long the batcher waits before flushing, allowing more operations to accumulate in a single batch. For functions with many concurrent operations, setting a small delay (e.g., 10ms) can significantly reduce the number of API calls.
StepOperation 1 ──┐
│
StepOperation 2 ──┼──► CheckpointBatcher ──► Backend
│
WaitOperation ────┘
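As a rough illustration of size-bounded batching, the sketch below greedily packs queued payloads into batches that stay under a byte limit. It is an assumption-laden simplification: the class name is hypothetical, payloads are plain strings, and the real CheckpointBatcher also handles tokens, timing (checkpointDelay), and callbacks.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the SDK's CheckpointBatcher.
class BatchingSketch {

    /** Greedily groups payloads, in order, so each batch stays within maxBatchBytes. */
    static List<List<String>> partition(List<String> payloads, int maxBatchBytes) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int currentBytes = 0;
        for (String payload : payloads) {
            int size = payload.getBytes(StandardCharsets.UTF_8).length;
            // Flush the current batch if adding this payload would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxBatchBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(payload);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Tiny limit for demonstration; the document cites a 750KB request limit.
        System.out.println(partition(List.of("aaaa", "bbbb", "cc"), 8));
    }
}
```

Note that in this sketch a single payload larger than the limit still ends up in a batch of its own; how the SDK treats that case is not covered here.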
The callback mechanism avoids a cyclic dependency between ExecutionManager and CheckpointBatcher:
interface CheckpointCallback {
    void onComplete(String newToken, List<Operation> operations);
}

LocalDurableTestRunner is an in-memory test runner that simulates the full execution lifecycle without AWS.
// Default: auto-skip time
runner.runUntilComplete(input); // Instantly completes waits
// Manual control
runner.withSkipTime(false);
runner.run(input); // Returns PENDING at wait
runner.advanceTime(); // Move past wait
runner.run(input); // Continues from wait

// Simulate checkpoint loss (fire-and-forget START lost)
runner.simulateFireAndForgetCheckpointLoss("step-name");
// Reset step to STARTED (simulate crash after START checkpoint)
runner.resetCheckpointToStarted("step-name");

Tests against deployed Lambda:
var runner = CloudDurableTestRunner.create(arn, Input.class, Output.class)
    .withPollInterval(Duration.ofSeconds(2))
    .withTimeout(Duration.ofMinutes(5));
TestResult<Output> result = runner.run(input);

DurableExecutionClient Interface - Backend abstraction for testing or alternative implementations:
public interface DurableExecutionClient {
    CheckpointDurableExecutionResponse checkpoint(
        String arn, String token, List<OperationUpdate> updates);
    GetDurableExecutionStateResponse getExecutionState(String arn, String marker);
}

Implementations:
- LambdaDurableFunctionsClient - Production (wraps AWS SDK)
- LocalMemoryExecutionClient - Testing (in-memory)
For production customization, use DurableConfig.builder().withLambdaClientBuilder(lambdaClientBuilder).
For testing, use DurableConfig.builder().withDurableExecutionClient(localMemoryClient).
Custom SerDes Interface:
public interface SerDes {
    String serialize(Object value);
    <T> T deserialize(String data, Class<T> type);
    <T> T deserialize(String data, TypeToken<T> typeToken);
}

TypeToken and Type Erasure:
Java's type erasure removes generic type parameters at runtime (List<User> becomes List). This is problematic for deserialization because Jackson needs the full type to reconstruct objects correctly.
TypeToken<T> solves this by capturing generic types at compile time. Creating new TypeToken<List<User>>() {} produces an anonymous subclass whose superclass type parameter is preserved in bytecode and accessible via reflection (getGenericSuperclass()).
The SerDes interface provides both Class<T> and TypeToken<T> overloads:
- Use Class<T> for simple types: String.class, User.class
- Use TypeToken<T> for parameterized types: new TypeToken<List<User>>() {}
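The anonymous-subclass trick described above can be reproduced with plain reflection. The sketch below uses a hypothetical Capture<T> class as a stand-in for TypeToken<T>; the SDK's actual class may expose a different API.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

// Hypothetical illustration only; Capture<T> is a stand-in, not the SDK's TypeToken<T>.
class TypeCaptureSketch {

    abstract static class Capture<T> {
        private final Type type;

        protected Capture() {
            // For an anonymous subclass "new Capture<List<String>>() {}", the generic
            // superclass is Capture<List<String>>, which survives erasure in class metadata.
            Type superclass = getClass().getGenericSuperclass();
            if (!(superclass instanceof ParameterizedType)) {
                throw new IllegalStateException("Use an anonymous subclass: new Capture<Foo>() {}");
            }
            this.type = ((ParameterizedType) superclass).getActualTypeArguments()[0];
        }

        Type getType() {
            return type;
        }
    }

    public static void main(String[] args) {
        Type captured = new Capture<List<String>>() {}.getType();
        System.out.println(captured.getTypeName());
    }
}
```

A SerDes implementation can hand the captured Type to its JSON library so that element types such as the String in List<String> are known during deserialization.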
The SDK uses a threaded execution model where the handler runs on a user-configured executor, racing against an internal suspension future. This enables immediate suspension when no thread can make forward progress (waits, retries, callbacks), without waiting for the handler to complete naturally.
Thread types. The SDK distinguishes two thread types via ThreadType:
| ThreadType | Identifier (threadId) | Created By | Purpose |
|---|---|---|---|
| CONTEXT | null for root context; the operation ID for child contexts (e.g. "hash(1)") | DurableExecutor (root), ChildContextOperation (child) | Runs the handler function body or a child context function body. Orchestrates operations. |
| STEP | The step's operation ID (e.g. "hash(2)") | StepOperation | Runs user-provided step code (Function<StepContext, T>). |
Each thread has a ThreadContext record (threadId + threadType) stored in a ThreadLocal so operations can identify which context they belong to.
Active thread set. ExecutionManager maintains a Set<String> activeThreads. A thread is "active" when it can make forward progress. When the set becomes empty, the execution suspends.
Completion futures. Each operation holds a CompletableFuture<Void> completionFuture used to coordinate between the thread that starts an operation and the thread that waits for its result.
DurableExecutor.execute() runs the handler on the user executor and races it against an internal exception future:
// DurableExecutor
executionManager.registerActiveThread(null); // register root context thread
var handlerFuture = CompletableFuture.supplyAsync(() -> {
    try (var context = DurableContext.createRootContext(...)) {
        return handler.apply(userInput, context);
    }
}, config.getExecutorService());
executionManager.runUntilCompleteOrSuspend(handlerFuture)
    .handle((result, ex) -> { ... })
    .join();

runUntilCompleteOrSuspend uses CompletableFuture.anyOf(handlerFuture, executionExceptionFuture):
- If handlerFuture completes first → SUCCESS (or FAILED if the handler threw).
- If executionExceptionFuture completes first → PENDING (suspension) or unrecoverable error.
See ADR-001: Threaded Handler Execution.
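The race between the handler and the suspension signal can be demonstrated with plain CompletableFuture calls. This is a minimal sketch under assumptions: SuspendSignal is a hypothetical stand-in for SuspendExecutionException, outcomes are plain strings, and the suspension future is assumed to only ever complete exceptionally.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Hypothetical illustration only; not the SDK's DurableExecutor or ExecutionManager.
class RaceSketch {

    /** Stand-in for SuspendExecutionException. */
    static class SuspendSignal extends RuntimeException {}

    /** Waits for whichever future finishes first and maps the outcome to a status string. */
    static String runUntilCompleteOrSuspend(CompletableFuture<?> handlerFuture,
                                            CompletableFuture<?> suspensionFuture) {
        try {
            // anyOf completes as soon as either future completes, normally or exceptionally.
            CompletableFuture.anyOf(handlerFuture, suspensionFuture).join();
            return "SUCCESS";
        } catch (CompletionException e) {
            // join() wraps the original failure; inspect the cause to tell suspension from errors.
            return (e.getCause() instanceof SuspendSignal) ? "PENDING" : "FAILED";
        }
    }

    public static void main(String[] args) {
        var blockedHandler = new CompletableFuture<String>(); // never completes in this demo
        var suspension = new CompletableFuture<Void>();
        suspension.completeExceptionally(new SuspendSignal());
        System.out.println(runUntilCompleteOrSuspend(blockedHandler, suspension));
    }
}
```

The point of the design is visible in main: the caller gets an answer even though the handler future never completes.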
Suspension is triggered exclusively by ExecutionManager.deregisterActiveThread():
// ExecutionManager.deregisterActiveThread()
public void deregisterActiveThread(String threadId) {
    if (executionExceptionFuture.isDone()) return; // already suspended
    activeThreads.remove(threadId);
    if (activeThreads.isEmpty()) {
        suspendExecution(); // completes executionExceptionFuture with SuspendExecutionException
    }
}

A thread deregisters when it cannot make forward progress, typically when it calls waitForOperationCompletion() on an operation that hasn't completed yet. This is a unified mechanism: the SDK doesn't need operation-specific suspension logic.
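The active-thread bookkeeping can be modeled in a few lines. The sketch below is a simplified, hypothetical model rather than the ExecutionManager code: it uses the string "root" instead of a null thread ID, and it completes the future normally where the SDK completes it with SuspendExecutionException.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration only; not the SDK's ExecutionManager.
class ActiveThreadsSketch {
    private final Set<String> activeThreads = new HashSet<>();

    /** Completes once no registered thread can make progress. */
    final CompletableFuture<Void> suspended = new CompletableFuture<>();

    synchronized void register(String threadId) {
        activeThreads.add(threadId);
    }

    synchronized void deregister(String threadId) {
        if (suspended.isDone()) {
            return; // already suspended
        }
        activeThreads.remove(threadId);
        if (activeThreads.isEmpty()) {
            suspended.complete(null); // last active thread left: suspend
        }
    }

    public static void main(String[] args) {
        var tracker = new ActiveThreadsSketch();
        tracker.register("root");
        tracker.register("step-1");
        tracker.deregister("root");   // step-1 is still active
        System.out.println("suspended after root leaves: " + tracker.suspended.isDone());
        tracker.deregister("step-1"); // nothing left to run
        System.out.println("suspended after step leaves: " + tracker.suspended.isDone());
    }
}
```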
waitForOperationCompletion() in BaseDurableOperation is the core coordination primitive. It is called by every operation's get() method (step, wait, invoke, callback, child context):
// BaseDurableOperation.waitForOperationCompletion()
protected Operation waitForOperationCompletion() {
    var threadContext = getCurrentThreadContext();
    synchronized (completionFuture) {
        if (!isOperationCompleted()) {
            // Attach a callback: when the operation completes, re-register this thread
            completionFuture.thenRun(() -> registerActiveThread(threadContext.threadId()));
            // Deregister; may trigger suspension if no other threads are active
            executionManager.deregisterActiveThread(threadContext.threadId());
        }
    }
    completionFuture.join(); // block until complete (no-op if already done)
    return getOperation();
}

The synchronized(completionFuture) block prevents a race between checking isOperationCompleted() and attaching the thenRun callback. Without it, the future could complete between the check and the callback attachment, causing the thread to deregister without ever being re-registered.
The re-registration callback (thenRun) runs synchronously on the thread that completes the future (typically the checkpoint response handler). This guarantees the context thread is re-registered before the completing thread (step or child context) deregisters itself, preventing a premature suspension.
When CheckpointBatcher receives a checkpoint response, it calls ExecutionManager.onCheckpointComplete(), which notifies each registered operation:
// BaseDurableOperation.onCheckpointComplete()
public void onCheckpointComplete(Operation operation) {
    if (ExecutionManager.isTerminalStatus(operation.status())) {
        synchronized (completionFuture) {
            completionFuture.complete(null); // unblocks waitForOperationCompletion()
        }
    }
}

Completing the future triggers the thenRun callback (re-registers the waiting context thread), then unblocks the join() call.
Steps run user code on a separate thread via the user executor:
// StepOperation.executeStepLogic()
registerActiveThread(getOperationId()); // register BEFORE submitting to executor
CompletableFuture.runAsync(() -> {
    try (StepContext stepContext = getContext().createStepContext(...)) {
        T result = function.apply(stepContext);
        handleStepSucceeded(result); // checkpoint SUCCEED synchronously
    } catch (Throwable e) {
        handleStepFailure(e, attempt); // checkpoint RETRY or FAIL
    }
}, userExecutor);

Key details:
- registerActiveThread is called on the parent thread before runAsync, preventing a race where the parent deregisters (triggering suspension) before the step thread starts.
- The step thread is implicitly deregistered when it finishes; it never calls deregisterActiveThread directly. Instead, the step thread's work is done after checkpointing, and the checkpoint response completes the completionFuture, which re-registers the waiting context thread.
- For retries, the step sends a RETRY checkpoint and then polls for the READY status before re-executing. If no other threads are active during the retry delay, the execution suspends.
Waits checkpoint a START action that carries the wait duration (via WaitOptions), then poll for completion:
// WaitOperation.start()
sendOperationUpdate(OperationUpdate.builder()
        .action(OperationAction.START)
        .waitOptions(WaitOptions.builder().waitSeconds((int) duration.toSeconds()).build()));
pollForOperationUpdates(remainingWaitTime);
The wait itself doesn't deregister any thread. Suspension happens when the context thread calls wait() (synchronous), which calls get(), which calls waitForOperationCompletion(), which deregisters the context thread. If no other threads are active, the execution suspends and the Lambda returns PENDING. On re-invocation, the wait replays: if the wait period has elapsed, markAlreadyCompleted() is called; otherwise, polling resumes with the remaining duration.
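The replay decision for a wait can be sketched as a small pure function. This assumes the scheduled end time of the wait is available on replay; WaitReplaySketch, remaining, and replayDecision are hypothetical helpers, and the returned strings only label which branch would be taken.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the replay branch for a wait operation.
final class WaitReplaySketch {
    // Time left until the scheduled end, clamped at zero.
    static Duration remaining(Instant scheduledEnd, Instant now) {
        Duration left = Duration.between(now, scheduledEnd);
        return left.isNegative() ? Duration.ZERO : left;
    }

    // Names the branch taken on replay; a real implementation would act on it.
    static String replayDecision(Instant scheduledEnd, Instant now) {
        Duration left = remaining(scheduledEnd, now);
        if (left.isZero()) {
            return "markAlreadyCompleted"; // wait period has elapsed
        }
        return "poll for " + left.toSeconds() + "s"; // resume with the remainder
    }

    public static void main(String[] args) {
        Instant end = Instant.parse("2024-01-01T00:00:30Z");
        System.out.println(replayDecision(end, end.minusSeconds(20)));
        System.out.println(replayDecision(end, end.plusSeconds(15)));
    }
}
```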
Invokes checkpoint a START action with the target function name and payload, then poll for the result. The threading model is identical to WaitOperation — the invoke itself doesn't create a new thread. The context thread deregisters when it calls get() on the invoke future.
Callbacks checkpoint a START action to obtain a callbackId, then poll for an external system to complete the callback. Like waits and invokes, the context thread deregisters when it calls get(). The callback can complete via an external API call (success, failure, or heartbeat timeout).
Child contexts run a user function in a separate thread with its own DurableContext and operation counter:
// ChildContextOperation.executeChildContext()
var contextId = getOperationId();
// Register on the PARENT thread to prevent a race with parent deregistration
registerActiveThread(contextId);
CompletableFuture.runAsync(() -> {
    try (var childContext = getContext().createChildContext(contextId, getName())) {
        T result = function.apply(childContext);
        handleChildContextSuccess(result);
    } catch (Throwable e) {
        handleChildContextFailure(e);
    }
}, userExecutor);
Key details:
- The child context thread runs as ThreadType.CONTEXT (not STEP), so it can itself create steps, waits, invokes, callbacks, and nested child contexts.
- Operations within the child context use the child's contextId as their parentId, and operation IDs are prefixed with the context path (e.g. "hash(1)" for first-level, "hash(hash(1)-2)" for second-level).
- On replay, if the child context completed with a large result (> 256KB), the SDK re-executes the child context to reconstruct the result in memory rather than storing it in the checkpoint payload.
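The ID notation in the second bullet can be reproduced with a purely symbolic sketch. It assumes each context keeps its own counter and derives IDs from its context ID plus that counter; the hash is left as the literal text "hash(...)" because the actual hash function is not described here. OperationIdSketch is a hypothetical name.

```java
// Hypothetical, symbolic sketch of per-context operation IDs.
final class OperationIdSketch {
    private final String contextId; // null for the root context
    private int counter;            // each context counts its own operations

    OperationIdSketch(String contextId) {
        this.contextId = contextId;
    }

    // Produces the notation used in the text; "hash" is not computed.
    String nextOperationId() {
        counter++;
        String raw = (contextId == null)
                ? String.valueOf(counter)
                : contextId + "-" + counter;
        return "hash(" + raw + ")";
    }

    public static void main(String[] args) {
        OperationIdSketch root = new OperationIdSketch(null);
        String childContextId = root.nextOperationId();
        OperationIdSketch child = new OperationIdSketch(childContextId);
        child.nextOperationId(); // first operation inside the child
        System.out.println(childContextId);
        System.out.println(child.nextOperationId());
    }
}
```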
When a wait, retry delay, or invoke would normally suspend execution, but other active threads prevent suspension (because activeThreads is not empty), the SDK stays alive and polls the backend for updates. This is the "in-process completion" path — the operation polls via CheckpointManager.pollForUpdate() on the internal executor until the backend reports the operation is ready. This avoids unnecessary Lambda re-invocations when the execution can simply wait in-process.
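A minimal sketch of in-process polling, assuming a scheduled executor plays the role of the internal executor and a BooleanSupplier stands in for the backend status check. InProcessPollingSketch and pollUntilReady are hypothetical; the real CheckpointManager.pollForUpdate() may be structured differently.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: poll until the backend reports the operation ready.
final class InProcessPollingSketch {
    // Completes with the number of polls it took to see "ready".
    static CompletableFuture<Integer> pollUntilReady(
            ScheduledExecutorService executor,
            BooleanSupplier backendReportsReady,
            long intervalMillis) {
        CompletableFuture<Integer> done = new CompletableFuture<>();
        AtomicInteger polls = new AtomicInteger();
        ScheduledFuture<?> task = executor.scheduleWithFixedDelay(() -> {
            int n = polls.incrementAndGet();
            if (backendReportsReady.getAsBoolean()) {
                done.complete(n); // later completions are ignored
            }
        }, 0, intervalMillis, TimeUnit.MILLISECONDS);
        // Stop polling once the operation is ready.
        done.whenComplete((result, error) -> task.cancel(false));
        return done;
    }

    // Fake backend that reports ready on the third poll.
    static int demoPollsUntilReady() {
        ScheduledExecutorService internalExecutor =
                Executors.newSingleThreadScheduledExecutor();
        try {
            AtomicInteger backendCalls = new AtomicInteger();
            return pollUntilReady(
                    internalExecutor,
                    () -> backendCalls.incrementAndGet() >= 3,
                    5).join();
        } finally {
            internalExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("polls until ready: " + demoPollsUntilReady());
    }
}
```

The waiting operation stays in-process for the whole loop, so no re-invocation is needed while other threads keep the execution alive.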
When a context thread calls ctx.step(...), the following coordination occurs:
| Seq | Context Thread | Step Thread | System Thread (CheckpointManager) |
|---|---|---|---|
| 1 | Create StepOperation + completionFuture. Call execute(). execute() calls start() which registers step thread and submits to user executor. Checkpoint START (sync or async depending on semantics). | - | (idle) |
| 2 | step() calls get() → waitForOperationCompletion(). Attach thenRun(re-register) to completionFuture. Deregister context thread. Block on join(). | User code begins executing. Execute function.apply(stepContext). | (idle) |
| 3 | (blocked) | User code completes. Call handleStepSucceeded(result) → sendOperationUpdate(SUCCEED) (synchronous; blocks until checkpoint response). | Process checkpoint API call. On terminal response, call onCheckpointComplete() → completionFuture.complete(null). thenRun fires: re-register context thread. |
| 4 | join() returns. Retrieve result from operation. | Call deregisterActiveThread to deregister Step thread. Step thread ends. | (idle) |
Alternative (fast step): If the step completes and checkpoints before the context thread calls get(), the completionFuture is already done when waitForOperationCompletion() runs. The context thread skips deregistration entirely and returns the result immediately.
| Seq | Context Thread | System Thread |
|---|---|---|
| 1 | Create WaitOperation + completionFuture. Call execute(). execute() calls start() → checkpoint WAIT with duration → pollForOperationUpdates(remainingWaitTime). | Begin polling backend. |
| 2 | wait() calls get() → waitForOperationCompletion(). Attach thenRun(re-register). Deregister context thread. | (polling) |
| 3 | activeThreads is empty → suspendExecution() → executionExceptionFuture.completeExceptionally(SuspendExecutionException). | - |
| 4 | runUntilCompleteOrSuspend resolves with SuspendExecutionException → return PENDING. | - |
On re-invocation, the wait replays. If the scheduled end time has passed, markAlreadyCompleted() fires and the context thread continues without deregistering.
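The suspension path in the table can be sketched with two futures. This assumes that suspension is signalled by exceptionally completing a shared future and that the top-level runner waits on whichever of the two futures finishes first. SuspensionSketch is hypothetical, its SuspendExecutionException is a local stand-in class, and "SUCCEEDED" is a placeholder for the normal-completion result.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of "last thread out suspends the execution".
final class SuspensionSketch {
    static final class SuspendExecutionException extends RuntimeException {}

    private final Set<String> activeThreads = ConcurrentHashMap.newKeySet();
    private final CompletableFuture<Object> executionExceptionFuture =
            new CompletableFuture<>();

    void register(String id) { activeThreads.add(id); }

    // Deregistering the last active thread signals suspension.
    void deregister(String id) {
        activeThreads.remove(id);
        if (activeThreads.isEmpty()) {
            executionExceptionFuture.completeExceptionally(
                    new SuspendExecutionException());
        }
    }

    // Resolves with whichever finishes first: user code or suspension.
    String runUntilCompleteOrSuspend(CompletableFuture<?> userHandler) {
        try {
            CompletableFuture.anyOf(userHandler, executionExceptionFuture).join();
            return "SUCCEEDED"; // placeholder status for normal completion
        } catch (CompletionException e) {
            if (e.getCause() instanceof SuspendExecutionException) {
                return "PENDING";
            }
            throw e;
        }
    }

    static String demoSuspend() {
        SuspensionSketch exec = new SuspensionSketch();
        exec.register("ctx");
        CompletableFuture<String> userHandler = new CompletableFuture<>();
        exec.deregister("ctx"); // context thread blocks on a wait
        return exec.runUntilCompleteOrSuspend(userHandler);
    }

    static String demoComplete() {
        SuspensionSketch exec = new SuspensionSketch();
        exec.register("ctx");
        return exec.runUntilCompleteOrSuspend(
                CompletableFuture.completedFuture("done"));
    }

    public static void main(String[] args) {
        System.out.println(demoSuspend());
        System.out.println(demoComplete());
    }
}
```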
var stepFuture = ctx.stepAsync("fetch", String.class, stepCtx -> callApi());
ctx.wait("delay", Duration.ofSeconds(30));
var result = stepFuture.get();
| Seq | Context Thread | Step Thread | System Thread |
|---|---|---|---|
| 1 | Create StepOperation, register step thread, submit to executor. | - | - |
| 2 | Create WaitOperation, checkpoint WAIT, start polling. | User code begins. | Begin polling for wait. |
| 3 | wait() calls get() → deregister context thread. | (running) | (polling) |
| 4 | (blocked, but step thread is still active, so no suspension) | Complete → checkpoint SUCCEED. | Process step checkpoint. |
| 5 | (blocked) | - | Wait poll returns SUCCEEDED → completionFuture.complete(null) for wait. Context thread re-registered. |
| 6 | wait() returns. stepFuture.get() → result already available. | - | - |
If the step completes before the wait duration has elapsed, no active threads remain and the execution is suspended. If the step is still running when the wait elapses, the step thread keeps the execution alive (preventing suspension) while the wait polls to completion.