feat: configurable WebSocket invocation payload limit across engine + SDKs#1593
feat: configurable WebSocket invocation payload limit across engine + SDKs#1593ytallo wants to merge 1 commit into
Conversation
… SDKs
Replace the silent 1 MiB cliff with an engine-enforced 16 MiB default that's
configurable end-to-end. Adds a specific error code, producer-side guards,
and aligns Python/Node/Rust SDKs on a single ceiling.
Engine
- WorkerManagerConfig.max_message_size (default 16 MiB), applied via
WebSocketUpgrade::max_message_size/max_frame_size on both ws_handler and
otel_ws_handler.
- New error code invocation_failed_payload_too_large emitted from
cleanup_worker when a worker disconnects due to a WS Capacity error.
invocation_stopped continues to cover clean disconnects, shutdown, and EOF.
- WS recv errors logged at WARN with peer/worker_id/error.
- DisconnectReason enum + halt_invocation_with_reason path.
SDKs (Python / Node / Rust)
- max_message_size / maxMessageSize init option, default 16 MiB.
- IIIPayloadTooLarge / IIIError::PayloadTooLarge raised producer-side
before the WS send when the envelope exceeds the limit.
- Cross-language error string aligned: "Payload {n} bytes exceeds invocation
limit {limit} bytes. For binary blobs use channels: <docs URL>".
- Python ws.connect now passes max_size, replacing the silent 1 MiB default
inherited from the websockets library.
Tests (TDD: red first, then green)
- engine/tests/ws_payload_limit_e2e.rs (4 e2e: oversize disconnect emits
payload_too_large; clean close still emits invocation_stopped; configured
limit; at-limit success).
- Unit tests for halt_invocation_with_reason and WorkerManagerConfig
default + override.
- SDK unit tests verify option plumbing, default value, producer-guard
raise, at-limit success. Each SDK ships an integration test gated on
III_URL that validates the engine contract end-to-end.
Docs
- Rewrote use-channels rule-of-thumb with the actual ceiling, base64+JSON
inflation math, and links to the new error-codes reference.
- New api-reference/error-codes.mdx enumerates engine error bodies grouped
by area; codes pulled from engine/src/.
- Init-option rows added to sdk-python / sdk-node / sdk-rust references.
- changelog/0-11-0/payload-size-limit.mdx documents the new config, error
code, SDK options, and the Python 1 MiB → 16 MiB behavior change.
Verification
- engine: cargo test --tests --no-fail-fast → 1805 passed.
- python: uv run pytest tests/test_payload_limits.py → 10 passed, 1 skipped.
- node: pnpm vitest run tests/payload-limits.test.ts → 4 passed, 1 skipped.
- rust: cargo test --test payload_limits → 4 passed, 1 ignored.
- docs: npx mintlify validate → success.
- Integration tests need III_URL pointing at a live engine; not blocking
this commit.
|
Preview deployment for your docs. Learn more about Mintlify Previews.
💡 Tip: Enable Workflows to automatically generate PRs for you. |
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
📝 WalkthroughWalkthroughThis PR implements configurable WebSocket invocation payload size limits across the engine and SDKs (Node, Python, Rust). The engine now classifies disconnects based on payload-too-large events and halts in-flight invocations with specific error codes; each SDK adds configurable max_message_size initialization options and producer-side guards that refuse to send oversized payloads before WebSocket round-trips, paired with typed error classes. ChangesEnd-to-End Payload Size Limits
Sequence Diagram(s)(Skipped — changes introduce validation gates and error classification rather than new sequential component interactions; the control flow enhancement is self-evident.) Estimated code review effort🎯 3 (Moderate) | ⏱️ ~35 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
sdk/packages/python/iii/src/iii/iii_constants.py (1)
76-90:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winValidate
max_message_sizewhenInitOptionsis constructed.
0or negative values currently create a nonsensical ceiling and can make every non-empty envelope fail much later during send/connect. A small__post_init__guard would fail fast on bad config.Suggested fix
`@dataclass` class InitOptions: @@ headers: dict[str, str] | None = None telemetry: TelemetryOptions | None = None max_message_size: int = DEFAULT_MAX_MESSAGE_SIZE + + def __post_init__(self) -> None: + if self.max_message_size <= 0: + raise ValueError("max_message_size must be > 0")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sdk/packages/python/iii/src/iii/iii_constants.py` around lines 76 - 90, Add validation for max_message_size in the InitOptions constructor by implementing a __post_init__ on the InitOptions dataclass (or equivalent initializer) that checks the max_message_size field; if max_message_size is None or <= 0, raise a ValueError with a clear message referencing max_message_size and the DEFAULT_MAX_MESSAGE_SIZE constant, ensuring InitOptions (and related fields like worker_name, reconnection_config) fail fast on invalid sizes instead of allowing downstream send/connect errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@engine/src/engine/mod.rs`:
- Around line 1444-1455: In handle_otel, when matching Some(Err(_)) mirror the
new WARN path used for the main worker socket: call classify_recv_error on the
error, emit a tracing::warn including peer, worker.id, the error and the reason
(same fields as the other branch), then call
worker.set_disconnect_reason(reason).await and break; this ensures oversize
telemetry frames produce the same peer/error logging and disconnect reason as
the main worker recv handling.
In `@engine/tests/ws_payload_limit_e2e.rs`:
- Around line 283-292: The test currently guesses 256 bytes of slack when
constructing big_blob, which can break if the JSON envelope changes; instead
build the JSON envelope (result_msg) with a provisional blob, measure its
serialized length via to_string().len(), and reduce/truncate big_blob until the
serialized message length is <= the configured WS frame limit (use the same
limit constant used by production). Update the test to loop or compute the
correct blob size by measuring result_msg.to_string().len() and trimming
big_blob before calling ws.send(WsMessage::Text(result_msg.to_string().into()))
so the frame is precisely under the limit.
- Around line 59-60: Replace the fixed
tokio::time::sleep(Duration::from_millis(...)).await waits with state-based
waiting: instead of sleeping before returning (port, engine), poll for a
concrete readiness condition (e.g., attempt a TCP connect to the listener port
in a retry loop with a short backoff or call the engine’s registration/readiness
API) until it succeeds or a timeout is reached; locate the sleep calls
(tokio::time::sleep and Duration::from_millis) that use the local variables port
and engine and replace them with the retry/poll logic, and apply the same change
to the other occurrence around the later lines (the second sleep at 107-108).
Ensure the loop returns an error or panics on timeout to preserve test
determinism.
In `@sdk/packages/python/iii/src/iii/iii.py`:
- Around line 360-371: The _assert_within_limit helper is only used from
trigger_async but the handler result path can still send an oversized
InvocationResultMessage via _send; update the response path to call
_assert_within_limit before sending handler results. Specifically, in the code
path that constructs/sends InvocationResultMessage (referencing trigger_async,
InvocationResultMessage and _send), serialize the message payload the same way
(_to_dict -> json.dumps -> .encode("utf-8")) and call _assert_within_limit (or
inline the same check using self._options.max_message_size) and raise
IIIPayloadTooLarge if it exceeds the limit, then proceed to call _send only for
messages that pass the check.
In `@sdk/packages/python/iii/tests/test_payload_limits.py`:
- Around line 200-223: The test
test_oversize_invocation_returns_payload_too_large_code currently only creates a
single III client (client = III(...)) and never registers a real callee for
"noop", so on a fresh engine it can fail with function_not_found instead of
exercising the payload-too-large path; fix it by provisioning a separate
worker/registration that actually owns "noop" before calling client.trigger:
start a second III instance (or worker helper) with default InitOptions (no
increased max_message_size), register a noop handler for function_id "noop" and
ensure it is connected/registered (wait until connected) so the engine routes
the invocation to that callee, then run the trigger and finally clean up the
worker registration/connection in the test teardown or finally block.
- Around line 138-163: The test test_trigger_below_limit_does_not_raise
currently replaces client._send with fake_send but never asserts it was invoked;
modify the test to record and assert the call to client._send (e.g., replace
fake_send with an AsyncMock or a coroutine that sets an asyncio.Event/flag)
before returning, run the coroutine as before with
asyncio.run_coroutine_threadsafe(call(), client._loop).result(...), then assert
the AsyncMock was called or the Event/flag is set to confirm the send path was
exercised (referencing client._send, fake_send, and the inner call coroutine).
---
Outside diff comments:
In `@sdk/packages/python/iii/src/iii/iii_constants.py`:
- Around line 76-90: Add validation for max_message_size in the InitOptions
constructor by implementing a __post_init__ on the InitOptions dataclass (or
equivalent initializer) that checks the max_message_size field; if
max_message_size is None or <= 0, raise a ValueError with a clear message
referencing max_message_size and the DEFAULT_MAX_MESSAGE_SIZE constant, ensuring
InitOptions (and related fields like worker_name, reconnection_config) fail fast
on invalid sizes instead of allowing downstream send/connect errors.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 8d8ac356-e8ed-4d9f-9368-acc4d371911e
⛔ Files ignored due to path filters (1)
sdk/packages/python/iii/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (29)
docs/api-reference/error-codes.mdxdocs/api-reference/sdk-node.mdxdocs/api-reference/sdk-python.mdxdocs/api-reference/sdk-rust.mdxdocs/changelog/0-11-0/payload-size-limit.mdxdocs/docs.jsondocs/how-to/use-channels.mdxengine/config.yamlengine/src/engine/mod.rsengine/src/invocation/mod.rsengine/src/worker_connections/mod.rsengine/src/workers/worker/mod.rsengine/tests/rbac_infrastructure_e2e.rsengine/tests/ws_payload_limit_e2e.rssdk/packages/node/iii/src/errors.tssdk/packages/node/iii/src/iii.tssdk/packages/node/iii/src/index.tssdk/packages/node/iii/tests/payload-limits.test.tssdk/packages/python/iii/pyproject.tomlsdk/packages/python/iii/src/iii/__init__.pysdk/packages/python/iii/src/iii/errors.pysdk/packages/python/iii/src/iii/iii.pysdk/packages/python/iii/src/iii/iii_constants.pysdk/packages/python/iii/tests/test_payload_limits.pysdk/packages/rust/iii/src/channels.rssdk/packages/rust/iii/src/error.rssdk/packages/rust/iii/src/iii.rssdk/packages/rust/iii/src/lib.rssdk/packages/rust/iii/tests/payload_limits.rs
| Some(Err(err)) => { | ||
| let reason = classify_recv_error(&err); | ||
| tracing::warn!( | ||
| peer = %peer, | ||
| worker_id = %worker.id, | ||
| error = %err, | ||
| reason = ?reason, | ||
| "Worker WS recv error" | ||
| ); | ||
| worker.set_disconnect_reason(reason).await; | ||
| break; | ||
| } |
There was a problem hiding this comment.
Mirror the new WARN path in handle_otel.
This adds recv-error logging for the main worker socket, but /otel still treats Some(Err(_)) as a silent break. Oversize telemetry frames will keep disconnecting without the peer/error context this PR now emits for normal worker sockets.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@engine/src/engine/mod.rs` around lines 1444 - 1455, In handle_otel, when
matching Some(Err(_)) mirror the new WARN path used for the main worker socket:
call classify_recv_error on the error, emit a tracing::warn including peer,
worker.id, the error and the reason (same fields as the other branch), then call
worker.set_disconnect_reason(reason).await and break; this ensures oversize
telemetry frames produce the same peer/error logging and disconnect reason as
the main worker recv handling.
| tokio::time::sleep(Duration::from_millis(150)).await; | ||
| (port, engine) |
There was a problem hiding this comment.
Replace the fixed sleeps with state-based waiting.
These sleeps are the only synchronization before the helper starts connecting and invoking. On slower CI, that can race listener startup or function registration and make the suite nondeterministic. Waiting for a concrete signal/retry condition here would make the E2E contract much more stable.
Also applies to: 107-108
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@engine/tests/ws_payload_limit_e2e.rs` around lines 59 - 60, Replace the fixed
tokio::time::sleep(Duration::from_millis(...)).await waits with state-based
waiting: instead of sleeping before returning (port, engine), poll for a
concrete readiness condition (e.g., attempt a TCP connect to the listener port
in a retry loop with a short backoff or call the engine’s registration/readiness
API) until it succeeds or a timeout is reached; locate the sleep calls
(tokio::time::sleep and Duration::from_millis) that use the local variables port
and engine and replace them with the retry/poll logic, and apply the same change
to the other occurrence around the later lines (the second sleep at 107-108).
Ensure the loop returns an error or panics on timeout to preserve test
determinism.
| // Build a payload that sits just under the limit. Reserve ~256 bytes | ||
| // for the JSON envelope. | ||
| let big_blob = "z".repeat(1024 * 1024 - 256); | ||
| let result_msg = json!({ | ||
| "type": "invocationresult", | ||
| "invocation_id": invocation_id.to_string(), | ||
| "function_id": "echo", | ||
| "result": { "blob": big_blob }, | ||
| }); | ||
| ws.send(WsMessage::Text(result_msg.to_string().into())) |
There was a problem hiding this comment.
Measure the serialized frame instead of guessing 256 bytes of slack.
This boundary test is tied to an estimate, not the actual on-wire size. If the envelope grows by even a couple of fields, the “at limit” case can start failing spuriously while the production code is still correct. Build the JSON first, measure to_string().len(), and trim the blob until it is within the configured limit.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@engine/tests/ws_payload_limit_e2e.rs` around lines 283 - 292, The test
currently guesses 256 bytes of slack when constructing big_blob, which can break
if the JSON envelope changes; instead build the JSON envelope (result_msg) with
a provisional blob, measure its serialized length via to_string().len(), and
reduce/truncate big_blob until the serialized message length is <= the
configured WS frame limit (use the same limit constant used by production).
Update the test to loop or compute the correct blob size by measuring
result_msg.to_string().len() and trimming big_blob before calling
ws.send(WsMessage::Text(result_msg.to_string().into())) so the frame is
precisely under the limit.
| def _assert_within_limit(self, msg: Any) -> None: | ||
| """Reject oversize invocation envelopes before they reach the WS. | ||
|
|
||
| Raises IIIPayloadTooLarge if the serialized message exceeds | ||
| ``InitOptions.max_message_size``. Pre-flight rejection prevents one | ||
| oversize message from tearing the WS connection and halting every | ||
| in-flight invocation on the worker. | ||
| """ | ||
| limit = self._options.max_message_size | ||
| encoded = json.dumps(self._to_dict(msg)).encode("utf-8") | ||
| if len(encoded) > limit: | ||
| raise IIIPayloadTooLarge(payload_bytes=len(encoded), limit_bytes=limit) |
There was a problem hiding this comment.
Guard oversized handler results with the same preflight check.
This helper is only used from trigger_async(). A local handler that returns an oversize result still sends InvocationResultMessage through _send() unguarded, which can trip the engine-side WS limit and disconnect the worker. That leaves the Python SDK exposed to the same connection-tearing failure mode this change is trying to eliminate.
One way to wire the same protection into the response path
- await self._send(
- InvocationResultMessage(
- invocation_id=invocation_id,
- function_id=path,
- result=result,
- traceparent=response_traceparent,
- )
- )
+ result_msg = InvocationResultMessage(
+ invocation_id=invocation_id,
+ function_id=path,
+ result=result,
+ traceparent=response_traceparent,
+ )
+ try:
+ self._assert_within_limit(result_msg)
+ except IIIPayloadTooLarge as exc:
+ await self._send(
+ InvocationResultMessage(
+ invocation_id=invocation_id,
+ function_id=path,
+ error={
+ "code": "invocation_failed_payload_too_large",
+ "message": str(exc),
+ },
+ traceparent=response_traceparent,
+ )
+ )
+ return
+ await self._send(result_msg)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sdk/packages/python/iii/src/iii/iii.py` around lines 360 - 371, The
_assert_within_limit helper is only used from trigger_async but the handler
result path can still send an oversized InvocationResultMessage via _send;
update the response path to call _assert_within_limit before sending handler
results. Specifically, in the code path that constructs/sends
InvocationResultMessage (referencing trigger_async, InvocationResultMessage and
_send), serialize the message payload the same way (_to_dict -> json.dumps ->
.encode("utf-8")) and call _assert_within_limit (or inline the same check using
self._options.max_message_size) and raise IIIPayloadTooLarge if it exceeds the
limit, then proceed to call _send only for messages that pass the check.
Summary
Lands a single, configurable WebSocket payload ceiling end-to-end (default 16 MiB), replacing inconsistent per-language behavior with one engine-enforced limit:
iii-worker-manager.max_message_size, applied to bothWebSocketUpgrade::max_message_sizeandmax_frame_sizeinvocation_failed_payload_too_largelets callers branch on payload-size failures distinct from genericinvocation_stoppedIIIPayloadTooLarge(Python, Node) /IIIError::PayloadTooLarge(Rust)Behavior change worth flagging
Python users get a 15× larger default. Before this PR, the Python SDK silently inherited the
websocketslibrary's 1 MiB max-message-size while Node and Rust effectively had no ceiling. Python callers occasionally saw mysteriousinvocation_stoppederrors that were actually the hidden 1 MiB cap firing.After this PR, the Python default is 16 MiB, matching Node, Rust, and the engine. Payloads between 1 MiB and 16 MiB that previously failed in Python will now succeed.
For payloads consistently above 16 MiB, use channels — they chunk over the same WebSocket without the per-message limit.
See
docs/changelog/0-11-0/payload-size-limit.mdxfor the full migration checklist, sizing notes (base64 + JSON envelope inflation), and the new error-codes reference.Verification
The commit message records local test results from the author. To re-verify on this PR:
cargo test --tests --no-fail-fastcleancd sdk/packages/python/iii && uv run pytest tests/test_payload_limits.pypnpm vitest run tests/payload-limits.test.tscargo test --test payload_limitsnpx mintlify validatefor the docs buildIII_URLpass against a live engine before release tagSummary by CodeRabbit
Release Notes
New Features
IIIPayloadTooLargeerror for client-side payload validationinvocation_failed_payload_too_largeerror code for oversized invocationsDocumentation
Bug Fixes