Skip to content

fix(kubernetes): treat null exit code as success in hydrateWithArchive#1915

Open
Buktal wants to merge 1 commit into
agentscope-ai:mainfrom
Buktal:fix/hydrate-archive-websocket-race-v2
Open

fix(kubernetes): treat null exit code as success in hydrateWithArchive#1915
Buktal wants to merge 1 commit into
agentscope-ai:mainfrom
Buktal:fix/hydrate-archive-websocket-race-v2

Conversation

@Buktal

@Buktal Buktal commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Problem

hydrateWithArchive gets stuck or reports false "archive upload failed (exit=null)" when uploading tar archives to K8s sandbox pods.

Root Cause

Two bugs in one:

  1. Fabric8 / K8s exec WebSocket v4 protocol limitation — there is no way to close stdin without closing the entire WebSocket. When ExecWatch.close() runs, the WebSocket is shut before exit status (stream 3) arrives, so exitCode completes with null.

  2. Previous fix (fix(kubernetes): fix WebSocket race in hydrateWithArchive causing exit=null #1903) is incorrectexitCode.get() was moved inside the try-with-resources block, causing a deadlock: get() blocks waiting for exit code, but ExecWatch.close() (which triggers stdin EOF → cat exits → exit code arrives) only runs at try end.

Fix

// Before (#1903, wrong): exitCode.get() inside try → deadlock
try (ExecWatch watch = ...) {
    exitCode.get(130s);  // blocks forever — close() never runs
}

// After (correct): get() outside try, null = success
CompletableFuture<Integer> f;
try (ExecWatch watch = ...) {
    f = watch.exitCode();
}
Integer code = f.get(130s);  // close() already ran
if (code != null && code != 0) { throw ...; }
// null = WebSocket closed before exit status arrived, but data was written

Evidence from Fabric8 API

Fabric8's own ExecWatch.exitCode() JavaDoc documents this behavior:

/**
 * Will be -1 if the exit code can't be determined from the status,
 * or null if close is received before the exit code.
 *
 * See https://github.com/kubernetes/kubernetes/issues/89899
 * - which explains there's currently no way to indicate
 * end of input over a websocket, so you may not get
 * an exit code when using stdIn.
 */
CompletableFuture<Integer> exitCode();

Null exit code is expected behavior when using stdin with K8s exec WebSocket v4 protocolthe process ran and exited normally, but the WebSocket closed before the exit status message could arrive. This is not a real failure.

Testing

- Null exit code test updated from assertThrowsassertDoesNotThrow
- Non-zero exit code still correctly throws
- Zero exit code still passes successfully

The Fabric8 K8s exec WebSocket v4 protocol lacks per-channel stdin close.
ExecWatch.close() shuts the entire WebSocket before exit status (stream 3)
arrives, causing exitCode to complete with null.

Previous fix (agentscope-ai#1903) moved exitCode.get() inside the try block, which caused
a deadlock: get() blocks waiting for the exit code, but ExecWatch.close()
(which triggers stdin EOF and allows cat to exit) only runs at try end.

Correct approach: keep get() outside try (no deadlock), treat null as success.
Fabric8's ExecWatch.exitCode() docs confirm this is expected:
"there's currently no way to indicate end of input over a websocket, so you
may not get an exit code when using stdIn."
@Buktal Buktal requested a review from a team June 25, 2026 12:15
@codecov

codecov Bot commented Jun 25, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...andbox/kubernetes/Fabric8KubernetesPodRuntime.java 80.00% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@itxaiohanglover

Copy link
Copy Markdown
Contributor

Nice fix! Moving exitCode().get() outside the try-with-resources is the right call — the ExecWatch resources need to close first. Treating null exit code as success makes sense for cat which exits cleanly on EOF. The error message with exit code and stderr buffer is much more helpful for debugging.

@Buktal

Buktal commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Priority PR. Previous fix was buggy & sub-par. Needs optimization. @chickenlj

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants