Skip to content

fix: properly detect zombie backend processes in is_running()#2400

Open
bong-water-water-bong wants to merge 1 commit into
lemonade-sdk:mainfrom
bong-water-water-bong:fix/zombie-backend-detection
Open

fix: properly detect zombie backend processes in is_running()#2400
bong-water-water-bong wants to merge 1 commit into
lemonade-sdk:mainfrom
bong-water-water-bong:fix/zombie-backend-detection

Conversation

@bong-water-water-bong

Copy link
Copy Markdown
Contributor

Summary

Fixes the zombie backend process detection bug originally reported and fixed in #2298, with the review feedback from superm1 and fl0rianr applied.

When the llama-server backend is killed by the Linux OOM killer, it becomes a zombie process (<defunct>). The is_running() function uses kill(pid, 0) as its fallback check, which returns success for zombie processes because the PID still exists in the kernel's process table. As a result, the watchdog never detects the exit and the backend is never automatically restarted.

Changes

  • src/cpp/server/utils/platform/process_unix.cpp: Added is_zombie_by_proc() helper that reads /proc/<pid>/stat to detect zombie state without reaping the child. Called from is_running() before the kill(pid, 0) fallback.

  • test/cpp/test_process_manager.cpp: New standalone C++ test verifying the non-mutating contract:

    • is_running() returns false for an exited (zombie) child
    • reap_process() retrieves the real exit code after is_running() returned false
    • Edge cases for invalid PIDs
    • Live process detection
    • Running process reaping returns -1
  • CMakeLists.txt: Added test_process_manager target, built alongside the process manager source files.

Review feedback addressed (vs #2298)

  • Removed all unnecessary #ifdef __linux__ guards (this file is Linux-only)
  • Trimmed verbose comments in the helper and test file

Closes #2298

When the backend is OOM-killed, the zombie process is not detected
by kill(pid, 0), which returns success for zombies. This causes
the watchdog to never restart the backend.

Read /proc/<pid>/stat to check for 'Z' state before falling back
to kill(pid, 0). The check is non-mutating -- reap_process() can
still retrieve the real exit code.

Closes lemonade-sdk#2298
@github-actions github-actions Bot added bug Something isn't working cpp labels Jun 24, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c44b7c7a8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread CMakeLists.txt
)
list(GET _PM_TEST_LIBS 0 _pm_lib0)
list(GET _PM_TEST_LIBS 1 _pm_lib1)
if(EXISTS "${_PM_TEST_SRC}" AND EXISTS "${_pm_lib0}" AND EXISTS "${_pm_lib1}")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Guard the Unix-only process test by platform

This block is enabled solely by file existence, so the new target is created on Windows too. The default Windows build will then try to compile test_process_manager.cpp/process_unix.cpp, which include Unix-only headers and link pthread/-pthread, breaking cmake --build --preset windows; gate this test to the Unix/Linux platforms it supports.

Useful? React with 👍 / 👎.

}
}
int status = 0;
pid_t r = waitpid(child, &status, WNOHANG);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Stop reaping in the no-reap polling helper

If the fast _exit(42) child exits after the /proc stat is read but before this fallback runs, waitpid(..., WNOHANG) reaps it and the helper returns -1, so the first test fails before exercising ProcessManager::is_running(). Since this helper's contract is explicitly to wait without reaping, use only non-mutating polling such as /proc or waitid(..., WNOWAIT) here.

Useful? React with 👍 / 👎.

@bong-water-water-bong bong-water-water-bong left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR removes the #ifdef __linux__ guards that PR #2298 correctly added. Two concerns:

  1. process_unix.cpp is not Linux-only — per CMakeLists.txt, it compiles on all UNIX platforms. /proc/<pid>/stat parsing is Linux-specific.
  2. The CMake guard removal re-introduces a Windows build break that fl0rianr already flagged as a blocker on #2298.

PR #2298 is the more correct fix. Recommend closing this in favor of #2298.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cpp

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant