fix: properly detect zombie backend processes in is_running()#2400
fix: properly detect zombie backend processes in is_running()#2400bong-water-water-bong wants to merge 1 commit into
Conversation
When the backend is OOM-killed, the zombie process is not detected by kill(pid, 0), which returns success for zombies. This causes the watchdog to never restart the backend. Read /proc/<pid>/stat to check for 'Z' state before falling back to kill(pid, 0). The check is non-mutating -- reap_process() can still retrieve the real exit code. Closes lemonade-sdk#2298
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c44b7c7a8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ) | ||
| list(GET _PM_TEST_LIBS 0 _pm_lib0) | ||
| list(GET _PM_TEST_LIBS 1 _pm_lib1) | ||
| if(EXISTS "${_PM_TEST_SRC}" AND EXISTS "${_pm_lib0}" AND EXISTS "${_pm_lib1}") |
There was a problem hiding this comment.
Guard the Unix-only process test by platform
This block is enabled solely by file existence, so the new target is created on Windows too. The default Windows build will then try to compile test_process_manager.cpp/process_unix.cpp, which include Unix-only headers and link pthread/-pthread, breaking cmake --build --preset windows; gate this test to the Unix/Linux platforms it supports.
Useful? React with 👍 / 👎.
| } | ||
| } | ||
| int status = 0; | ||
| pid_t r = waitpid(child, &status, WNOHANG); |
There was a problem hiding this comment.
Stop reaping in the no-reap polling helper
If the fast _exit(42) child exits after the /proc stat is read but before this fallback runs, waitpid(..., WNOHANG) reaps it and the helper returns -1, so the first test fails before exercising ProcessManager::is_running(). Since this helper's contract is explicitly to wait without reaping, use only non-mutating polling such as /proc or waitid(..., WNOWAIT) here.
Useful? React with 👍 / 👎.
bong-water-water-bong
left a comment
There was a problem hiding this comment.
This PR removes the #ifdef __linux__ guards that PR #2298 correctly added. Two concerns:
process_unix.cppis not Linux-only — per CMakeLists.txt, it compiles on all UNIX platforms./proc/<pid>/statparsing is Linux-specific.- The CMake guard removal re-introduces a Windows build break that fl0rianr already flagged as a blocker on #2298.
PR #2298 is the more correct fix. Recommend closing this in favor of #2298.
Summary
Fixes the zombie backend process detection bug originally reported and fixed in #2298, with the review feedback from
superm1andfl0rianrapplied.When the llama-server backend is killed by the Linux OOM killer, it becomes a zombie process (
<defunct>). Theis_running()function useskill(pid, 0)as its fallback check, which returns success for zombie processes because the PID still exists in the kernel's process table. As a result, the watchdog never detects the exit and the backend is never automatically restarted.Changes
src/cpp/server/utils/platform/process_unix.cpp: Addedis_zombie_by_proc()helper that reads/proc/<pid>/statto detect zombie state without reaping the child. Called fromis_running()before thekill(pid, 0)fallback.test/cpp/test_process_manager.cpp: New standalone C++ test verifying the non-mutating contract:is_running()returns false for an exited (zombie) childreap_process()retrieves the real exit code afteris_running()returned falseCMakeLists.txt: Addedtest_process_managertarget, built alongside the process manager source files.Review feedback addressed (vs #2298)
#ifdef __linux__guards (this file is Linux-only)Closes #2298