fix: properly detect zombie backend processes in is_running()#2298
fix: properly detect zombie backend processes in is_running()#2298aarononeal wants to merge 3 commits into
Conversation
When a backend process (e.g. llama-server) is OOM-killed by the kernel, it becomes a zombie (<defunct>) until reaped by its parent. The existing is_running() fallback used kill(pid, 0) which returns success for zombies since the PID still exists in the kernel process table. This caused the watchdog to incorrectly report the zombie as running, preventing automatic restart. Fix: Add a waitpid(pid, &status, WNOHANG) call before the kill(pid, 0) fallback to reap any zombie process. If waitpid returns > 0, the process has exited (including zombie state), so return false (not running). Only fall through to kill(pid, 0) if the process is truly alive.
fl0rianr
left a comment
There was a problem hiding this comment.
Thanks for fixing this; the problem statement is valid, but I don’t think this implementation is safe to merge as-is.
ProcessManager::is_running() is explicitly documented as non-mutating on POSIX and must not reap children. Calling waitpid(..., WNOHANG) here can consume the child status from frequent health/status checks before the owning cleanup path has consumed the handle, which risks stale PID handles, PID reuse issues, and losing the real exit code for later reap_process() logging.
The existing architecture already separates liveness probing from cleanup: the watchdog consumes the handle once and then explicitly calls reap_process(). We should preserve that model.
Suggested direction: keep is_running() read-only. On Linux, make the kill(pid, 0) fallback zombie-aware without reaping, for example by checking /proc//status or /proc//stat for zombie state before falling back to kill(). Also, please add a test that verifies is_running() returns false for an exited child while a subsequent reap_process() still retrieves the real exit code.
Use /proc/<pid>/stat to detect zombie processes instead of waitpid(), preventing unintended reaping. Add CMake test target and unit tests to verify the non-mutating contract is preserved.
|
Good call and thanks for reviewing. I've revised the patch accordingly. |
|
Thanks for adapting - this addresses my points. One remaining blocker: the new CMake test target is not guarded as Linux-only. It directly compiles process_unix.cpp, and the test includes POSIX headers such as unistd.h and sys/wait.h, so this can break Windows builds. Since the actual fix is Linux /proc-specific, please wrap this test target in a Linux-only guard, e.g. if(CMAKE_SYSTEM_NAME STREQUAL "Linux" ...). Minor: the PR description still documents the old waitpid(..., WNOHANG) approach. Would be nice if you could update this as well. Thanks! |
|
@fl0rianr Thanks! Another good catch and PR updated. |
| } | ||
| #endif | ||
|
|
||
| #ifdef __linux__ |
There was a problem hiding this comment.
what's the point of the #ifdef check? Isn't this file only loaded on Linux anyway? Please don't litter the codebase with extra #ifdef that I spent a lot of effort in bf4c70b to remove.
| } | ||
| #endif | ||
|
|
||
| #ifdef __linux__ |
There was a problem hiding this comment.
pointless #ifdef, see my other comment.
| #include <thread> | ||
| #include <chrono> | ||
|
|
||
| #ifdef __linux__ |
There was a problem hiding this comment.
pointless #ifdef, see other comments
superm1
left a comment
There was a problem hiding this comment.
A few obvious things to change please. These will adjust quite a bit, I'll review more in depth after you've finished.
- Please remove all pointless
#ifdef - Please also review all comments, I think may are unnecessarily wordy.
Bug: Zombie backend process not detected by watchdog, no automatic restart
Summary
When the llama-server backend is killed by the Linux OOM killer, it becomes a zombie process (
<defunct>). The lemond watchdog fails to detect this zombie as "exited" becauseis_running()useskill(pid, 0)as its fallback check, which returns success for zombie processes. As a result, the backend remains in a dead/zombie state indefinitely and is never automatically restarted.Environment
restart: unless-stopped)Reproduction Steps
[llama-server] <defunct>)CURL error: Couldn't connect to serverExpected Behavior
The watchdog should detect that the backend process has exited (including zombie state) and either:
Actual Behavior
The zombie process persists.
is_running()returnstruefor the zombie, so:request_backend_reset_from_watchdog()Root Cause
In
src/cpp/server/utils/platform/process_unix.cpp, theis_running()function:The
waitid()withWNOWAITdetects exited children non-mutatingly when available, but on platforms withoutWNOWAIT(or whenwaitid()fails for reasons other thanECHILD), thekill(pid, 0)fallback is reached.kill(pid, 0)returns 0 (success) for zombie processes because the PID still exists in the kernel's process table, makingis_running()incorrectly report the zombie as "running."Fix Applied
The fix uses a Linux-specific
/proc/<pid>/statcheck to detect zombie processes without reaping them (preserving the non-mutating contract ofis_running()).is_zombie_by_proc()helper (process_unix.cpp:93-124)Updated
is_running()(process_unix.cpp:398-423)Design decisions
is_running(): The fix deliberately avoidswaitpid()withWNOHANGbecause reaping insideis_running()would break the contract —reap_process()needs to retrieve the actual exit code later. The/proc/<pid>/statapproach detects zombies purely by reading the state character, without consuming the child./proccheck is#ifdef __linux__only. macOS and other Unix platforms rely on theWNOWAITpath (available on Linux and some BSDs) or fall through tokill(pid, 0).waitid(WNOWAIT)for portable non-mutating exit detection, (2)/proc/<pid>/statfor Linux zombie detection, (3)kill(pid, 0)as a final fallback for "does the PID exist."Tests added (
test/cpp/test_process_manager.cpp)A standalone C++ test verifies the non-mutating contract:
is_running() returns false for exited childis_running()returns false without consuming the zombiereap_process() returns real exit code 42is_running()returned false,reap_process()retrieves the actual exit codeis_running() returns false for PID 0 / negative / non-existentis_running() returns true for running processreap_process() returns -1 for running processCMake integration
Test target
test_process_manageris conditionally built on Linux only (CMakeLists.txt), compiling the test alongside the process manager source files.Impact
WrappedServer(llamacpp, vllm, sd_server, etc.)Workaround (pre-fix)
Manually kill the zombie process and restart the container:
Or set
LEMONADE_BACKEND_WATCHDOG=0and rely on request-time detection (though this also has the zombie detection issue).Logs
The zombie persists indefinitely with no watchdog activity after the crash.