
fix: properly detect zombie backend processes in is_running()#2298

Open
aarononeal wants to merge 3 commits into
lemonade-sdk:mainfrom
aarononeal:fix/zombie-backend-detection

Conversation


@aarononeal aarononeal commented Jun 18, 2026


Bug: Zombie backend process not detected by watchdog, no automatic restart

Summary

When the llama-server backend is killed by the Linux OOM killer, it becomes a zombie process (<defunct>). The lemond watchdog fails to detect this zombie as "exited" because is_running() uses kill(pid, 0) as its fallback check, which returns success for zombie processes. As a result, the backend remains in a dead/zombie state indefinitely and is never automatically restarted.

Environment

  • lemond version: 10.6.0
  • OS: Linux (Ubuntu, AMD ROCm / Strix Halo)
  • Container: Docker (restart: unless-stopped)
  • Backend: llamacpp (vulkan)

Reproduction Steps

  1. Run lemond with a large model that requires significant RAM (e.g., Qwen3.6-35B-A3B with 256k context)
  2. Send a prompt long enough to trigger OOM (the llama-server process uses ~91GB+ RAM)
  3. The Linux OOM killer terminates the llama-server process
  4. The process becomes a zombie ([llama-server] <defunct>)
  5. lemond does not detect the exit — the zombie persists indefinitely
  6. All subsequent requests fail with CURL error: Couldn't connect to server
  7. No automatic restart occurs

Expected Behavior

The watchdog should detect that the backend process has exited (including zombie state) and either:

  • Automatically restart the backend, or
  • At minimum, clean up the zombie and report the backend as unavailable so a retry can trigger a reload

Actual Behavior

The zombie process persists. is_running() returns true for the zombie, so:

  • The watchdog never calls request_backend_reset_from_watchdog()
  • The process handle and port are never consumed/cleaned up
  • All new requests fail with connection errors
  • The only recovery is to manually kill the zombie or restart the container

Root Cause

In src/cpp/server/utils/platform/process_unix.cpp, the is_running() function:

bool UnixProcessPlatform::is_running(ProcessHandle handle) {
    if (handle.pid <= 0) {
        return false;
    }

#ifdef WNOWAIT
    siginfo_t info;
    std::memset(&info, 0, sizeof(info));
    if (waitid(P_PID, static_cast<id_t>(handle.pid), &info, WEXITED | WNOHANG | WNOWAIT) == 0) {
        return info.si_pid == 0;
    }

    if (errno == ECHILD) {
        return false;
    }
#endif

    // BUG: kill(pid, 0) returns 0 for zombies (PID still in kernel table)
    errno = 0;
    return ::kill(handle.pid, 0) == 0 || errno == EPERM;
}

The waitid() with WNOWAIT detects exited children non-mutatingly when available, but on platforms without WNOWAIT (or when waitid() fails for reasons other than ECHILD), the kill(pid, 0) fallback is reached. kill(pid, 0) returns 0 (success) for zombie processes because the PID still exists in the kernel's process table, making is_running() incorrectly report the zombie as "running."
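
To make the zombie behavior of kill(pid, 0) concrete, here is a minimal standalone sketch. It is not lemond code: kill_succeeds_on_zombie() is a hypothetical demo helper, and it assumes a POSIX system where waitid() supports WNOWAIT. The helper forks a child that exits at once, waits for the exit without reaping, then probes the PID with signal 0.

```cpp
#include <cassert>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical demo helper (not part of lemond): returns true if kill(pid, 0)
// still succeeds for a child that has exited but has not been reaped.
bool kill_succeeds_on_zombie() {
    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        _exit(42);  // child: exit immediately
    }

    // Block until the child has exited, but leave it waitable (WNOWAIT),
    // so it remains a zombie in the process table.
    siginfo_t info{};
    int rc = waitid(P_PID, static_cast<id_t>(pid), &info, WEXITED | WNOWAIT);
    assert(rc == 0);
    (void)rc;

    // Signal 0 only checks that the PID exists and may be signalled.
    bool kill_ok = (::kill(pid, 0) == 0);

    // Clean up: reap the zombie so the demo leaves nothing behind.
    int status = 0;
    waitpid(pid, &status, 0);
    return kill_ok;
}
```

Calling kill_succeeds_on_zombie() from a small main() or a test should return true on Linux, which is the false positive described above.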

Fix Applied

The fix uses a Linux-specific /proc/<pid>/stat check to detect zombie processes without reaping them (preserving the non-mutating contract of is_running()).

is_zombie_by_proc() helper (process_unix.cpp:93-124)

#ifdef __linux__
static bool is_zombie_by_proc(pid_t pid) {
    char path[64];
    std::snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    std::ifstream f(path);
    if (!f.is_open()) {
        return false;
    }
    std::string line;
    if (!std::getline(f, line)) {
        return false;
    }
    // /proc/<pid>/stat format: pid (comm) field3 field4 ...
    // The comm field may contain spaces, so find both parentheses.
    auto open_paren = line.find('(');
    if (open_paren == std::string::npos) {
        return false;
    }
    auto close_paren = line.rfind(')');
    if (close_paren == std::string::npos || close_paren <= open_paren) {
        return false;
    }
    // Field 1 after ')' is the process state character.
    std::string rest = line.substr(close_paren + 1);
    std::istringstream iss(rest);
    char state;
    if (!(iss >> state)) {
        return false;
    }
    return state == 'Z';
}
#endif

Updated is_running() (process_unix.cpp:398-423)

bool UnixProcessPlatform::is_running(ProcessHandle handle) {
    if (handle.pid <= 0) {
        return false;
    }

#ifdef WNOWAIT
    siginfo_t info;
    std::memset(&info, 0, sizeof(info));
    if (waitid(P_PID, static_cast<id_t>(handle.pid), &info, WEXITED | WNOHANG | WNOWAIT) == 0) {
        return info.si_pid == 0;
    }

    if (errno == ECHILD) {
        return false;
    }
#endif

#ifdef __linux__
    if (is_zombie_by_proc(handle.pid)) {
        return false;
    }
#endif

    errno = 0;
    return ::kill(handle.pid, 0) == 0 || errno == EPERM;
}

Design decisions

  • Non-mutating is_running(): The fix deliberately avoids waitpid() with WNOHANG because reaping inside is_running() would break the contract — reap_process() needs to retrieve the actual exit code later. The /proc/<pid>/stat approach detects zombies purely by reading the state character, without consuming the child.
  • Platform scoped: The /proc check is #ifdef __linux__ only. macOS and other Unix platforms rely on the WNOWAIT path (available on Linux and some BSDs) or fall through to kill(pid, 0).
  • Defense in depth: The detection order is: (1) waitid(WNOWAIT) for portable non-mutating exit detection, (2) /proc/<pid>/stat for Linux zombie detection, (3) kill(pid, 0) as a final fallback for "does the PID exist."
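
The non-mutating property can be illustrated with a small standalone sketch. It is not the PR's code: proc_state(), PeekResult, and peek_then_reap() are hypothetical names, and it assumes Linux with /proc mounted. The sketch reads the state character of an exited, unreaped child and then reaps it to show that the exit code is still available.

```cpp
#include <fstream>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: state character from /proc/<pid>/stat, or '\0' on failure.
char proc_state(pid_t pid) {
    std::ifstream f("/proc/" + std::to_string(pid) + "/stat");
    std::string line;
    if (!std::getline(f, line)) {
        return '\0';
    }
    // Format: "pid (comm) S ..."; comm may contain ')' so search from the right.
    auto close_paren = line.rfind(')');
    if (close_paren == std::string::npos || close_paren + 2 >= line.size()) {
        return '\0';
    }
    return line[close_paren + 2];
}

struct PeekResult {
    char state;     // state seen before reaping
    int exit_code;  // exit code retrieved afterwards, or -1
};

// Hypothetical demo: peek at an exited child via /proc, then reap it.
PeekResult peek_then_reap() {
    pid_t pid = fork();
    if (pid < 0) {
        return {'\0', -1};
    }
    if (pid == 0) {
        _exit(42);  // child: exit with a recognizable code
    }
    // Wait for the exit without consuming it, so the child stays a zombie.
    siginfo_t info{};
    waitid(P_PID, static_cast<id_t>(pid), &info, WEXITED | WNOWAIT);

    char state = proc_state(pid);  // read-only probe

    int status = 0;
    int code = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status)) {
        code = WEXITSTATUS(status);
    }
    return {state, code};
}
```

On Linux, peek_then_reap() is expected to report state 'Z' and exit code 42, i.e. the /proc probe did not consume the child's status.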

Tests added (test/cpp/test_process_manager.cpp)

A standalone C++ test verifies the non-mutating contract:

| Test | What it verifies |
| --- | --- |
| is_running() returns false for exited child | Spawns a child, waits for exit without reaping, confirms is_running() returns false without consuming the zombie |
| reap_process() returns real exit code 42 | After is_running() returned false, reap_process() retrieves the actual exit code |
| is_running() returns false for PID 0 / negative / non-existent | Edge cases for invalid PIDs |
| is_running() returns true for running process | Confirms live processes are still detected |
| reap_process() returns -1 for running process | Confirms reaping a live process is a no-op |
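
The edge cases can be sketched outside the real test file. The following is not the contents of test_process_manager.cpp: pid_exists() is a hypothetical stand-in that mirrors only the handle.pid guard and the final kill(pid, 0) fallback, and EdgeCases / probe_edge_cases() are illustrative names.

```cpp
#include <cerrno>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for the pid guard plus the kill(pid, 0) fallback.
bool pid_exists(pid_t pid) {
    if (pid <= 0) {
        return false;  // PID 0 and negative PIDs are never "running"
    }
    errno = 0;
    return ::kill(pid, 0) == 0 || errno == EPERM;
}

struct EdgeCases {
    bool zero;        // result for PID 0
    bool negative;    // result for a negative PID
    bool live_child;  // result for a child that is still running
    bool after_reap;  // result for the same PID once the child was reaped
};

EdgeCases probe_edge_cases() {
    EdgeCases r{};
    r.zero = pid_exists(0);
    r.negative = pid_exists(-1);

    pid_t pid = fork();
    if (pid < 0) {
        return r;
    }
    if (pid == 0) {
        pause();  // child: sleep until a signal arrives
        _exit(0);
    }
    r.live_child = pid_exists(pid);

    // Terminate and reap the child; afterwards the PID no longer exists
    // (assuming the kernel has not already reused it).
    ::kill(pid, SIGKILL);
    int status = 0;
    waitpid(pid, &status, 0);
    r.after_reap = pid_exists(pid);
    return r;
}
```

A test would expect live_child to be true and the other three fields to be false.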

CMake integration

Test target test_process_manager is conditionally built on Linux only (CMakeLists.txt), compiling the test alongside the process manager source files.

Impact

  • Severity: High — causes complete service outage until manual intervention
  • Frequency: Occurs whenever the backend is OOM-killed or otherwise terminated externally
  • Recovery: With the fix applied, zombies are detected and the watchdog can trigger backend restart
  • Affected backends: All backends managed through WrappedServer (llamacpp, vllm, sd_server, etc.)

Workaround (pre-fix)

Manually kill the zombie process and restart the container:

docker exec lemonade-256k kill -9 <zombie_pid>
docker restart lemonade-256k

Or set LEMONADE_BACKEND_WATCHDOG=0 and rely on request-time detection (though this also has the zombie detection issue).

Logs

$ dmesg | grep -i oom
oom-kill:constraint=CONSTRAINT_NONE,...,task=llama-server,pid=3659732
Out of memory: Killed process 3659732 (llama-server) total-vm:135423044kB, anon-rss:95329868kB

$ docker exec lemonade-256k ps aux
root          43  8.3  0.0      0     0 ?        Z    03:13   7:25 [llama-server] <defunct>

$ docker logs --tail 5 lemonade-256k
2026-06-18 03:39:38.924 [Error] (HttpClient) Stream ended with: Transferred a partial file
2026-06-18 03:39:39.867 [Error] (HttpClient) CURL error: Couldn't connect to server
2026-06-18 03:39:39.870 [Error] (WrappedServer) Streaming request failed: CURL error: Couldn't connect to server

The zombie persists indefinitely with no watchdog activity after the crash.

When a backend process (e.g. llama-server) is OOM-killed by the kernel,
it becomes a zombie (<defunct>) until reaped by its parent. The existing
is_running() fallback used kill(pid, 0) which returns success for zombies
since the PID still exists in the kernel process table. This caused the
watchdog to incorrectly report the zombie as running, preventing
automatic restart.

Fix: Add a waitpid(pid, &status, WNOHANG) call before the kill(pid, 0)
fallback to reap any zombie process. If waitpid returns > 0, the process
has exited (including zombie state), so return false (not running). Only
fall through to kill(pid, 0) if the process is truly alive.
@github-actions github-actions Bot added the bug Something isn't working label Jun 18, 2026

@fl0rianr fl0rianr left a comment

Collaborator

Thanks for fixing this; the problem statement is valid, but I don’t think this implementation is safe to merge as-is.

ProcessManager::is_running() is explicitly documented as non-mutating on POSIX and must not reap children. Calling waitpid(..., WNOHANG) here can consume the child status from frequent health/status checks before the owning cleanup path has consumed the handle, which risks stale PID handles, PID reuse issues, and losing the real exit code for later reap_process() logging.

The existing architecture already separates liveness probing from cleanup: the watchdog consumes the handle once and then explicitly calls reap_process(). We should preserve that model.

Suggested direction: keep is_running() read-only. On Linux, make the kill(pid, 0) fallback zombie-aware without reaping, for example by checking /proc/<pid>/status or /proc/<pid>/stat for zombie state before falling back to kill(). Also, please add a test that verifies is_running() returns false for an exited child while a subsequent reap_process() still retrieves the real exit code.

Use /proc/<pid>/stat to detect zombie processes instead of waitpid(),
preventing unintended reaping. Add CMake test target and unit tests
to verify the non-mutating contract is preserved.
@aarononeal

Author

Good call and thanks for reviewing. I've revised the patch accordingly.

@fl0rianr

Collaborator

Thanks for adapting - this addresses my points.

One remaining blocker: the new CMake test target is not guarded as Linux-only. It directly compiles process_unix.cpp, and the test includes POSIX headers such as unistd.h and sys/wait.h, so this can break Windows builds. Since the actual fix is Linux /proc-specific, please wrap this test target in a Linux-only guard, e.g. if(CMAKE_SYSTEM_NAME STREQUAL "Linux" ...).

Minor: the PR description still documents the old waitpid(..., WNOHANG) approach. Would be nice if you could update this as well. Thanks!

@aarononeal

Author

@fl0rianr Thanks! Another good catch and PR updated.

@sawansri sawansri requested a review from superm1 June 19, 2026 17:19
}
#endif

#ifdef __linux__

Member

what's the point of the #ifdef check? Isn't this file only loaded on Linux anyway? Please don't litter the codebase with extra #ifdef that I spent a lot of effort in bf4c70b to remove.

}
#endif

#ifdef __linux__

Member

pointless #ifdef, see my other comment.

#include <thread>
#include <chrono>

#ifdef __linux__

Member

pointless #ifdef, see other comments

@superm1 superm1 left a comment

Member

A few obvious things to change please. These will adjust quite a bit, I'll review more in depth after you've finished.

  1. Please remove all pointless #ifdef
  2. Please also review all comments, I think many are unnecessarily wordy.


Labels

bug Something isn't working


3 participants