hparadiz/bpf_write_monitor

bpf_write_monitor

A CLI tool for capturing stdout/stderr output from a running process (and optionally its entire descendant tree) using eBPF tracepoints. It adds zero overhead while idle and filters in-kernel while capturing.

Useful for attaching to processes that have already started, that redirect their output to /dev/null, that are buried inside a service manager, or that you simply don't want to restart.


The intended use case for this tool was to capture output from cron jobs where output was not captured by anything. To be honest I never thought I'd get this working. It feels like pulling a rabbit out of a hat. 🪄🎩🐇


Requirements

| Requirement | Notes |
| --- | --- |
| Linux kernel ≥ 5.8 | BPF_MAP_TYPE_PERCPU_ARRAY, perf buffer |
| clang | BPF target compilation |
| gcc | Userspace binary |
| libbpf + headers | pkg-config libbpf must work |
| libelf, zlib | Pulled in by libbpf |
| Root or CAP_BPF + CAP_PERFMON | Required to load and attach BPF programs |

Build

make

Produces two files:

| File | Description |
| --- | --- |
| bpf_write_monitor | Userspace binary |
| kern.o | BPF bytecode; must live in the same directory as the binary |

make clean   # remove both

Usage

sudo bpf_write_monitor --pid <PID> [options]

Options

| Option | Description |
| --- | --- |
| --pid <PID> | PID to monitor (required) |
| --stdout | Capture fd 1 (stdout) |
| --stderr | Capture fd 2 (stderr) |
| --with-timestamp | Prefix each output line with HH:MM:SS.mmm |
| --with-origin-pid | Prefix each output line with the writing PID |
| --with-origin-process-name | Prefix each output line with the process name |
| --include-descendants | Also monitor all current and future child processes |
| --exclude-kernel | Skip writes from kernel threads (bracketed comms, e.g. [kworker]) |

If neither --stdout nor --stderr is given, --stdout is assumed.
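
The default can be pictured as a small bitmask computation. The following is a minimal sketch, assuming the bit layout given for the monitored_pids map (bit 0 = stdout, bit 1 = stderr). The names FD_MASK_STDOUT, FD_MASK_STDERR, build_fd_mask, and fd_selected are hypothetical and not taken from the tool's source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the bit layout follows the monitored_pids description. */
#define FD_MASK_STDOUT (1u << 0) /* fd 1 */
#define FD_MASK_STDERR (1u << 1) /* fd 2 */

/* Build the per-PID fd mask from the two CLI flags.
 * If neither flag was given, fall back to stdout only. */
uint32_t build_fd_mask(bool want_stdout, bool want_stderr)
{
    uint32_t mask = 0;

    if (want_stdout)
        mask |= FD_MASK_STDOUT;
    if (want_stderr)
        mask |= FD_MASK_STDERR;
    if (mask == 0)
        mask = FD_MASK_STDOUT;
    return mask;
}

/* True if a write to fd should be captured under mask. */
bool fd_selected(uint32_t mask, int fd)
{
    if (fd == 1)
        return (mask & FD_MASK_STDOUT) != 0;
    if (fd == 2)
        return (mask & FD_MASK_STDERR) != 0;
    return false;
}
```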

Press Ctrl-C to stop. Any partial (newline-less) output buffered at exit is flushed automatically.

Examples

Capture everything a process tree writes to stdout or stderr, with full context:

sudo bpf_write_monitor \
  --pid 1234 \
  --stdout --stderr \
  --with-timestamp \
  --with-origin-pid \
  --with-origin-process-name \
  --include-descendants \
  --exclude-kernel

Watch only stderr of a single process, quietly:

sudo bpf_write_monitor --pid 5678 --stderr

Trace all writes under PID 1 (system-wide), suppress kernel threads:

sudo bpf_write_monitor --pid 1 --stdout --stderr \
  --include-descendants --exclude-kernel

Output format

Each line of output corresponds to one logical line of the target process's output. The tool maintains a per-PID tail buffer to reassemble lines that are split across multiple write() syscalls.

  • If a write contains a newline, all complete lines are emitted immediately.
  • If a write contains no newline and nothing was already buffered for that PID, the content is emitted immediately as-is (covers short-lived workers that write once without a trailing newline and then exit).
  • If a write contains no newline but there are already buffered bytes for that PID (i.e. a line is being assembled across multiple syscalls), the new bytes are held until the newline arrives.

Concurrent output from multiple descendants is never interleaved — each PID has its own independent tail buffer.
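
The three rules above can be sketched in userspace C roughly as follows. This is an illustration for a single PID, not the tool's actual code: struct tail_buf, TAIL_CAP, and the function names are hypothetical, and the sketch assumes that bytes following the last newline of a write are kept in the tail buffer until the line completes.

```c
#include <stdio.h>
#include <string.h>

#define TAIL_CAP 4096 /* hypothetical capacity of one tail buffer */

/* Tail buffer for a single PID; the per-PID lookup is left out. */
struct tail_buf {
    char   data[TAIL_CAP];
    size_t len;
};

/* Append bytes to the tail; anything beyond TAIL_CAP is dropped here. */
void tail_append(struct tail_buf *t, const char *p, size_t n)
{
    size_t room = TAIL_CAP - t->len;

    if (n > room)
        n = room;
    memcpy(t->data + t->len, p, n);
    t->len += n;
}

/* Emit one logical line: any pending tail first, then the new segment. */
void emit_line(struct tail_buf *t, const char *seg, size_t n, FILE *out)
{
    fwrite(t->data, 1, t->len, out);
    fwrite(seg, 1, n, out);
    fputc('\n', out);
    t->len = 0;
}

/* Feed the payload of one write() for this PID. */
void feed_write(struct tail_buf *t, const char *buf, size_t n, FILE *out)
{
    const char *nl;

    if (n == 0)
        return;
    nl = memchr(buf, '\n', n);
    if (!nl) {
        if (t->len == 0)
            emit_line(t, buf, n, out); /* nothing pending: emit as-is */
        else
            tail_append(t, buf, n);    /* line in progress: hold */
        return;
    }
    while (nl) {                       /* emit every complete line */
        size_t seg = (size_t)(nl - buf);

        emit_line(t, buf, seg, out);
        buf += seg + 1;
        n   -= seg + 1;
        nl = n ? memchr(buf, '\n', n) : NULL;
    }
    tail_append(t, buf, n);            /* assumption: keep the trailing partial */
}
```

Under the same assumptions, flushing at exit amounts to calling emit_line() with an empty segment for every PID whose tail buffer is non-empty.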

With all prefix options enabled the format is:

HH:MM:SS.mmm <pid> <comm> <stdout|stderr>: <line>

Example:

14:32:01.042 1847 nginx stdout: 2026/03/02 14:32:01 [notice] worker process started
14:32:01.043 1851 nginx stderr: 2026/03/02 14:32:01 [error] connect() failed

ANSI escape sequences (colours, cursor movement, OSC hyperlinks, set-title, etc.) are stripped from all output. Only printable ASCII, \t, and \n are passed through.
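
One way to implement this kind of filter is a small state machine. The sketch below is an assumption about the approach, not the tool's actual code: strip_ansi is a hypothetical name, state is not carried across calls, and only CSI sequences, OSC strings (terminated by BEL or ESC \), and two-byte escapes are recognised.

```c
#include <stddef.h>

/* Copy src to dst, dropping ANSI escape sequences and any byte that is not
 * printable ASCII, '\t' or '\n'. dst must hold at least n bytes.
 * Returns the number of bytes written. */
size_t strip_ansi(const char *src, size_t n, char *dst)
{
    enum { TEXT, ESC, CSI, OSC, OSC_ESC } st = TEXT;
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char)src[i];

        switch (st) {
        case TEXT:
            if (c == 0x1b)
                st = ESC;
            else if ((c >= 0x20 && c <= 0x7e) || c == '\t' || c == '\n')
                dst[out++] = (char)c;
            break;
        case ESC:
            if (c == '[')
                st = CSI;
            else if (c == ']')
                st = OSC;
            else
                st = TEXT;             /* two-byte sequence: swallow this byte */
            break;
        case CSI:
            if (c >= 0x40 && c <= 0x7e) /* final byte ends the sequence */
                st = TEXT;
            break;
        case OSC:
            if (c == 0x07)             /* BEL terminator */
                st = TEXT;
            else if (c == 0x1b)
                st = OSC_ESC;
            break;
        case OSC_ESC:
            if (c == '\\')             /* ESC \ is the string terminator */
                st = TEXT;
            else if (c != 0x1b)
                st = OSC;
            break;
        }
    }
    return out;
}
```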

For writev(), sendmsg(), and sendmmsg() calls, data is captured from the first iovec only (up to 4096 bytes). All iovecs are still walked to compute the correct total orig_len for the write.
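
The split between "capture from the first iovec" and "sum every iovec" looks roughly like this when modelled in plain userspace C. It is only an analogy: the real BPF program has to read user memory through a helper such as bpf_probe_read_user(), and struct write_event_sketch, captured_len, and capture_iov are hypothetical names (only orig_len appears in this README).

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

#define MAX_PAYLOAD 4096

/* Hypothetical event layout; only orig_len is named in the README. */
struct write_event_sketch {
    size_t orig_len;     /* total bytes requested across all iovecs */
    size_t captured_len; /* bytes actually copied into data */
    char   data[MAX_PAYLOAD];
};

void capture_iov(const struct iovec *iov, int iovcnt,
                 struct write_event_sketch *ev)
{
    ev->orig_len = 0;
    ev->captured_len = 0;

    /* Walk every segment so the total length is accurate. */
    for (int i = 0; i < iovcnt; i++)
        ev->orig_len += iov[i].iov_len;

    /* Copy payload from iov[0] only, capped at MAX_PAYLOAD. */
    if (iovcnt > 0) {
        size_t n = iov[0].iov_len;

        if (n > MAX_PAYLOAD)
            n = MAX_PAYLOAD;
        memcpy(ev->data, iov[0].iov_base, n);
        ev->captured_len = n;
    }
}
```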

For splice() and sendfile64() — which transfer data entirely in-kernel with no userspace buffer — the tool emits an accounting line:

14:33:02.001 1847 nginx stdout: [4096 bytes via kernel transfer]

Architecture

For a detailed explanation of the design decisions, filtering strategy, BPF verifier constraints, and guidance on embedding this pattern elsewhere, see docs/architecture.md.

BPF programs (kern.c)

Eleven tracepoints are loaded from kern.o: eight syscall tracepoints and three scheduler tracepoints.

| Tracepoint | Purpose |
| --- | --- |
| syscalls/sys_enter_write | Capture write() payload |
| syscalls/sys_enter_pwrite64 | Capture pwrite64() payload |
| syscalls/sys_enter_writev | Capture first iovec of writev(); sum all iovecs for orig_len |
| syscalls/sys_enter_sendto | Capture sendto() payload (fd 1/2) |
| syscalls/sys_enter_sendmsg | Capture first iovec of sendmsg() (fd 1/2); sum all iovecs for orig_len |
| syscalls/sys_enter_sendmmsg | Capture first iovec of first message of sendmmsg(); sum iovecs for orig_len |
| syscalls/sys_enter_splice | Accounting event for splice() (no userspace buffer) |
| syscalls/sys_enter_sendfile64 | Accounting event for sendfile() (no userspace buffer) |
| sched/sched_process_fork | Propagate monitoring to child PIDs in-kernel |
| sched/sched_process_exit | Remove dead PIDs from the map automatically |
| sched/sched_process_exec | Re-confirm monitoring survives execve() |

All filtering happens in the kernel before any memory copy or perf event is emitted. For non-matching PIDs the cost is essentially one array read (the write_probes_enabled flag) and one hash map lookup.

BPF maps

| Map | Type | Purpose |
| --- | --- | --- |
| events | PERF_EVENT_ARRAY | Per-CPU perf ring buffer to userspace |
| monitored_pids | HASH | pid → fd_mask (bit 0 = stdout, bit 1 = stderr); sized to pid_max at load time |
| write_probes_enabled | ARRAY | Global on/off flag; checked before everything else in should_capture() |
| self_pid | ARRAY | PID of the monitor process itself; excluded from capture to prevent feedback loops |
| scratch | PERCPU_ARRAY | Per-CPU scratch buffer for the 4 KB write_event struct (avoids the 512-byte BPF stack limit) |

Idle overhead

Two independent mechanisms ensure zero overhead when the tool is not actively capturing:

  1. Detached probes — write syscall tracepoints are attached after PIDs are registered and the global flag is set. They are detached before the flag is cleared on exit. When the tool is not running, the kernel has no knowledge of these tracepoints.

  2. write_probes_enabled flag — checked as the very first instruction in should_capture(). A single BPF_MAP_TYPE_ARRAY read (direct indexed, not hashed) exits immediately when the value is 0. This guards against any race between flag state and probe lifetime.
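
As a rough userspace model of the check order described above, the sketch below puts the flag first, then the self-PID exclusion, then the PID lookup and fd mask test. Everything here is illustrative: struct filter_state and should_capture_model are hypothetical, a linear scan stands in for the BPF hash map, and the position of the fd and self-PID checks relative to the lookup is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_PIDS 8 /* toy table; the real map is a BPF hash sized to pid_max */

struct filter_state {
    uint32_t write_probes_enabled; /* global on/off flag */
    uint32_t self_pid;             /* the monitor's own PID */
    uint32_t pids[MAX_PIDS];       /* monitored PIDs */
    uint32_t masks[MAX_PIDS];      /* fd_mask per PID: bit 0 stdout, bit 1 stderr */
    int      count;
};

bool should_capture_model(const struct filter_state *s, uint32_t pid, int fd)
{
    if (!s->write_probes_enabled)  /* cheapest check first */
        return false;
    if (fd != 1 && fd != 2)        /* only stdout and stderr are of interest */
        return false;
    if (pid == s->self_pid)        /* never capture the monitor's own output */
        return false;

    /* Linear scan standing in for the hash map lookup. */
    for (int i = 0; i < s->count; i++)
        if (s->pids[i] == pid)
            return (s->masks[i] & (1u << (fd - 1))) != 0;
    return false;
}
```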

Startup sequence

The attach order is intentional to eliminate race windows:

1. Attach fork / exit / exec probes
2. Register PIDs into monitored_pids  (+ scan /proc for existing descendants)
3. Set write_probes_enabled = 1
4. Attach write syscall probes

Teardown runs in roughly the reverse order:

1. Clear write_probes_enabled = 0
2. Destroy write probe links
3. Destroy lifecycle probe links
4. Close BPF object

Descendant tracking

When --include-descendants is used:

  • At startup, /proc is scanned once to build a flat PID→PPid table. A BFS traversal registers every existing descendant into monitored_pids.
  • During execution, the sched_process_fork tracepoint propagates the parent's fd_mask to new children entirely in-kernel with zero latency.
  • sched_process_exec re-registers the PID after execve() so monitoring survives image replacement.
  • sched_process_exit removes the dead PID from the map when the process group leader exits.
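
The startup BFS over the flat PID→PPid table can be sketched as follows. This is a minimal illustration under stated assumptions: the /proc scan that fills the table is omitted, and struct proc_row and collect_descendants are hypothetical names rather than the tool's own.

```c
#include <stddef.h>
#include <sys/types.h>

/* One row of the flat table built from /proc (the scan itself is omitted). */
struct proc_row {
    pid_t pid;
    pid_t ppid;
};

/* Breadth-first walk from root over the flat table. Writes root and every
 * descendant into out (capacity cap) and returns how many were written. */
size_t collect_descendants(const struct proc_row *tbl, size_t n,
                           pid_t root, pid_t *out, size_t cap)
{
    size_t head = 0, tail = 0;

    if (cap == 0)
        return 0;
    out[tail++] = root; /* out doubles as the BFS queue */

    while (head < tail) {
        pid_t parent = out[head++];

        /* Enqueue every direct child of the PID just dequeued. */
        for (size_t i = 0; i < n && tail < cap; i++)
            if (tbl[i].ppid == parent)
                out[tail++] = tbl[i].pid;
    }
    return tail;
}
```

Each PID in the result would then be registered in monitored_pids with the root's fd_mask. Reusing the output array as the queue keeps the walk allocation-free, and the capacity bound also guarantees termination on a malformed table.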

Process name capture

The process comm is captured by the BPF program via bpf_get_current_comm() at the exact moment of the write() syscall. This is immune to TOCTOU races that would affect a /proc/<pid>/comm lookup in userspace after the fact.


Limitations

  • Payload capped at 4096 bytes per syscall — larger writes are captured truncated; orig_len in the event records the actual requested size.

  • writev/sendmsg/sendmmsg capture first iovec only — data is read from iov[0] only; subsequent iovecs are walked solely to accumulate the correct orig_len. This restriction exists because the BPF verifier rejects variable-offset pointer arithmetic into map values on the kernel versions targeted. The total byte count is still accurate; only the captured payload is limited to the first scatter-gather segment.

  • sendmmsg captures first message only — only the first mmsghdr's first iovec is read; subsequent messages in the batch are not.

  • splice/sendfile carry no payload — these syscalls transfer data between file descriptors entirely in-kernel with no userspace buffer to read. The tool records the byte count and emits an accounting line; the actual content is not available.

  • x86-64 build only tested — the Makefile automatically selects the correct -D__TARGET_ARCH_* flag for x86-64, aarch64, armv7l, and riscv64, but only x86-64 has been run and verified. The tracepoint argument struct layouts are architecture-independent (they come from the kernel's format files), so other architectures should work in principle.
