bpf_write_monitor

A low-overhead CLI tool for capturing stdout/stderr output from a running process (and optionally its entire descendant tree) using eBPF tracepoints. Nothing stays attached in the kernel when the tool is not running.

Useful for attaching to processes that have already started, that redirect their output to /dev/null, that are buried inside a service manager, or that you simply don't want to restart.

The intended use case for this tool was to capture output from cron jobs where output was not captured by anything. To be honest I never thought I'd get this working. It feels like pulling a rabbit out of a hat. 🪄🎩🐇

Requirements

Requirement	Notes
Linux kernel ≥ 5.8	`BPF_MAP_TYPE_PERCPU_ARRAY`, perf buffer; `CAP_BPF` and `CAP_PERFMON` were added in 5.8
`clang`	BPF target compilation
`gcc`	Userspace binary
`libbpf` + headers	`pkg-config libbpf` must work
`libelf`, `zlib`	Pulled in by libbpf
Root or `CAP_BPF` + `CAP_PERFMON`	Required to load and attach BPF programs

Build

make

Produces two files:

File	Description
`bpf_write_monitor`	Userspace binary
`kern.o`	BPF bytecode — must live in the same directory as the binary

make clean   # remove both

Usage

sudo bpf_write_monitor --pid <PID> [options]

Options

Option	Description
`--pid <PID>`	PID to monitor (required)
`--stdout`	Capture fd 1 (stdout)
`--stderr`	Capture fd 2 (stderr)
`--with-timestamp`	Prefix each output line with `HH:MM:SS.mmm`
`--with-origin-pid`	Prefix each output line with the writing PID
`--with-origin-process-name`	Prefix each output line with the process name
`--include-descendants`	Also monitor all current and future child processes
`--exclude-kernel`	Skip writes from kernel threads (bracketed comms, e.g. `[kworker]`)

If neither --stdout nor --stderr is given, --stdout is assumed.

Press Ctrl-C to stop. Any partial (newline-less) output buffered at exit is flushed automatically.

Examples

Capture everything a process tree writes to stdout or stderr, with full context:

sudo bpf_write_monitor \
  --pid 1234 \
  --stdout --stderr \
  --with-timestamp \
  --with-origin-pid \
  --with-origin-process-name \
  --include-descendants \
  --exclude-kernel

Watch only stderr of a single process, quietly:

sudo bpf_write_monitor --pid 5678 --stderr

Trace all writes under PID 1 (system-wide), suppress kernel threads:

sudo bpf_write_monitor --pid 1 --stdout --stderr \
  --include-descendants --exclude-kernel

Output format

Each line of output corresponds to one logical line of the target process's output. The tool maintains a per-PID tail buffer to reassemble lines that are split across multiple write() syscalls.

If a write contains a newline, all complete lines are emitted immediately.
If a write contains no newline and nothing was already buffered for that PID, the content is emitted immediately as-is (covers short-lived workers that write once without a trailing newline and then exit).
If a write contains no newline but there are already buffered bytes for that PID (i.e. a line is being assembled across multiple syscalls), the new bytes are held until the newline arrives.

Concurrent output from multiple descendants is never interleaved within a line: each PID has its own independent tail buffer, so a line is only emitted once it has been assembled from that PID's writes alone.
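The three rules above can be sketched in plain userspace C. This is an illustration, not the tool's actual code: `struct tail_buf`, `feed_write()`, `struct sink`, and `sink_emit()` are hypothetical names, and the sketch assumes that bytes left over after the last newline of a write are what ends up in the tail buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TAIL_CAP 8192

/* Hypothetical per-PID state: the bytes of a line that is still incomplete. */
struct tail_buf {
    char   data[TAIL_CAP];
    size_t len;
};

/* Called once per emitted chunk; the chunk does not include the '\n'. */
typedef void (*emit_fn)(const char *line, size_t len, void *ctx);

static void tail_append(struct tail_buf *t, const char *p, size_t n)
{
    if (n > TAIL_CAP - t->len)
        n = TAIL_CAP - t->len;          /* sketch only: overlong lines are cut */
    memcpy(t->data + t->len, p, n);
    t->len += n;
}

/* Feed the payload of one captured write for a single PID. */
static void feed_write(struct tail_buf *t, const char *buf, size_t n,
                       emit_fn emit, void *ctx)
{
    assert(t != NULL && emit != NULL);
    if (n == 0)
        return;

    const char *nl = memchr(buf, '\n', n);
    if (nl == NULL) {
        if (t->len == 0)
            emit(buf, n, ctx);          /* no newline, nothing buffered: emit as-is */
        else
            tail_append(t, buf, n);     /* no newline, line in progress: hold */
        return;
    }

    while (nl != NULL) {                /* newline present: emit every complete line */
        size_t seg = (size_t)(nl - buf);
        if (t->len > 0) {
            tail_append(t, buf, seg);
            emit(t->data, t->len, ctx);
            t->len = 0;
        } else {
            emit(buf, seg, ctx);
        }
        n  -= seg + 1;
        buf = nl + 1;
        nl  = n ? memchr(buf, '\n', n) : NULL;
    }
    if (n > 0)
        tail_append(t, buf, n);         /* assumption: the leftover starts a new tail */
}

/* Example sink: collects emitted lines, one per '\n', into a string. */
struct sink {
    char   out[256];
    size_t len;
    int    count;
};

static void sink_emit(const char *line, size_t len, void *ctx)
{
    struct sink *s = ctx;
    if (s->len + len + 2 > sizeof(s->out))
        return;
    memcpy(s->out + s->len, line, len);
    s->len += len;
    s->out[s->len++] = '\n';
    s->out[s->len]   = '\0';
    s->count++;
}
```

Keeping one `struct tail_buf` per PID is what prevents two writers from contributing to the same output line.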

With all prefix options enabled the format is:

HH:MM:SS.mmm <pid> <comm> <stdout|stderr>: <line>

Example:

14:32:01.042 1847 nginx stdout: 2026/03/02 14:32:01 [notice] worker process started
14:32:01.043 1851 nginx stderr: 2026/03/02 14:32:01 [error] connect() failed
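For anyone parsing this output, the full-prefix layout can be reproduced with a single `snprintf()` call. The helper below is a hypothetical sketch (the name `format_line()` and its parameters are not taken from the tool) and only covers the case where every prefix option is enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Builds "HH:MM:SS.mmm <pid> <comm> <stdout|stderr>: <line>" into dst.
 * Returns what snprintf() returns: the length the full string needs. */
static int format_line(char *dst, size_t cap,
                       int hour, int min, int sec, int msec,
                       int pid, const char *comm, int fd, const char *line)
{
    assert(dst != NULL && comm != NULL && line != NULL);
    return snprintf(dst, cap, "%02d:%02d:%02d.%03d %d %s %s: %s",
                    hour, min, sec, msec, pid, comm,
                    fd == 2 ? "stderr" : "stdout", line);
}
```

Calling it with `14, 32, 1, 42, 1847, "nginx", 1` and a line of text yields a string that starts with `14:32:01.042 1847 nginx stdout: `.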

ANSI escape sequences (colours, cursor movement, OSC hyperlinks, set-title, etc.) are stripped from all output. Only printable ASCII, \t, and \n are passed through.
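A filter with this behaviour might look roughly like the following. It is a sketch under assumptions, not the tool's implementation: `strip_ansi()` is a hypothetical name, and it handles CSI sequences (`ESC [ ... final byte`), OSC sequences (`ESC ] ...` ended by BEL or `ESC \`), and treats any other escape as a two-byte sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Copies src[0..n) to dst, dropping escape sequences and every byte that is
 * not printable ASCII, '\t', or '\n'. dst must have room for n bytes.
 * Returns the number of bytes written. */
static size_t strip_ansi(const char *src, size_t n, char *dst)
{
    size_t i = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (i < n) {
        unsigned char c = (unsigned char)src[i];

        if (c == 0x1b) {                        /* ESC starts a sequence */
            i++;
            if (i >= n)
                break;
            c = (unsigned char)src[i];
            if (c == '[') {
                /* CSI: parameter bytes, then one final byte in 0x40..0x7e */
                i++;
                while (i < n) {
                    unsigned char p = (unsigned char)src[i];
                    if (p >= 0x40 && p <= 0x7e)
                        break;
                    i++;
                }
                if (i < n)
                    i++;                        /* consume the final byte */
            } else if (c == ']') {
                /* OSC: runs until BEL or the string terminator ESC \ */
                i++;
                while (i < n) {
                    if (src[i] == 0x07) {
                        i++;
                        break;
                    }
                    if (src[i] == 0x1b && i + 1 < n && src[i + 1] == '\\') {
                        i += 2;
                        break;
                    }
                    i++;
                }
            } else {
                i++;                            /* assumed two-byte escape */
            }
            continue;
        }
        if ((c >= 0x20 && c <= 0x7e) || c == '\t' || c == '\n')
            dst[out++] = (char)c;
        i++;
    }
    return out;
}
```

A sequence cut off at the end of the input is dropped rather than passed through, which keeps partial escapes out of the output.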

For writev(), sendmsg(), and sendmmsg() calls, data is captured from the first iovec only (up to 4096 bytes). All iovecs are still walked to compute the correct total orig_len for the write.
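The split between captured payload and reported length can be modelled in userspace C as below. This is only an analogue of the in-kernel logic: `struct captured` and `capture_iov()` are hypothetical names, and a real BPF program would have to read user memory through a helper such as `bpf_probe_read_user()` rather than `memcpy()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

#define MAX_PAYLOAD 4096

/* Hypothetical event layout: total requested size plus the captured prefix. */
struct captured {
    size_t orig_len;            /* sum of iov_len over every iovec */
    size_t cap_len;             /* bytes actually copied into data */
    char   data[MAX_PAYLOAD];
};

static void capture_iov(const struct iovec *iov, int iovcnt, struct captured *ev)
{
    assert(ev != NULL);
    ev->orig_len = 0;
    ev->cap_len  = 0;

    /* Walk all iovecs, but only to add up the total length. */
    for (int i = 0; i < iovcnt; i++)
        ev->orig_len += iov[i].iov_len;

    /* Copy data from the first iovec only, capped at MAX_PAYLOAD. */
    if (iovcnt > 0 && iov[0].iov_base != NULL) {
        size_t n = iov[0].iov_len;
        if (n > MAX_PAYLOAD)
            n = MAX_PAYLOAD;
        memcpy(ev->data, iov[0].iov_base, n);
        ev->cap_len = n;
    }
}
```

A consumer can compare `cap_len` with `orig_len` to tell whether the captured payload is complete.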

For splice() and sendfile64() — which transfer data entirely in-kernel with no userspace buffer — the tool emits an accounting line:

14:33:02.001 1847 nginx stdout: [4096 bytes via kernel transfer]

Architecture

For a detailed explanation of the design decisions, filtering strategy, BPF verifier constraints, and guidance on embedding this pattern elsewhere, see docs/architecture.md.

BPF programs (`kern.c`)

Eleven tracepoints are loaded from kern.o: eight syscall tracepoints and three scheduler tracepoints.

Tracepoint	Purpose
`syscalls/sys_enter_write`	Capture `write()` payload
`syscalls/sys_enter_pwrite64`	Capture `pwrite64()` payload
`syscalls/sys_enter_writev`	Capture first iovec of `writev()`; sum all iovecs for `orig_len`
`syscalls/sys_enter_sendto`	Capture `sendto()` payload (fd 1/2)
`syscalls/sys_enter_sendmsg`	Capture first iovec of `sendmsg()` (fd 1/2); sum all iovecs for `orig_len`
`syscalls/sys_enter_sendmmsg`	Capture first iovec of first message of `sendmmsg()`; sum iovecs for `orig_len`
`syscalls/sys_enter_splice`	Accounting event for `splice()` (no userspace buffer)
`syscalls/sys_enter_sendfile64`	Accounting event for `sendfile()` (no userspace buffer)
`sched/sched_process_fork`	Propagate monitoring to child PIDs in-kernel
`sched/sched_process_exit`	Remove dead PIDs from the map automatically
`sched/sched_process_exec`	Re-confirm monitoring survives `execve()`

All filtering happens in the kernel before any memory copy or perf event is emitted. For non-matching PIDs the only cost is the global flag check plus a single hash map lookup.

BPF maps

Map	Type	Purpose
`events`	`PERF_EVENT_ARRAY`	Per-CPU perf ring buffer to userspace
`monitored_pids`	`HASH`	`pid → fd_mask` (bit 0 = stdout, bit 1 = stderr); sized to `pid_max` at load time
`write_probes_enabled`	`ARRAY`	Global on/off flag; checked before everything else in `should_capture()`
`self_pid`	`ARRAY`	PID of the monitor process itself; excluded from capture to prevent feedback loops
`scratch`	`PERCPU_ARRAY`	Per-CPU scratch buffer for the 4 KB `write_event` struct (avoids the 512-byte BPF stack limit)
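To show how these maps combine into one capture decision, here is a plain C model. It is not the BPF code: the maps are replaced by ordinary struct fields, `should_capture_model()` and the `FD_MASK_*` macros are hypothetical names, and the position of the `self_pid` check relative to the hash lookup is an assumption. Only the flag being checked first and the bit layout of `fd_mask` come from the table above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FD_MASK_STDOUT (1u << 0)    /* bit 0 = stdout */
#define FD_MASK_STDERR (1u << 1)    /* bit 1 = stderr */
#define MAX_PIDS 8

struct pid_entry {
    uint32_t pid;
    uint8_t  fd_mask;
};

/* Stand-ins for write_probes_enabled, self_pid, and monitored_pids. */
struct model_maps {
    uint32_t         write_probes_enabled;
    uint32_t         self_pid;
    struct pid_entry monitored[MAX_PIDS];
    int              n_monitored;
};

static bool should_capture_model(const struct model_maps *m, uint32_t pid, int fd)
{
    assert(m != NULL);
    if (!m->write_probes_enabled)       /* global flag is checked first */
        return false;
    if (pid == m->self_pid)             /* never capture the monitor itself */
        return false;
    if (fd != 1 && fd != 2)
        return false;

    uint8_t want = (fd == 1) ? FD_MASK_STDOUT : FD_MASK_STDERR;

    /* Linear scan stands in for the hash map lookup on monitored_pids. */
    for (int i = 0; i < m->n_monitored; i++) {
        if (m->monitored[i].pid == pid)
            return (m->monitored[i].fd_mask & want) != 0;
    }
    return false;
}
```

With `--stderr` alone, a monitored PID would carry `fd_mask == FD_MASK_STDERR`, so its writes to fd 1 are rejected before any payload is copied.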

Idle overhead

Two independent mechanisms ensure zero overhead when the tool is not actively capturing:

Detached probes: write syscall tracepoints are attached after PIDs are registered and the global flag is set. They are detached before the flag is cleared on exit. When the tool is not running, no BPF program is attached to these tracepoints, so they cost nothing.
write_probes_enabled flag — checked as the very first instruction in should_capture(). A single BPF_MAP_TYPE_ARRAY read (direct indexed, not hashed) exits immediately when the value is 0. This guards against any race between flag state and probe lifetime.

Startup sequence

The attach order is intentional to eliminate race windows:

1. Attach fork / exit / exec probes
2. Register PIDs into monitored_pids  (+ scan /proc for existing descendants)
3. Set write_probes_enabled = 1
4. Attach write syscall probes

Teardown is the strict reverse:

1. Destroy write probe links
2. Clear write_probes_enabled = 0
3. Destroy lifecycle probe links
4. Close BPF object

Descendant tracking

When --include-descendants is used:

At startup, /proc is scanned once to build a flat PID→PPid table. A BFS traversal registers every existing descendant into monitored_pids.
During execution, the sched_process_fork tracepoint propagates the parent's fd_mask to new children entirely in-kernel, with no round trip through userspace, so a child is registered before it can issue its first write.
sched_process_exec re-registers the PID after execve() so monitoring survives image replacement.
sched_process_exit removes the dead PID from the map when the thread group leader (the process's main thread) exits.
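The startup scan can be pictured as a breadth-first walk over the flat table. The sketch below assumes the PID→PPid pairs have already been read from /proc into an array; `struct proc_row` and `collect_descendants()` are hypothetical names, not the tool's own.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* One row of the flat table built from /proc. */
struct proc_row {
    pid_t pid;
    pid_t ppid;
};

/* Writes every descendant of root (root itself excluded) to out in BFS order.
 * out doubles as the BFS queue. Returns the number of PIDs written, which is
 * at most out_cap. */
static size_t collect_descendants(const struct proc_row *tbl, size_t n,
                                  pid_t root, pid_t *out, size_t out_cap)
{
    size_t head = 0, tail = 0;
    pid_t cur = root;

    assert(tbl != NULL || n == 0);
    assert(out != NULL || out_cap == 0);
    for (;;) {
        /* Enqueue all direct children of the PID being visited. */
        for (size_t i = 0; i < n; i++) {
            if (tbl[i].ppid == cur && tbl[i].pid != root && tail < out_cap)
                out[tail++] = tbl[i].pid;
        }
        if (head == tail)
            break;                  /* queue drained: traversal complete */
        cur = out[head++];
    }
    return tail;
}
```

Each returned PID would then be inserted into monitored_pids with the root's fd_mask. The scan is quadratic in the table size, which is acceptable for a one-off pass at startup.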

Process name capture

The process comm is captured by the BPF program via bpf_get_current_comm() at the exact moment of the write() syscall. This is immune to TOCTOU races that would affect a /proc/<pid>/comm lookup in userspace after the fact.

Limitations

Payload capped at 4096 bytes per syscall: larger writes are captured truncated; orig_len in the event records the actual requested size.
writev/sendmsg/sendmmsg capture first iovec only: data is read from iov[0] only; subsequent iovecs are walked solely to accumulate the correct orig_len. This restriction exists because the BPF verifier rejects variable-offset pointer arithmetic into map values on the kernel versions targeted. The total byte count is still accurate; only the captured payload is limited to the first scatter-gather segment.
sendmmsg captures first message only: only the first mmsghdr's first iovec is read; subsequent messages in the batch are not.
splice/sendfile carry no payload: these syscalls transfer data between file descriptors entirely in-kernel with no userspace buffer to read. The tool records the byte count and emits an accounting line; the actual content is not available.
Only tested on x86-64: the Makefile automatically selects the correct -D__TARGET_ARCH_* flag for x86-64, aarch64, armv7l, and riscv64, but only the x86-64 build has been run and verified. The tracepoint argument struct layouts are architecture-independent (they come from the kernel's format files), so other architectures should work in principle.
