A zero-overhead CLI tool for capturing stdout/stderr output from a running process (and optionally its entire descendant tree) using eBPF tracepoints.
Useful for attaching to processes that have already started, that redirect
their output to /dev/null, that are buried inside a service manager, or
that you simply don't want to restart.
The original use case was capturing output from cron jobs whose output nothing else was collecting. To be honest I never thought I'd get this working. It feels like pulling a rabbit out of a hat. 🪄🎩🐇
| Requirement | Notes |
|---|---|
| Linux kernel ≥ 5.8 | BPF_MAP_TYPE_PERCPU_ARRAY, perf buffer |
| clang | BPF target compilation |
| gcc | Userspace binary |
| libbpf + headers | `pkg-config libbpf` must work |
| libelf, zlib | Pulled in by libbpf |
| Root or CAP_BPF + CAP_PERFMON | Required to load and attach BPF programs |

Running `make` produces two files:

| File | Description |
|---|---|
| `bpf_write_monitor` | Userspace binary |
| `kern.o` | BPF bytecode; must live in the same directory as the binary |

Run `make clean` to remove both. To start monitoring:

    sudo bpf_write_monitor --pid <PID> [options]

| Option | Description |
|---|---|
| `--pid <PID>` | PID to monitor (required) |
| `--stdout` | Capture fd 1 (stdout) |
| `--stderr` | Capture fd 2 (stderr) |
| `--with-timestamp` | Prefix each output line with HH:MM:SS.mmm |
| `--with-origin-pid` | Prefix each output line with the writing PID |
| `--with-origin-process-name` | Prefix each output line with the process name |
| `--include-descendants` | Also monitor all current and future child processes |
| `--exclude-kernel` | Skip writes from kernel threads (bracketed comms, e.g. [kworker]) |

If neither --stdout nor --stderr is given, --stdout is assumed.
Press Ctrl-C to stop. Any partial (newline-less) output buffered at exit is flushed automatically.
Capture everything a process tree writes to stdout or stderr, with full context:

    sudo bpf_write_monitor \
      --pid 1234 \
      --stdout --stderr \
      --with-timestamp \
      --with-origin-pid \
      --with-origin-process-name \
      --include-descendants \
      --exclude-kernel

Watch only stderr of a single process, quietly:

    sudo bpf_write_monitor --pid 5678 --stderr

Trace all writes under PID 1 (system-wide), suppress kernel threads:

    sudo bpf_write_monitor --pid 1 --stdout --stderr \
      --include-descendants --exclude-kernel

Each line of output corresponds to one logical line of the target process's
output. The tool maintains a per-PID tail buffer to reassemble lines that
are split across multiple write() syscalls.
- If a write contains a newline, all complete lines are emitted immediately.
- If a write contains no newline and nothing was already buffered for that PID, the content is emitted immediately as-is (covers short-lived workers that write once without a trailing newline and then exit).
- If a write contains no newline but there are already buffered bytes for that PID (i.e. a line is being assembled across multiple syscalls), the new bytes are held until the newline arrives.
Concurrent output from multiple descendants is never interleaved — each PID has its own independent tail buffer.
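
The three buffering rules above can be modelled in a few lines of Python. This is only an illustrative sketch, not the tool's actual C code: the `LineAssembler` class and its `feed`/`flush` methods are hypothetical names, and payloads are treated as plain `bytes`.

```python
from collections import defaultdict


class LineAssembler:
    """Per-PID tail buffers that turn raw write payloads into whole lines."""

    def __init__(self):
        # pid -> bytes of a line that has started but not yet seen "\n"
        self.tails = defaultdict(bytes)

    def feed(self, pid, data):
        """Return the chunks to emit for one captured write."""
        tail = self.tails[pid]
        if b"\n" not in data:
            if not tail:
                # Nothing pending for this PID: emit the write as-is.
                return [data] if data else []
            # A line is already being assembled: keep holding.
            self.tails[pid] = tail + data
            return []
        # At least one newline: emit every complete line, keep the remainder.
        *lines, rest = (tail + data).split(b"\n")
        self.tails[pid] = rest
        return [line + b"\n" for line in lines]

    def flush(self, pid):
        """Emit whatever is still buffered (e.g. at exit)."""
        tail = self.tails.pop(pid, b"")
        return [tail] if tail else []
```

Because the buffer is keyed by PID, feeding `b"x\ny"` for one PID and `b"p\nq"` for another leaves two separate tails (`y` and `q`) that are completed independently by later writes.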
With all prefix options enabled the format is:

    HH:MM:SS.mmm <pid> <comm> <stdout|stderr>: <line>

Example:

    14:32:01.042 1847 nginx stdout: 2026/03/02 14:32:01 [notice] worker process started
    14:32:01.043 1851 nginx stderr: 2026/03/02 14:32:01 [error] connect() failed

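
As a rough illustration of the fully prefixed format, the Python sketch below rebuilds the first example line. The `format_line` helper is hypothetical and only covers the case where every prefix option is enabled; how the tool lays out lines when some prefixes are disabled is not modelled here.

```python
from datetime import datetime


def format_line(when, pid, comm, fd, line):
    """Build 'HH:MM:SS.mmm <pid> <comm> <stdout|stderr>: <line>'."""
    stream = {1: "stdout", 2: "stderr"}[fd]
    # strftime has no millisecond directive, so derive it from microseconds.
    stamp = when.strftime("%H:%M:%S.") + f"{when.microsecond // 1000:03d}"
    return f"{stamp} {pid} {comm} {stream}: {line}"
```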
ANSI escape sequences (colours, cursor movement, OSC hyperlinks, set-title, etc.)
are stripped from all output. Only printable ASCII, \t, and \n are passed
through.
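
One way to get this behaviour is to remove escape sequences with a regular expression and then drop every remaining byte outside the allowed set. The Python sketch below shows that approach; the `sanitize` name and the exact patterns are assumptions for illustration and may differ from what the tool does internally.

```python
import re

# OSC is tried first because ESC ] also falls in the generic two-byte range.
_ESCAPES = re.compile(
    rb"\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)"  # OSC (hyperlinks, set-title), ended by BEL or ESC \
    rb"|\x1b\[[0-?]*[ -/]*[@-~]"           # CSI (colours, cursor movement)
    rb"|\x1b[@-Z\\-_]"                     # remaining two-byte escapes
)


def sanitize(data: bytes) -> bytes:
    """Strip escape sequences, then keep printable ASCII, tab and newline."""
    data = _ESCAPES.sub(b"", data)
    return bytes(b for b in data if 0x20 <= b <= 0x7E or b in (0x09, 0x0A))
```

For example, `sanitize(b"\x1b[1;31merror\x1b[0m\n")` yields `b"error\n"`.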
For writev(), sendmsg(), and sendmmsg() calls, data is captured from
the first iovec only (up to 4096 bytes). All iovecs are still walked to
compute the correct total orig_len for the write.
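
The relationship between the captured payload and `orig_len` can be shown with a small Python model. It is a sketch under simplifying assumptions: each iovec is represented as a `bytes` object, and `capture_iovecs` and `MAX_PAYLOAD` are hypothetical names.

```python
MAX_PAYLOAD = 4096  # per-syscall capture cap described above


def capture_iovecs(iovecs):
    """Return (payload, orig_len) for a scatter-gather write."""
    # Every iovec contributes to the total requested length ...
    orig_len = sum(len(v) for v in iovecs)
    # ... but only the first one supplies captured bytes, truncated to the cap.
    payload = iovecs[0][:MAX_PAYLOAD] if iovecs else b""
    return payload, orig_len
```

With `[b"head: ", b"body\n"]` the payload is `b"head: "` while `orig_len` is 11, so a consumer can tell that 5 bytes of the write were not captured.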
For splice() and sendfile64(), which transfer data entirely in-kernel
with no userspace buffer, the tool emits an accounting line:

    14:33:02.001 1847 nginx stdout: [4096 bytes via kernel transfer]

For a detailed explanation of the design decisions, filtering strategy, BPF verifier constraints, and guidance on embedding this pattern elsewhere, see docs/architecture.md.
Eleven tracepoints are loaded from `kern.o`: eight syscall tracepoints and three scheduler lifecycle tracepoints.
| Tracepoint | Purpose |
|---|---|
| `syscalls/sys_enter_write` | Capture `write()` payload |
| `syscalls/sys_enter_pwrite64` | Capture `pwrite64()` payload |
| `syscalls/sys_enter_writev` | Capture first iovec of `writev()`; sum all iovecs for `orig_len` |
| `syscalls/sys_enter_sendto` | Capture `sendto()` payload (fd 1/2) |
| `syscalls/sys_enter_sendmsg` | Capture first iovec of `sendmsg()` (fd 1/2); sum all iovecs for `orig_len` |
| `syscalls/sys_enter_sendmmsg` | Capture first iovec of first message of `sendmmsg()`; sum iovecs for `orig_len` |
| `syscalls/sys_enter_splice` | Accounting event for `splice()` (no userspace buffer) |
| `syscalls/sys_enter_sendfile64` | Accounting event for `sendfile()` (no userspace buffer) |
| `sched/sched_process_fork` | Propagate monitoring to child PIDs in-kernel |
| `sched/sched_process_exit` | Remove dead PIDs from the map automatically |
| `sched/sched_process_exec` | Re-confirm monitoring survives `execve()` |

All filtering happens in the kernel before any memory copy or perf event is emitted. For non-matching PIDs the only cost is a single hash map lookup.
| Map | Type | Purpose |
|---|---|---|
| `events` | PERF_EVENT_ARRAY | Per-CPU perf ring buffer to userspace |
| `monitored_pids` | HASH | pid → fd_mask (bit 0 = stdout, bit 1 = stderr); sized to pid_max at load time |
| `write_probes_enabled` | ARRAY | Global on/off flag; checked before everything else in `should_capture()` |
| `self_pid` | ARRAY | PID of the monitor process itself; excluded from capture to prevent feedback loops |
| `scratch` | PERCPU_ARRAY | Per-CPU scratch buffer for the 4 KB `write_event` struct (avoids the 512-byte BPF stack limit) |

Two independent mechanisms ensure zero overhead when the tool is not actively capturing:

- Detached probes: write syscall tracepoints are attached after PIDs are registered and the global flag is set. They are detached before the flag is cleared on exit. When the tool is not running, the kernel has no BPF programs attached to these tracepoints.
- `write_probes_enabled` flag: checked as the very first instruction in `should_capture()`. A single `BPF_MAP_TYPE_ARRAY` read (direct indexed, not hashed) exits immediately when the value is 0. This guards against any race between flag state and probe lifetime.

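
To make the filtering order concrete, here is a userspace Python model of the decision that `should_capture()` is described as making. It is a sketch, not the BPF source: the maps are modelled as a plain dict named `state`, and the position of the `self_pid` check relative to the `monitored_pids` lookup is an assumption.

```python
STDOUT_BIT = 1 << 0  # bit 0 of fd_mask
STDERR_BIT = 1 << 1  # bit 1 of fd_mask


def should_capture(state, pid, fd):
    """Decide whether a write by `pid` to `fd` should produce an event."""
    # 1. Global flag first: a cheap indexed read that short-circuits everything.
    if not state["write_probes_enabled"]:
        return False
    # 2. Never capture the monitor's own output (assumed to be checked here).
    if pid == state["self_pid"]:
        return False
    # 3. One hash lookup; PIDs that are not monitored stop here.
    mask = state["monitored_pids"].get(pid)
    if mask is None:
        return False
    # 4. Match the file descriptor against the per-PID mask.
    if fd == 1:
        return bool(mask & STDOUT_BIT)
    if fd == 2:
        return bool(mask & STDERR_BIT)
    return False
```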
The attach order is intentional to eliminate race windows:
1. Attach fork / exit / exec probes
2. Register PIDs into monitored_pids (+ scan /proc for existing descendants)
3. Set write_probes_enabled = 1
4. Attach write syscall probes
Teardown is the strict reverse:
1. Clear write_probes_enabled = 0
2. Destroy write probe links
3. Destroy lifecycle probe links
4. Close BPF object
When `--include-descendants` is used:

- At startup, `/proc` is scanned once to build a flat PID→PPid table. A BFS traversal registers every existing descendant into `monitored_pids`.
- During execution, the `sched_process_fork` tracepoint propagates the parent's `fd_mask` to new children entirely in-kernel with zero latency. `sched_process_exec` re-registers the PID after `execve()` so monitoring survives image replacement. `sched_process_exit` removes the dead PID from the map when the thread group leader exits.

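
The startup scan and the fork/exit bookkeeping can be sketched in Python as follows. All names here (`find_descendants`, `on_fork`, `on_exit`) are hypothetical, the PID→PPid table is passed in as a dict rather than read from `/proc`, and the real fork and exit handling runs inside the BPF programs rather than in userspace.

```python
from collections import defaultdict, deque


def find_descendants(ppid_of, root):
    """BFS over a flat {pid: ppid} table; returns all descendants of root."""
    children = defaultdict(list)
    for pid, ppid in ppid_of.items():
        children[ppid].append(pid)
    found, seen, queue = [], {root}, deque([root])
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found


def on_fork(monitored, parent, child):
    """Model of the fork hook: a child inherits its parent's fd_mask."""
    if parent in monitored:
        monitored[child] = monitored[parent]


def on_exit(monitored, pid):
    """Model of the exit hook: forget a PID that has gone away."""
    monitored.pop(pid, None)
```

The `seen` set is not strictly needed for a well-formed process tree, but it keeps the traversal safe if the table was read while processes were being created or reaped.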
The process comm is captured by the BPF program via bpf_get_current_comm()
at the exact moment of the write() syscall. This is immune to TOCTOU races
that would affect a /proc/<pid>/comm lookup in userspace after the fact.
- Payload capped at 4096 bytes per syscall: larger writes are captured truncated; `orig_len` in the event records the actual requested size.
- `writev`/`sendmsg`/`sendmmsg` capture first iovec only: data is read from `iov[0]` only; subsequent iovecs are walked solely to accumulate the correct `orig_len`. This restriction exists because the BPF verifier rejects variable-offset pointer arithmetic into map values on the kernel versions targeted. The total byte count is still accurate; only the captured payload is limited to the first scatter-gather segment.
- `sendmmsg` captures first message only: only the first `mmsghdr`'s first iovec is read; subsequent messages in the batch are not.
- `splice`/`sendfile` carry no payload: these syscalls transfer data between file descriptors entirely in-kernel with no userspace buffer to read. The tool records the byte count and emits an accounting line; the actual content is not available.
- x86-64 build only tested: the Makefile automatically selects the correct `-D__TARGET_ARCH_*` flag for x86-64, aarch64, armv7l, and riscv64, but only x86-64 has been run and verified. The tracepoint argument struct layouts are architecture-independent (they come from the kernel's format files), so other architectures should work in principle.