A realtime memory, swap, and model-residency monitor for MLX and other local LLM workloads on Apple Silicon.
top doesn't tell you whether your 70B model is actually pinned in unified memory or quietly being paged to swap. llmtop does.
- System memory — wired / active / inactive / compressed / free, plus total RAM.
- Swap activity — used vs. total, instantaneous in/out rate, and cumulative counters. Swap-out rate is highlighted red because for an LLM workload it usually means you're about to evict weights.
- Memory pressure — kernel pressure level (`Normal` / `Warning` / `Critical` from `kern.memorystatus_vm_pressure_level`), compressor and decompressor throughput in MB/s, file-backed pageout rate (red when non-zero — catches mmap'd weight eviction before swap moves), and purgeable headroom.
- `iogpu.wired_limit_mb` — the GPU wired-memory cap (`auto` or whatever you've sysctl'd).
- Active models — unified view of resident models, with size on disk vs. resident memory and a residency `%`. Pulled from:
  - the ollama local API (`/api/ps` on `127.0.0.1:11434`),
  - the omlx local API (`/v1/models/status`, using the API key in `~/.omlx/settings.json`),
  - the LM Studio local API (`/api/v0/models` on `127.0.0.1:1234`, override with `LMSTUDIO_PORT`),
  - per-process open file descriptors for `.safetensors` / `.gguf` / `.mlx` weights (authoritative — works for `mlx_lm`, `vllm`, `llama.cpp`, LM Studio's helper process, etc.),
  - in-process MLX / PyTorch / llama.cpp / vLLM loads detected from mapped engine libraries (`libmlx.dylib`, `libtorch*`, `libllama`, …), then attributed to the most recently accessed HuggingFace cache snapshot. Covers the common case where Python reads safetensors into the Metal heap and immediately closes the fd.
  - process command-line as a fallback.
- Matching processes — top-N by RSS, filtered by name/cmdline patterns.
- Footer — short git rev of the running `llmtop` checkout (`-dirty` if there are local edits).
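As a rough illustration of where the system-memory figures could come from, the sketch below parses `vm_stat` output into the megabyte categories listed above. It is not llmtop's actual code: `parse_vm_stat`, the `SAMPLE` text, and the mapping from `vm_stat` labels to categories are assumptions made for this example.

```python
import re

# Hypothetical, trimmed vm_stat output used only for this example.
SAMPLE = """\
Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                                4096.
Pages active:                            262144.
Pages inactive:                          131072.
Pages wired down:                         65536.
Pages occupied by compressor:             32768.
Swapins:                                    120.
Swapouts:                                   340.
"""


def parse_vm_stat(text: str) -> dict:
    """Convert vm_stat page counts to MB (assumed label-to-category mapping)."""
    m = re.search(r"page size of (\d+) bytes", text)
    page_size = int(m.group(1)) if m else 4096  # fall back if the header is missing

    counters = {}
    for line in text.splitlines():
        # Lines look like 'Pages free:    4096.' (some keys are quoted).
        m = re.match(r'^"?([^":]+)"?:\s+(\d+)\.?\s*$', line)
        if m:
            counters[m.group(1)] = int(m.group(2))

    def to_mb(key: str) -> float:
        return counters.get(key, 0) * page_size / (1024 * 1024)

    return {
        "wired_mb": to_mb("Pages wired down"),
        "active_mb": to_mb("Pages active"),
        "inactive_mb": to_mb("Pages inactive"),
        "compressed_mb": to_mb("Pages occupied by compressor"),
        "free_mb": to_mb("Pages free"),
        # Swapins/Swapouts are cumulative counters, not page counts.
        "swapins_total": counters.get("Swapins", 0),
        "swapouts_total": counters.get("Swapouts", 0),
    }


if __name__ == "__main__":
    print(parse_vm_stat(SAMPLE))
```

On a real machine the input would be the output of running `vm_stat`; rates such as swap-in/out per second would then be the difference between two consecutive samples divided by the interval.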
llmtop is a single-file PEP 723 script. With uv installed, it bootstraps its own dependencies on first run — no venv to manage.
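For readers unfamiliar with PEP 723, the sketch below shows what an inline metadata header generally looks like and how a tool can extract it. The example script, its `rich` dependency, and the `uv run --script` shebang are assumptions for illustration; they are not copied from llmtop itself. The regular expression is the one given in the PEP 723 reference implementation.

```python
from __future__ import annotations

import re

# Hypothetical script text; the dependency list is illustrative only.
EXAMPLE_SCRIPT = '''\
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "rich",
# ]
# ///
print("hello")
'''

# Regex from the PEP 723 reference implementation.
BLOCK_RE = r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"


def extract_metadata(script: str) -> str | None:
    """Return the TOML text of the 'script' metadata block, or None if absent."""
    for match in re.finditer(BLOCK_RE, script):
        if match.group("type") == "script":
            # Strip the leading '# ' (or bare '#') from every content line.
            return "".join(
                line[2:] if line.startswith("# ") else line[1:]
                for line in match.group("content").splitlines(keepends=True)
            )
    return None


if __name__ == "__main__":
    print(extract_metadata(EXAMPLE_SCRIPT))
```

A runner such as uv reads this block, creates an isolated environment with the listed dependencies, and then executes the script, which is why no manual venv is needed.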
```sh
chmod +x llmtop
./llmtop
```

Optionally, drop it on your `PATH`:

```sh
install -m 755 llmtop /usr/local/bin/llmtop
llmtop
```

Requires Python ≥ 3.11 and macOS (relies on `vm_stat` and `sysctl`). Built and tested on Apple Silicon.
```sh
./llmtop                              # live TUI, default filters, 1s refresh
./llmtop -i 0.5                       # 0.5s refresh
./llmtop -m ollama -m llama           # custom process filters (repeatable)
./llmtop --log run.csv                # also append CSV alongside the TUI
./llmtop --jsonl run.jsonl            # also append JSONL
./llmtop --no-tui --log run.csv       # headless logger (great for benchmarks)
./llmtop --pane procs                 # show only the matching-processes pane
./llmtop --pane system --pane models  # show only those two panes
```

| Flag | Description |
|---|---|
| `-i`, `--interval` | Sample interval in seconds (default `1.0`). |
| `-m`, `--match` | Substring to match against process name + cmdline. Repeatable. Defaults: `python`, `mlx`, `omlx`, `ollama`, `llama`, `lm-studio`, `lmstudio`, `lm studio`, `vllm`, `text-generation-inference`, `tgi`, `candle`. |
| `--log PATH` | Append a CSV row per sample (one column per metric). |
| `--jsonl PATH` | Append a JSON object per sample (full process + model breakdown). |
| `--no-tui` | Skip the TUI; print a one-line summary per tick. Use with `--log` / `--jsonl` for unattended runs. |
| `--pane NAME` | Show only this pane. Repeatable. Choices: `system`, `pressure`, `models`, `procs`. Defaults to all four. |
Ctrl-C exits cleanly and flushes the log files.
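The flags described above can be modelled with a small `argparse` sketch. This is an illustration of the documented interface, not llmtop's own parser; in particular, it assumes that passing `-m` replaces the default filter list rather than extending it, and `build_parser` / `parse` are names invented for this example.

```python
import argparse

DEFAULT_MATCHES = [
    "python", "mlx", "omlx", "ollama", "llama", "lm-studio", "lmstudio",
    "lm studio", "vllm", "text-generation-inference", "tgi", "candle",
]
PANES = ["system", "pressure", "models", "procs"]


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="llmtop")
    p.add_argument("-i", "--interval", type=float, default=1.0,
                   help="sample interval in seconds")
    p.add_argument("-m", "--match", action="append", default=None,
                   help="substring to match against name + cmdline (repeatable)")
    p.add_argument("--log", metavar="PATH", help="append a CSV row per sample")
    p.add_argument("--jsonl", metavar="PATH", help="append a JSON object per sample")
    p.add_argument("--no-tui", action="store_true", help="skip the TUI")
    p.add_argument("--pane", action="append", choices=PANES, default=None,
                   metavar="NAME", help="show only this pane (repeatable)")
    return p


def parse(argv: list) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # Assumption: explicit -m / --pane values replace the defaults.
    args.match = args.match or list(DEFAULT_MATCHES)
    args.pane = args.pane or list(PANES)
    return args


if __name__ == "__main__":
    print(parse(["-i", "0.5", "-m", "ollama", "--pane", "procs"]))
```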
Each CSV row written by `--log` has these columns: `ts`, `wired_mb`, `active_mb`, `compressed_mb`, `free_mb`, `swap_used_mb`, `swap_total_mb`, `swapins_total`, `swapouts_total`, `swapin_rate`, `swapout_rate`, `top_pid`, `top_name`, `top_rss_mb`.
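A log with these columns is easy to post-process. The sketch below finds the samples where swap-out activity was non-zero, using only the standard library. It assumes the file starts with a header row naming the columns above; the sample rows and the `swapout_events` helper are invented for illustration.

```python
import csv
import io

# Hypothetical log contents; real values and the ts format may differ.
SAMPLE_CSV = """\
ts,wired_mb,active_mb,compressed_mb,free_mb,swap_used_mb,swap_total_mb,swapins_total,swapouts_total,swapin_rate,swapout_rate,top_pid,top_name,top_rss_mb
1700000000.0,1024,4096,512,64,0,2048,120,340,0,0,4242,python,40000
1700000001.0,1024,4096,600,32,128,2048,120,390,0,50,4242,python,39800
"""


def swapout_events(fileobj, threshold: float = 0.0) -> list:
    """Return (ts, swapout_rate) for every sample above the threshold."""
    reader = csv.DictReader(fileobj)  # assumes a header row is present
    events = []
    for row in reader:
        rate = float(row["swapout_rate"])
        if rate > threshold:
            events.append((row["ts"], rate))  # ts is kept as an opaque string
    return events


if __name__ == "__main__":
    print(swapout_events(io.StringIO(SAMPLE_CSV)))
```

For a real run, pass an open file instead, for example `open("run.csv", newline="")`.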
Each line written by `--jsonl` is one JSON object per sample, including the full top-8 process list and every detected model with size + resident bytes. Suitable for piping into `jq`, DuckDB, or a notebook.
For each matched process, llmtop tries three strategies in order, most specific first:
1. Open weight files via `lsof`. Groups `.safetensors` / `.gguf` / `.mlx` files (≥ 50 MB) by a derived model id:
   - HuggingFace cache layout (`models--org--repo/snapshots/<hash>/<file>`) → `org/repo`.
   - Single-file ggufs → filename.
   - Otherwise → containing directory name.
2. Engine library signature + HuggingFace cache recency. When step 1 finds nothing (typical for MLX / `mlx-vlm` / `transformers` workloads, which read safetensors into the Metal heap and then close the file descriptor), the mapped libraries are scanned for `libmlx.dylib` / `mlx.metallib` / `libtorch*` / `libllama` / `libggml` / `site-packages/vllm/`. The process's loaded model is then guessed as the `~/.cache/huggingface/hub/models--*` directory whose `atime` is most recently bumped after the process started.
3. Cmdline parsing. Falls back to `--model` / `--hf-repo` / `--model-path` flags.
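The model-id derivation in the first strategy can be sketched as a pure function over a weight-file path. This is a simplified reading of the rules above, not llmtop's implementation: `model_id` is a hypothetical name, any `.gguf` path outside the HuggingFace cache is treated as a single-file gguf, and the `lsof` call and 50 MB size filter are left out.

```python
from pathlib import PurePosixPath


def model_id(path: str) -> str:
    """Derive a display id for a weight file (simplified, assumed rules)."""
    p = PurePosixPath(path)
    parts = p.parts

    # HuggingFace cache layout: models--org--repo/snapshots/<hash>/<file>
    for i, part in enumerate(parts):
        if (
            part.startswith("models--")
            and i + 1 < len(parts)
            and parts[i + 1] == "snapshots"
        ):
            # 'models--org--repo' -> 'org/repo' (only the first '--' is a separator)
            return part[len("models--"):].replace("--", "/", 1)

    # Assumption: a .gguf outside the HF cache is a single-file model.
    if p.suffix == ".gguf":
        return p.name

    # Otherwise fall back to the containing directory name.
    return p.parent.name


if __name__ == "__main__":
    print(model_id(
        "/Users/me/.cache/huggingface/hub/models--org--repo/"
        "snapshots/abc123/model.safetensors"
    ))
```

Grouping the open weight files of a process by this id and summing their sizes would then give the size-on-disk figure per model.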
Resident bytes are the process RSS (or, when known, the model's on-disk size, whichever is smaller); size on disk is the sum of weight files. Residency percent close to 100% means the model is fully paged in — anything lower (especially with active swap-outs) means parts are getting evicted.
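The arithmetic in the paragraph above is small enough to write out. The following is a minimal sketch, assuming byte counts as inputs; the `residency` function name and its return shape are made up for this example.

```python
from __future__ import annotations


def residency(rss_bytes: int, size_on_disk_bytes: int | None) -> tuple:
    """Return (resident_bytes, residency_percent or None).

    Resident bytes are capped at the model's on-disk size when it is known,
    so a process with a large non-model footprint cannot exceed 100%.
    """
    if not size_on_disk_bytes:
        # Size unknown: report raw RSS and no percentage.
        return rss_bytes, None
    resident = min(rss_bytes, size_on_disk_bytes)
    return resident, 100.0 * resident / size_on_disk_bytes


if __name__ == "__main__":
    GIB = 2**30
    print(residency(30 * GIB, 40 * GIB))  # partially resident model
    print(residency(50 * GIB, 40 * GIB))  # RSS larger than the weights
```

Because RSS covers everything the process has mapped, the capped value is an upper bound on how much of the model is resident, which is why the API-reported numbers are preferred when available.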
When ollama, omlx, or LM Studio are running, their local APIs supply authoritative numbers for the models they host.
- `top` and `htop` show RSS, but conflate the model with everything else the process is doing.
- `asitop` is great for power and frequency, but doesn't track which model weights are resident.
- `llmtop` is specifically about: is my model in unified memory, and is it staying there?
See SECURITY.md for how to report vulnerabilities.