A lean manager for local llama.cpp models — discover what's on disk, launch each with a hardware-tuned config, switch between them from a web UI, and point your agent at whichever one is running. Built for people running Hermes (or any OpenAI-compatible agent) against their own hardware — without the weight of Ollama or LM Studio.
lmm is deliberately small: it manages the llama-server instances you already run, rather than bundling its own model runtime, registry, or chat UI. If you have llama.cpp, your GGUF files, and an agent, lmm is the thin layer that ties them together.
lmm lets you:
- Discover every local model on disk, classified from its GGUF header (not from file names).
- Get a recommended
llama-serverconfig for any model — computed from the GGUF metadata and the machine's hardware (cores, usable RAM, Metal/GPU), with a fit-check that warns when a model won't comfortably fit in RAM and names the culprit. - Start / stop / switch the running model from a web UI or the CLI.
- Connect an agent (Hermes, or any OpenAI-compatible app) to the running model — one click on the host.
- Run as an always-on system service with a token-gated HTTP control plane, drivable from any machine on your LAN.
The Hermes connection. "Connect an agent" repoints Hermes at the running model — it registers your local server as a custom OpenAI-compatible provider and sets it as Hermes's default, so your agent talks to your local model and follows along when you switch. One click on the host; a single lmm bind from any other machine. (Any OpenAI-compatible app works too — the UI shows the base URL and model id to paste.)
⚠️ lmmmust run on the same machine as llama.cpp and your models. The daemon spawns and supervisesllama-serverand reads the GGUF files off local disk, so it has to live on the model host — it cannot be pointed at a remote llama.cpp. Browsers, thelmmCLI in client mode, and agents can run on any LAN machine; only the daemon is pinned to the host. See How it runs.
Status: backend, CLI, daemon, and web UI are implemented and tested. Interfaces are pre-1.0 and may change. See ARCHITECTURE.md and ROADMAP.md.
Run everything below on the machine that holds your model files and will run llama.cpp — lmm has to live there (see the co-location note above).
uv — manages Python ≥ 3.11 for you:
curl -LsSf https://astral.sh/uv/install.sh | shllama.cpp — provides llama-server, which must be on your PATH. If you don't already have it, the simplest route on macOS is Homebrew:
brew install llama.cppOn other platforms install llama.cpp any way you like (package manager or build from source); lmm only needs the llama-server binary on your PATH.
git clone https://github.com/mdorf/local-model-manager.git
cd local-model-manager
uv tool install . --compile-bytecode # installs the `lmm` command onto your PATHConfirm it's available — this should print the help:
lmm --helpIf
lmmisn't found,uv's tool directory isn't on yourPATH. Runuv tool update-shell, then open a new terminal.Why
--compile-bytecode? It pre-compiles the.pyccaches now (owned by you). Without it, the first time you run a privilegedsudo lmm …command, Python would write those caches as root into your tool directory — which then blocks lateruv tool install --force. Pre-compiling avoids that. (To reinstall the command later, useuv tool install . --force --compile-bytecode.)
These two steps (clone + uv tool install .) are the shared foundation for both ways of running the daemon below. Foreground and always-on are alternatives — pick one; the clone can stay where it is, the installed service builds its own copy.
Choose one of the two modes.
Runs in your terminal; stops when you close it. No sudo, no system changes — ideal for a first look.
lmm daemon # models in ~/models
LMM_MODELS_DIR=/path/to/models lmm daemon # or point it elsewhere (persisted after first run)Then open the web UI:
To stop: press Ctrl-C in that terminal. (That stops the daemon. Any model you started keeps running — stop it with the Stop button in the UI, or lmm stop.)
Installs the daemon as a launchd LaunchDaemon: it starts at boot, restarts on crash, and runs as you. macOS only for now (a Linux/systemd installer is on the roadmap).
Run this from inside your clone — --project-dir "$(pwd)" tells the installer where to build the daemon's copy from.
Why
sudo "$(command -v lmm)"and not justsudo lmm?sudoresetsPATHto a minimal system default, so it can't findlmm(which lives in~/.local/bin).$(command -v lmm)is expanded by your shell first, handingsudothe absolute path. The installer then locatesuv/llama-serverfor you. (If those live somewhere non-standard, fall back tosudo env "PATH=$HOME/.local/bin:/opt/homebrew/bin:$PATH" lmm install ….)
sudo "$(command -v lmm)" install --project-dir "$(pwd)" --models-dir /path/to/models# preview the exact privileged steps without changing anything:
sudo "$(command -v lmm)" install --dry-run --project-dir "$(pwd)"
# rebuild in place later — also from the clone (keeps your token + state):
sudo "$(command -v lmm)" install --reinstall --project-dir "$(pwd)"The UI is then at http://127.0.0.1:8770, with no terminal attached.
Manage the service:
lmm service status # installed? responding? (read-only — no sudo)
sudo "$(command -v lmm)" service stop # stop it (reloads on next boot)
sudo "$(command -v lmm)" service start # (re)load it
sudo "$(command -v lmm)" service restartTo stop the daemon: sudo "$(command -v lmm)" service stop. Stopping or restarting the service leaves any running model up — only the control plane bounces, and it re-adopts the model on restart. To stop the model itself, use the UI Stop button or lmm stop.
By default the daemon binds loopback (127.0.0.1), so it's reachable only on the host itself — the safe default. To drive it from another machine on your network (a laptop, say), reinstall with --host 0.0.0.0:
sudo "$(command -v lmm)" install --reinstall --host 0.0.0.0 \
--project-dir "$(pwd)" --models-dir /path/to/models
lmm token # prints the bearer token to paste on the clientThen on the other machine open http://<host-ip>:8770 (e.g. http://192.168.1.78:8770) and paste the token once. (If macOS prompts to allow incoming connections, allow it.)
Reaching the model from another machine. A model you start (or Reload) while the daemon is LAN-bound is automatically reachable at http://<host-ip>:8080 — it inherits the daemon's host and gets an inference api-key. To point an agent at it from another machine, open "Connect an agent" in the UI: it shows the exact base URL (http://<host-ip>:8080/v1), the model id, and the api-key to paste into Hermes or any OpenAI-compatible client. (A model that was already running stays on its original host until you Reload it.)
⚠️ Only do this on a network you trust. The control plane is token-gated and inference gets an--api-keyautomatically when LAN-exposed — but the daemon runs as you and spawns processes, so a compromise of the network-facing daemon carries your account's privileges. Keep it on a LAN you control; never port-forward it to the internet. See Security.
Open http://127.0.0.1:8770. On the local host the daemon injects its auth token automatically, so it just loads.
Pick a model in the sidebar to see its recommended config and RAM fit-check. You can change the context length, edit the launch flags, then Start it.
Once a model is running, watch its live logs in the drawer, and use Switch to change models or Stop to unload.
Click Connect an agent to point Hermes at the running model in one click (host only), or copy the OpenAI-compatible base URL + model id to use from any other app.
These steps remove everything lmm creates — service, CLI, state, and the Hermes binding — leaving the machine as if it were never installed. (lmm only ever reads your model files, so they're never touched.) Run them in order; steps 1–2 use lmm, so do them before step 4 removes it.
# 1. Revert the Hermes binding (restores your pre-bind config and deletes the
# backup). A no-op if you never connected an agent.
lmm unbind
# 2. Remove the always-on service — LaunchDaemon, shared state (token/venv/logs),
# the /Library/Logs dir, and the firewall rule. Skip if you only ever ran the
# foreground daemon (it installs nothing).
sudo "$(command -v lmm)" uninstall
# 3. Remove foreground/dev-mode state — only exists if you ran `lmm daemon` directly.
rm -rf "$HOME/Library/Application Support/local-model-manager"
# 4. Remove the `lmm` command.
uv tool uninstall local-model-manager
# 5. Remove the cloned source (and its build venv).
rm -rf <path-to-your-clone>After these steps, nothing of lmm remains on the machine.
lmm is one distributable with two roles: a host that manages models, and clients that drive it.
| Component | Where it runs |
|---|---|
lmm daemon (:8770) |
On the model host — it spawns llama-server and reads the GGUF files, so it must live where the models are. |
llama-server (:8080) |
On the host, spawned by the daemon. |
Web UI / lmm client |
Any LAN machine — a browser, or lmm in client mode. |
| Hermes / any agent | Anywhere — it just needs HTTP access to the host's :8080/v1. |
LAN
┌─────────────────┐ ┌─ Host ───────────────────────────────────┐
│ Browser / CLI │◀── HTTP ─▶│ lmm daemon :8770 │
│ (client) │ │ │ spawns / supervises │
└─────────────────┘ │ ▼ │
┌─────────────────┐ │ llama-server :8080 (/v1) │
│ Hermes │─── /v1 ──▶│ (inference) │
│ (any LAN box) │ │ models on disk — host-side only │
└─────────────────┘ └──────────────────────────────────────────┘
The daemon binds loopback (127.0.0.1) by default; pass --host 0.0.0.0 to expose it on the LAN (see Security).
Everything in the UI is on the CLI too. With a daemon running, serve / stop / status / switch route through it; otherwise they act locally.
lmm models --root /path/to/models # list discovered models
lmm recommend Qwen3.6-27B-Q8_0 --root /path/to/models # tuned config + fit-check
lmm serve Qwen3.6-27B-Q8_0 --root /path/to/models # start (default :8080); waits for /health + smoke test
lmm status # show managed servers
lmm switch Other-Model --root /path/to/models # stop current, start another
lmm stop --port 8080 # stop the model server (not the daemon)
lmm bind Qwen3.6-27B-Q8_0 --port 8080 # point ~/.hermes/config.yaml at the running model
lmm bind --host other-host.local --port 8080 # bind to a model on another host (omit model to auto-detect)
lmm unbind # revert from the pre-bind backupbind registers a custom provider and sets the default model, preserving your config's comments and other keys (reasoning models like Qwen3.6 want a generous max_tokens — bind prints a reminder).
The daemon serves both the web UI and a token-gated API. All endpoints require Authorization: Bearer <token> except /api/health; on loopback the UI gets the token injected, remote clients paste it once (lmm token prints it). The token doesn't expire; if it ever leaks, rotate it with lmm token --rotate (then restart the daemon and re-enter it on each client).
| Method & path | Purpose |
|---|---|
GET /api/health |
liveness (open, no auth) |
GET /api/models |
list discovered models |
GET /api/models/{name}/recommend |
recommended config + fit for a model |
GET /api/servers |
list running/managed servers |
POST /api/servers |
start a server — body { "model": "...", "port": 8080 } |
POST /api/servers/switch |
switch the running model |
DELETE /api/servers/{port} |
stop the server on a port |
GET /api/connection-info |
base URL / model / inference key for connecting an agent |
POST /api/bind |
bind the host's Hermes to the running model (loopback only) |
GET /api/bind-status |
whether the host's Hermes points at the running model |
WS /api/stream |
live server logs + status (subprotocol lmm.bearer.<token>) |
The daemon detects an already-running llama-server on startup and reflects it in the UI; stopping or restarting the daemon does not stop the model. The control API is pre-1.0 and evolving — treat it as unstable.
The daemon runs as your user (so it can read your models and bind Hermes in one click) and binds loopback by default, which keeps it low-risk for personal use.
If you expose it to the LAN (--host 0.0.0.0):
- The control plane (
:8770) is gated by a shared bearer token — it spawns processes, so the threat model is everything else on the network, not your trusted clients. - The inference plane (
:8080) is gated byllama-server --api-key(a secret distinct from the control token). - A compromise of the network-facing daemon would carry your account's privileges — a deliberate tradeoff for one-click binding and zero-setup model access.
One-click "Connect an agent" is loopback-only — the daemon can only write the host's own ~/.hermes. To connect an agent on a different machine, run lmm bind --host <host> --port 8080 … there (the UI shows the exact command).
uv run pytest -q # run the test suite
uv run ruff check . # lintMIT © 2026 Misha Dorf



