luxe

MLX-only repo maintainer for Apple Silicon. Takes any of your repos and adds features, fixes bugs, updates docs, or audits maintenance — and opens a PR.

Status: v1.8.0 shipped (2026-05-13). The cycle migrates control logic from prompts into the runtime: Track 2's pre-dispatch spec gate converts expects_zero_calls from policy-scored to capability-gated. BFCL n=1240 agent: irrelevance 100% (240/240, +9.58pp), total 90.24% (+1.85pp). SWE-bench n=75 wash with v1.7 (empty_patch ≤13 floor missed at 17; deferred to v1.9 — needs action_density gating). Track 5 taxonomy (src/luxe/agents/outcomes.py) is the observability primitive for future mechanism-level comparisons. 712 tests. v1.6.1 (substrate hardening + SpecDD Lever 2 in maintain_suite + BFCL anchors) was the previous shipped tag. See RESUME.md for active state.

Extended benchmark suite (2026-05-28): MMLU / ARC-Challenge / GSM8K / CodeNeedle / Perplexity added as a broad-capability layer on top of the agentic suite. Implementation in benchmarks/{gsm8k,codeneedle,mmlu,arc_challenge,perplexity}/; shared utilities in benchmarks/_eval_common/; suite runner at scripts/run_eval_suite.sh. See benchmarks/EXTENDED_BENCH.md for design + usage. 102 offline tests added; existing agentic-suite code paths unchanged.

What luxe does

luxe maintain <repo> "<goal>"
  ↓
single capable model + full tool surface (read / write / shell / git / search)
  ↓
agentic loop bounded by max_steps; .sdd contracts enforced tool-side
  ↓
diff-aware citation lint (zero unresolved fabrications)
  ↓
git checkout -b → commit → tests → push → gh pr create

Mono-only execution. Earlier swarm/micro/phased modes were retired (see src/luxe/luxe.sdd). The capable monolith runs the whole task; the SpecDD .sdd chain is what scales the system to large repos and constrains behavior, not multi-agent orchestration.

SpecDD `.sdd` chain

Every directory of consequence carries a <dir>/<dir>.sdd contract listing Must / Must not / Owns / Forbids / Forbids creating. The chain is walked by find_all_sdd at task start, surfaced into the prompt, and enforced by the _write_file / _edit_file tools at write time.

src/luxe/luxe.sdd — root invariants (mono-only, temp=0, pinned --work-dir, no MoE Instruct-2507, no origin/<branch> reads on offline cache)
src/luxe/agents/agents.sdd — prompt registry is the single source of truth
src/luxe/tools/tools.sdd — honesty guards, Forbids enforcement order
benchmarks/maintain_suite/maintain_suite.sdd — bench rules (vacuous_test gates, --keep-loaded, sidecar regrade)

Forbids creating (v1.6) fires only when a write would create a new file — v1.5 broad-glob path-aware semantics gave way to operation-aware semantics so legitimate edits to existing files aren't caught by scaffolding-name patterns. See RESUME.md §Architectural reframe for the full rationale.

Why MLX-only

Earlier multi-backend versions (Ollama / llama.cpp / LM Studio / oMLX / MLX) produced repeated real failures: silent context truncation, fabricated citations, model-loop bugs. luxe ships oMLX-only — every other moving part is one less thing that can lie about token budgets.

Install

python3.11 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Requires:

Python 3.11+
A running oMLX server on localhost:8000. The brew launchd unit's KeepAlive is recommended (oMLX/Metal has occasional gpu::check_error crashes; auto-restart hides them).
gh CLI authenticated (gh auth login) for the PR cycle
Apple Silicon with ≥64 GB unified memory

CLI

luxe maintain <repo> "<goal>" [--task review|implement|bugfix|document|summarize|manage]
                              [--config <path>] [--allow-dirty] [--yes]
                              [--watch-ci] [--keep-loaded]
                              [--spec-yaml <path>] [--save-report]
luxe chat   [--repo <path>] [--config <path>]   # interactive Claude-CLI-style agent
            [--chat-model/--plan-model/--code-model <id>] [--resume <id>]
luxe pr     <run-id> [--push-only]              # resume a partially-completed PR cycle
luxe runs   list | luxe runs gc                 # housekeeping
luxe unload [--except <model-id>]               # free oMLX RAM (auto-runs after maintain)
luxe serve  [--transport stdio|sse] [--unsafe]  # MCP server (read-only by default)
luxe check                                      # oMLX + models + gh auth

Examples:

# Default — single capable model, agentic loop
luxe maintain ~/code/my-app "fix the off-by-one in pagination"

# Read-only review (no PR)
luxe maintain ~/code/my-app "review the auth module for security bugs" --task review

# Resume just the PR cycle (commit / push / create / watch_ci) after auth expired
luxe pr <run-id>

Interactive chat tiers

luxe chat is the same champion (Qwen3.6-35B-A3B-6bit) wrapped in the agentic loop. How much luxe harness sits on top of the model is a spectrum controlled by the LUXE_* substrate flags (src/luxe/agents/loop.py) and the role's prompt overlay. Three shortcuts pin the useful points (the wrappers live in ~/dotfiles/bin, synced across hosts; they assume luxe at ~/Downloads/luxe, override with LUXE_HOME):

Command	Substrate (`loop.py`)	Prompt / SpecDD	Config
`luxe`	tiered-compaction on only (shipped default)	`manage_strict_only` overlay	`configs/chat.yaml`
`luxe-bare`	all interventions off	baseline prompts, no SDD	`configs/chat_bare.yaml`
`luxe-full`	all validated levers on	`manage_strict_only` overlay	`configs/chat.yaml`

luxe-bare is the "plain Claude-CLI clone" — the raw champion. It exports LUXE_TIERED_COMPACT=0 LUXE_REFLECT=0 LUXE_ADAPTIVE_POLICY=0 LUXE_WRITE_PRESSURE=0 LUXE_EARLY_BAIL=0 LUXE_PROSE_BURST=0 LUXE_ACTION_DENSITY_GATE=0 LUXE_CONVERGENCE_GATE=0 LUXE_REPROMPT_ON_DOC=0 and points chat at configs/chat_bare.yaml (identical to chat.yaml minus the task_overlay_id: manage_strict_only line → RoleConfig baseline prompts). Equivalent to luxe compare mode-1's "bare champion" side.
luxe-full flips every validated lever to 1. The three default-OFF refuted experimental flags (LUXE_RESPOND_TERMINAL, LUXE_EARLY_BAIL_TRAJECTORY_SHAPE, LUXE_EARLY_BAIL_COMMIT_ONLY) stay off.
The model weights never change — bare vs full is purely harness scaffolding.
chat starts read-only; type /write in the REPL to enable edits + bash.

Chat commands

/help lists them all. Beyond the slot/model/context controls (/model, /use, /ctx, /write, /bash, /sys, /memory, /resume, /clear):

Startup flags (so autonomous /goal users don't have to type REPL commands first): luxe chat --verbose diff|full, --show-reasoning, --no-terse, --debug (= verbose full + reasoning), and --theme auto|cool|warm|mono (curated palettes). The shipped default palette is cool, resolved --theme flag → LUXE_THEME env → cool; set LUXE_THEME=auto (or --theme auto) to track your terminal/YASL theme instead. The banner shows the build's git short-SHA so a run is traceable to a commit. /ctx huge reaches a 256K window where the box's num_ctx_max allows it (default window stays 32K).

Git & MCP: native read-only git_diff/log/show tools are always available; for richer git, add the commented git MCP server in configs/mcp.yaml (it's auto- namespaced mcp__git__<tool>). luxe has no large-repo chunking yet — it relies on BM25 search + the symbol index + tiered context compaction; repo-splitting for big refactors is future work (docs/g1-context-lifecycle-design.md).

Output verbosity is three independent toggles, not a single dial:

Command	What it does
(default)	One terse line per tool call: `→ tool(arg) ✓ <bytes>`.
`/verbose [diff\|full\|off]`	Expand tool I/O: `diff` shows `edit_file` as a highlighted unified diff, `write_file` headers, and tool result/error bodies (capped); `full` syntax-highlights whole file contents. Also renders the working-state ledger each turn. Bare `/verbose` toggles off↔diff.
`/reasoning`	Stream the model's thinking live (dim) between tool calls. Independent of `/verbose`; responsiveness tracks the backend's streaming cadence.
`/debug`	Convenience: turns on `/verbose full` + `/reasoning` together ("show me everything"); toggles both back off.
`/terse`	Toggle terse model output (default ON). Injects a "report only deltas" instruction to cut wordy prose and save tokens; never abbreviates tool output or errors.

The LUXE_* env flags + [token-progress] logging in agents/loop.py are a separate lower-level debug layer (see that module).

Command What it does

/goal <objective> · /goal stop Autonomous runner: round 1 = objective, later rounds = continue work, until the objective is reached, the round budget (20) is hit, the agent is stuck, or 3 consecutive crashes. Completion is ledger-aware: a settled round (no edits) is only DONE when the ledger corroborates (completed non-empty, in_progress cleared) for 2 rounds; settled rounds that record no new completed work trip an honest "stuck — needs a human" exit. Each round prints [goal round N/M]. Needs /write; Ctrl-C or /goal stop halts it.

/plan <objective> Draft an implementation plan read-only (no edits), then choose: save to a file (never clobbers an existing plan.md), execute it (hands off to the /goal runner with the plan as provenance context), both, or discard.

Working-state ledger. Across continue work / /goal rounds, luxe keeps a compact per-session ledger (~/.luxe/sessions/<id>/ledger.json) of decided / completed / in-progress / blocked items plus files written — injected as a <working_state> block so the model trusts known state instead of re-reading plan.md + every source each round (the dominant token sink at small context windows). The model maintains it via the update_ledger tool; files written/edited are tracked automatically.

Interrupting. Ctrl-C cancels mid-generation (not only at tool boundaries) and saves the partial turn. A long-running bash command finishes first.

Static analyzers (lint/typecheck/security_scan/deps_audit) resolve their binary via PATH → python -m → uvx (no auto-install). When a tool is genuinely unavailable they return a structured {"status":"skipped",…} result rather than an error, so a missing linter is never misread as "passed". Install the toolchain with pip install -e '.[analyzers]'.

Theme. The chat UI (tool lines, status bar, ledger, banner, prompt arrows) draws its colors from theme roles resolved in chat/theme.py, which follow your active yet-another-statusline theme (CLAUDE_STATUSLINE_THEME → ~/.claude/statusline-theme) and otherwise fall back to ANSI-named colors that track your terminal/iTerm profile — so luxe matches your terminal instead of a fixed palette.

Production model

The monolith is configured in configs/single_64gb.yaml (see the file for the exact pin and the rollback alternate). temp=0.0 is mandatory — the v1.0 variance probe showed temp=0.2 produced ±2-fixture swings between identical runs. src/luxe/luxe.sdd Forbids the MoE Instruct-2507 family (long-context fabrication + skipped optional-tool calls).

Bench overlays:

configs/single_64gb_swebench.yaml — SWE-bench A/B
configs/single_64gb_swebench_counterexample.yaml — counterexample probe

Resilience

Body-aware backend retry. Distinguishes loading / swapping / warming (retry with exponential backoff) from unavailable / crashed / oom (fail fast). 3 attempts max.
Per-stage checkpoints. Stage outputs persisted under ~/.luxe/runs/<id>/stages/. HEAD-vs-base_sha drift detection blocks accidental resume after the repo has moved.
Concurrency lock. ~/.luxe/locks/<sha256(repo_abs_path)>.lock — two parallel runs on the same repo fast-fail with the holding PID. Auto- releases on holder death.
PR step ledger. commit / test / push / create / watch_ci each checkpointed. Auth expired between push and PR-create? luxe pr <run-id> picks up where it left off.
Diff-aware citation linter. Build-breaking gate. Forgives line shifts via fuzzy snippet match within ±20 lines on edited files.
.sdd Forbids enforcement. Write tools check Forbids (any operation) and Forbids creating (creates only) against the resolved .sdd chain before mutating the tree; violations raise distinct error messages so the model can reroute rather than bail.

Repo-size scaling

Two retrieval indices, built once per session:

BM25 (bm25_search) — rank_bm25 over source files; tokenizer splits on non-alphanumerics AND camelCase. Better than grep for natural- language queries ("where is auth middleware applied?").
AST symbols (find_symbol) — tree-sitter for Python, JavaScript, TypeScript, Rust, Go. Exact lookup ("show me class UserService"). Returns a clear note pointing back to BM25 when the language isn't covered, so the agent never silently sees zero matches on Java/Ruby/etc.

The repo summary surfaces symbol_index_coverage: {language: file_count} so the model knows which queries have AST coverage and which fall back to BM25.

MCP

luxe is both an MCP client and an MCP server.

As client — opt-in via configs/mcp.yaml. Per-call timeout (30s), 3-fail circuit breaker, per-server soft cap (50/run) and global hard cap (200/run). Stdio subprocess lifetime owned by the manager; SIGTERM then SIGKILL on close.

As server — luxe serve exposes three read-only tools by default (luxe_review, luxe_summarize, luxe_explain); mutation (luxe_maintain) is gated behind THREE locks (the --unsafe flag at boot, LUXE_MCP_UNSAFE=1 env at call time, and a confirm_token matching env-set LUXE_MCP_TOKEN). Per-tool rate limits. Audit log at ~/.luxe/mcp_audit.jsonl with secrets redacted.

Benchmarks

Three benches live in benchmarks/. Use them in this order:

`maintain_suite` — 10-fixture acceptance harness

The original v1.0 gate (≥8/10) was met in v1.4 and the suite is now used as a regression / variance probe across prompt and config changes.

# Run all fixtures (resumable)
.venv/bin/python -m benchmarks.maintain_suite.run --all \
    --work-dir ~/.luxe/bench-workspace --keep-loaded

# One specific fixture
.venv/bin/python -m benchmarks.maintain_suite.run --id <fixture-id> \
    --work-dir ~/.luxe/bench-workspace --keep-loaded

# Re-run errored / skipped fixtures, or force a re-run
.venv/bin/python -m benchmarks.maintain_suite.run --all --retry-errors
.venv/bin/python -m benchmarks.maintain_suite.run --all --retry-skipped
.venv/bin/python -m benchmarks.maintain_suite.run --id <id> --force

Fixture format in benchmarks/maintain_suite/fixtures.yaml:

fixtures:
  - id: my-typo-fix
    repo_path: ~/code/my-blog
    base_sha: a1b2c3d4...
    goal: "fix the typo in the README"
    task_type: bugfix
    expected_outcome:
      kind: regex_present
      pattern: "previously misspelled phrase"

  - id: feature-rate-limit
    repo_url: https://github.com/me/some-public-repo
    base_sha: e5f6...
    goal: "add a rate limiter to the /signup endpoint"
    task_type: implement
    expected_outcome:
      kind: tests_pass
      command: "pytest tests/test_rate_limit.py -q"

expected_outcome.kind ∈ regex_present / regex_absent / tests_pass / manual_review. --variants <yaml> runs the same fixture set across multiple (prompt, config) cells; results land under acceptance/<output>/<variant_id>/<fixture_id>/ and the harness emits a comparison.json plus a printed table.

Pin --work-dir. Random tempdirs leak into prompts and dominate temp=0 variance — see project_workdir_variance_leak.

The runner has three layers of recovery — kill it any time, restart picks up:

Per-fixture status in acceptance/<id>/state.json. pending → running → done | error | skipped.
Per-stage checkpoints at ~/.luxe/runs/<run-id>/stages/. The runner calls luxe pr instead of restarting the whole pipeline when a running fixture has a saved luxe_run_id.
PR-step ledger at ~/.luxe/runs/<run-id>/pr_state.json.

`swebench` — SWE-bench Verified A/B (active v1.6 ship gate)

Pre/post-SpecDD comparison on the curated n=75 subset (benchmarks/swebench/subsets/v1_baseline_n75.json). Pre-SpecDD baseline is in acceptance/swebench/pre_specdd_v141_n75/; v1.6 v3 is the next target.

brew services restart omlx && sleep 5 && \
LUXE_LOG_TOOL_CALLS=1 OMLX_API_KEY=omlx-sdb25582k3mq8pf9 \
  .venv/bin/python -m benchmarks.swebench.run \
    --subset benchmarks/swebench/subsets/v1_baseline_n75.json \
    --output acceptance/swebench/post_specdd_v16_creation_only_n75/rep_1/

# Compare two runs (verdict deltas + escape audit)
.venv/bin/python -m benchmarks.swebench.compare_runs \
    --pre  <pre>/predictions.json --post <post>/predictions.json \
    --gold-source benchmarks/swebench/subsets/raw/verified.jsonl

# Inspect a single run (verdict tally, new_file_in_diff escapes)
.venv/bin/python -m benchmarks.swebench.smoke_inspect \
    --predictions <run>/predictions.json \
    --gold-source benchmarks/swebench/subsets/raw/verified.jsonl

The adapter binds LUXE_WRITE_PRESSURE=1 and disables commit.gpgsign automatically; no shell env munging needed beyond OMLX_API_KEY.

`bfcl` — Berkeley Function-Call Leaderboard v3

Tool-call evaluation across the Python subset (n=1240). Two anchors filed:

Run	Total	Parallel cliff (parallel / parallel_multiple)	Irrelevance	Wall
Pre-SpecDD raw (v1.4.1, 2026-05-04)	76.29%	66.00% / 49.00%	91.67%	~6.7h
Post-SpecDD raw (v1.6, 2026-05-10)	76.45%	65.50% / 48.00%	92.08%	~6.1h
Post-SpecDD agent (v1.6, 2026-05-11)	83.71%	82.50% / 64.50%	85.83%	~8.5h

Raw-mode delta v1.4.1→v1.6 is +0.16pp — no infra drift across SpecDD ship. Agent-mode adds +7.26pp over raw, with the parallel cliff lifting +16–17pp. Caveat: agent-mode irrelevance regresses −6.25pp (loop primes tool-eagerness); BFCL agent adapter does NOT wire .sdd or the Lever 1 spec validator yet, so the lift is loop-vs-single-shot, not SpecDD-driven.

.venv/bin/python -m benchmarks.bfcl.run --mode raw   --output <dir>   # ~6h
.venv/bin/python -m benchmarks.bfcl.run --mode agent --output <dir>   # ~8.5h

Layout

luxe/
├── configs/
│   ├── single_64gb.yaml                          # default mono config
│   ├── single_64gb_swebench.yaml                 # SWE-bench A/B overlay
│   ├── single_64gb_swebench_counterexample.yaml  # counterexample probe
│   ├── mcp.yaml                                  # MCP client + server policy
│   └── pr.yaml                                   # PR cycle config
├── src/luxe/
│   ├── cli.py                  # luxe maintain | pr | runs | serve | check | unload
│   ├── luxe.sdd                # root architectural contract
│   ├── pr.py                   # branch → commit → test → push → PR
│   ├── locks.py                # per-repo flock
│   ├── run_state.py            # RunSpec, stage checkpoints, PR ledger, events
│   ├── citations.py            # diff-aware citation linter
│   ├── search.py               # BM25 retrieval
│   ├── symbols.py              # tree-sitter AST symbols
│   ├── repo_index.py           # repo summary + symbol coverage
│   ├── backend.py              # oMLX client (body-aware retry)
│   ├── sdd.py                  # .sdd chain parser
│   ├── spec.py                 # SpecDD Lever 1 spec parser
│   ├── spec_resolver.py        # .sdd resolution + Forbids/Forbids-creating eval
│   ├── spec_validator.py       # spec-vs-diff per-requirement validator
│   ├── agents/                 # loop.py, single.py, prompts.py (registry)
│   ├── tools/                  # fs (write/edit + Forbids gate), git, shell, analysis, cve_lookup
│   └── mcp/                    # client, server, bridge
├── benchmarks/
│   ├── maintain_suite/         # 10-fixture acceptance harness + variants
│   ├── swebench/               # SWE-bench Verified n=75 A/B
│   └── bfcl/                   # BFCL v3 Python subset
├── tests/                      # 643 tests
├── RESUME.md                   # current project state, active task, ship gates
└── lessons.md                  # postmortems for every historical surprise

Testing

pytest tests/ -v

643 tests covering: agent loop, mono-mode prompt registry, diff-aware citation linter (incl. line-shift forgiveness on edited files), backend body-aware retry, PR cycle (preflight, empty-diff semantics, resume from each step), per-repo flock, run-state checkpointing + drift detection, MCP client (timeouts, circuit breaker, caps), MCP server (read-only-by-default, token gate, audit log), BM25 indexing, tree-sitter symbol indexing across 5 languages, repo summary coverage transparency, SpecDD .sdd parsing

resolution + Forbids / Forbids-creating semantics, SpecDD Lever 1 spec parsing + per-requirement validation, write-pressure loop, BFCL adapter + schemas + grader, SWE-bench adapter, acceptance grader scoring rules, bench runner resumption decisions.

Tests run without a live oMLX server (HTTP transport mocked).

Research notes

External-project teardowns kept for cross-pollination. These also touch sibling projects (micro-mind, mage-hands), not luxe alone.

docs/research/forge-overlap-analysis.md — forge ↔ luxe ↔ micro-mind overlap + candidate port items.
docs/research/hermes-harvest-backlog.md — Hermes Agent (Nous Research) feature backlog harvest.

Project state

RESUME.md — current state, active task, exact launch commands. Read first.
lessons.md — postmortems for every historical surprise. Read before proposing architectural changes.
CLAUDE.md — Claude Code instructions (the .sdd chain reading order, what's retired, what's bench-as-truth).

License

LICENSE

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

luxe

What luxe does

SpecDD `.sdd` chain

Why MLX-only

Install

CLI

Interactive chat tiers

Chat commands

Production model

Resilience

Repo-size scaling

MCP

Benchmarks

`maintain_suite` — 10-fixture acceptance harness

`swebench` — SWE-bench Verified A/B (active v1.6 ship gate)

`bfcl` — Berkeley Function-Call Leaderboard v3

Layout

Testing

Research notes

Project state

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 342 Commits
.claude		.claude
.github/workflows		.github/workflows
acceptance/eval_suite/baselines		acceptance/eval_suite/baselines
benchmarks		benchmarks
configs		configs
docs		docs
proposal		proposal
scripts		scripts
src/luxe		src/luxe
tests		tests
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
RESUME.md		RESUME.md
agents.md		agents.md
architecture.md		architecture.md
lessons.md		lessons.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

luxe

What luxe does

SpecDD .sdd chain

Why MLX-only

Install

CLI

Interactive chat tiers

Chat commands

Production model

Resilience

Repo-size scaling

MCP

Benchmarks

maintain_suite — 10-fixture acceptance harness

swebench — SWE-bench Verified A/B (active v1.6 ship gate)

bfcl — Berkeley Function-Call Leaderboard v3

Layout

Testing

Research notes

Project state

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

SpecDD `.sdd` chain

`maintain_suite` — 10-fixture acceptance harness

`swebench` — SWE-bench Verified A/B (active v1.6 ship gate)

`bfcl` — Berkeley Function-Call Leaderboard v3

Packages