Add MLX backend adapter for lemon-mlx-engine inference#2419

Open
bong-water-water-bong wants to merge 4 commits into lemonade-sdk:main from bong-water-water-bong:feat/mlx-backend-10.8.1

@bong-water-water-bong

Summary

Adds a new mlx recipe backend to lemond that spawns lemon-mlx-engine's OpenAI-compatible HTTP inference server as a subprocess.

Architecture

lemond spawns mlx-server as a subprocess, health-checks it, and proxies /v1/chat/completions with automatic model name → filesystem path rewriting.
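
The health-check step could look roughly like the polling loop below. This is a minimal sketch, not the code in mlx_server.cpp: `wait_until_healthy` and its `probe` callback are hypothetical names, and the probe stands in for a GET /health call against the spawned mlx-server.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Polls `probe` until it reports healthy or `timeout` elapses.
// `probe` is a hypothetical callback; in a real backend it would issue
// GET /health against the spawned subprocess and return true on success.
inline bool wait_until_healthy(const std::function<bool()>& probe,
                               std::chrono::milliseconds timeout,
                               std::chrono::milliseconds interval) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    do {
        if (probe()) {
            return true;  // server answered, safe to start proxying
        }
        std::this_thread::sleep_for(interval);
    } while (std::chrono::steady_clock::now() < deadline);
    return false;  // caller should stop the subprocess and report failure
}
```

The probe is always tried at least once, so a zero timeout still performs a single check.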

Files Changed

| File | Change |
| --- | --- |
| src/cpp/server/backends/mlx_server.cpp | New: backend adapter |
| src/cpp/include/lemon/backends/mlx_server.h | New: header |
| src/cpp/server/router.cpp | Added mlx case |
| src/cpp/server/backends/backend_utils.cpp | Recipe registration |
| src/cpp/server/system_info.cpp | GPU defs (gfx115x/110X/120X + metal) |
| CMakeLists.txt | Source file |

Verified

  • Model appears in /v1/models with recipe=mlx
  • /v1/load spawns mlx-server (704ms startup)
  • Health check passes, backend watchdog active
  • Tested on gfx1151 (Strix Halo)

jeremyfowers and others added 4 commits June 24, 2026 22:25
…de-sdk#2402)

The macOS Metal asset embeds the macOS runner version (e.g.
sd-...-bin-Darwin-macOS-15.7.7-arm64.zip), which changes on every
upstream build. PR lemonade-sdk#2102 replaced the hardcoded version with a `*`
wildcard but never added code to resolve it, so the literal `*` went
into the download URL and 404'd.

Resolve the wildcard against the GitHub Releases-by-tag API before
building the download URL. No-op (zero network cost) for any asset
name without a wildcard.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds mlx_server backend that spawns lemon-mlx-engine's OpenAI-compatible
HTTP inference server as a lemond subprocess backend.

Files:
- src/cpp/server/backends/mlx_server.cpp — backend adapter
- src/cpp/include/lemon/backends/mlx_server.h — header
- Modified CMakeLists.txt — add source file
- Modified backend_utils.cpp — register 'mlx' recipe
- Modified system_info.cpp — recipe defs (Linux ROCm + macOS Metal)

Recipe name: 'mlx'
Binary: 'mlx-server' (built from source, system package install)
Protocol: OpenAI-compatible HTTP on localhost (health at /health)

Supports: AMD GPU (gfx1150/gfx1151/gfx110X/gfx120X) on Linux, Metal on macOS

- Add MLX case in router.cpp create_backend_server()
- Include mlx_server.h in router.cpp
- Fix mlx_server.cpp to remove unused options parsing
- Add test model entry in server_models.json
- Update backend_versions.json with mlx entry

Part of integrating lemon-mlx-engine as a lemond backend.

🔴 CRITICAL fixes:
- Store model_path_ and rewrite model field in forwarded requests
  (mlx-server requires filesystem path, not public model name)
- Map responses() to /v1/chat/completions (mlx-server lacks /v1/responses)

🟡 WARNING fixes:
- Pass --no-download flag to prevent unintended HF downloads
- Propagate LD_LIBRARY_PATH and ROCm_DIR to subprocess environment
- Always inherit stderr for debugging
- Add fs::is_directory() validation for model path

Verified: lemond spawns mlx-server, health check passes (908ms startup)

github-actions bot added the enhancement (New feature or request) label Jun 25, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0609837537

"size": 19.3
},
"MLX-Llama-3.2-1B-Instruct": {
"checkpoint": "/home/bcloud/models/llama-1b",

P1: Replace the machine-local MLX checkpoint path

This built-in suggested model points at /home/bcloud/models/llama-1b, which only exists on the author's machine. In Lemonade's normal load path, the downloaded field from server_models.json is not trusted; the model manager recomputes download status and will try to treat this checkpoint as a Hugging Face repo/cache entry before failing on other users' systems. Use a portable checkpoint or leave local MLX models to user/extra-model registration.

Comment on lines +44 to +46
std::string model_path = model_info.checkpoint();
if (model_path.empty()) {
model_path = model_info.resolved_path();

P2: Prefer resolved MLX paths after downloads

For a normal registered MLX model with a Hugging Face checkpoint, checkpoint() stays as the repo id while resolved_path() is the local cache directory populated by the model manager. Because this code always uses the non-empty checkpoint first, a successfully pulled MLX model still checks fs::exists("org/repo") and fails to load. Prefer resolved_path() when it is populated, and only fall back to checkpoint() for explicit local-path registrations.
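
The suggested ordering can be sketched as follows. `ModelPaths` and `select_mlx_model_path` are hypothetical stand-ins for illustration only; the real model-info type in lemond is not shown in this excerpt, so only the two accessors named above are modeled.

```cpp
#include <string>

// Hypothetical stand-in for the two values discussed in the review.
struct ModelPaths {
    std::string checkpoint;     // repo id such as "org/repo", or a local path
    std::string resolved_path;  // local cache directory, empty until resolved
};

// Prefer the resolved local directory. Fall back to the raw checkpoint only
// when nothing has been resolved (explicit local-path registrations).
inline std::string select_mlx_model_path(const ModelPaths& model) {
    return !model.resolved_path.empty() ? model.resolved_path
                                        : model.checkpoint;
}
```

The existence and directory checks would then run on the returned value rather than on the repo id.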

Comment on lines +123 to +126
if (!model_path_.empty()) {
modified["model"] = model_path_;
}
return forward_request("/v1/chat/completions", modified);

P2: Rewrite MLX model fields for streaming requests

This model-path rewrite only runs for non-streaming calls. With stream: true, the server uses Router::chat_completion_stream/completion_stream, which forwards the original request body directly through forward_streaming_request; mlx-server then receives the public Lemonade model name instead of the filesystem path it requires. Override forward_streaming_request for MLX and apply the same model_path_ rewrite before forwarding.
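
One way to keep the two paths from drifting apart is to route both through a single rewrite helper. The sketch below assumes a plain string map as a stand-in for the JSON body, and `MlxForwarder` with its methods is a hypothetical type, not the backend's real class.

```cpp
#include <map>
#include <string>

// Stand-in for a parsed JSON request body; a real implementation would use
// whatever JSON type the backend already works with.
using RequestBody = std::map<std::string, std::string>;

// Returns a copy of `body` whose "model" field holds the filesystem path.
// An empty `model_path` leaves the request untouched.
inline RequestBody with_model_path(RequestBody body,
                                   const std::string& model_path) {
    if (!model_path.empty()) {
        body["model"] = model_path;
    }
    return body;
}

// Hypothetical backend showing both entry points sharing the same rewrite.
struct MlxForwarder {
    std::string model_path;
    RequestBody last_forwarded;  // records what would be sent upstream

    void chat_completion(const RequestBody& request) {
        last_forwarded = with_model_path(request, model_path);
    }
    void chat_completion_stream(const RequestBody& request) {
        last_forwarded = with_model_path(request, model_path);
    }
};
```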

if (!model_path_.empty()) {
modified["model"] = model_path_;
}
return forward_request("/v1/chat/completions", modified);

P2: Translate or reject MLX Responses API requests

A /v1/responses request uses the Responses schema (input, max_output_tokens, response-shaped output), not the Chat Completions schema (messages, chat-shaped output). Forwarding it unchanged to /v1/chat/completions makes OpenAI responses.create(...) calls fail or return the wrong response shape for MLX models; the streaming path also still targets /v1/responses, which this backend says mlx-server lacks. Either perform a full request/response translation or return an unsupported-operation error.
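
The "reject" option might look like the sketch below. `BackendReply` and `reject_responses_request` are hypothetical, and both the 501 status and the error field values are assumptions that loosely follow the OpenAI-style error envelope; the backend's real return type is not shown in this excerpt.

```cpp
#include <string>

// Minimal reply type for illustration; not the backend's real return type.
struct BackendReply {
    int status;
    std::string body;
};

// Returns an explicit "unsupported" error instead of forwarding a
// Responses-schema body to /v1/chat/completions.
// The status code and error fields are assumptions for this sketch.
inline BackendReply reject_responses_request() {
    return BackendReply{
        501,
        R"({"error":{"message":"The mlx backend does not support /v1/responses",)"
        R"("type":"unsupported_operation"}})"};
}
```

Returning the same error from the streaming entry point would keep both paths consistent.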

}
"clear_bin_if_lemonade_below": "9.4.0",
"mlx": {
"system": "26aad7e"

P2: Avoid pinning a version for the system MLX backend

For a system backend, get_recipe_version() returns unknown because there is no Lemonade-managed version.txt, but this new expected version makes system_info mark an otherwise available mlx-server as update_required. The advertised action is lemonade backends install mlx:system, which BackendManager::install_backend() immediately no-ops for system, so users are left with a permanently update-required backend even when the binary is in PATH.
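
The failure mode can be pictured with a simplified model of the comparison. This is not the logic in system_info.cpp; `BackendState` and `classify_backend` are hypothetical names, and the sketch assumes that an empty expected version means "nothing pinned".

```cpp
#include <string>

enum class BackendState { Installed, UpdateRequired };

// Simplified model: compare versions only when an expected version is pinned.
inline BackendState classify_backend(const std::string& installed_version,
                                     const std::string& expected_version) {
    if (expected_version.empty()) {
        return BackendState::Installed;  // nothing pinned, nothing to compare
    }
    return installed_version == expected_version
               ? BackendState::Installed
               : BackendState::UpdateRequired;
}
```

With the pin in place, a system install that reports "unknown" can never match "26aad7e", so it stays in the update-required state; without the pin, the same binary is simply treated as installed.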

#endif

// Start process (always inherit stderr for debugging)
process_handle_ = ProcessManager::start_process(executable, args, env_vars, true, true);

P2: Pass MLX environment overrides via env_vars

Here env_vars is passed as the third argument, but ProcessManager::start_process() treats that parameter as working_dir; the actual environment override parameter is the sixth vector<pair<...>>. On Linux installs that rely on LD_LIBRARY_PATH or ROCm_DIR for ROCm libraries, the spawned mlx-server will not receive those values and may fail to start even though lemond's environment is correct.
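
Building the override list in the vector-of-pairs shape could look like this. `collect_env_overrides` is a hypothetical helper; the variable names a caller would pass (LD_LIBRARY_PATH, ROCm_DIR) come from the commit message. The call to ProcessManager::start_process() is deliberately left out because its full signature is not shown in this excerpt.

```cpp
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

using EnvOverrides = std::vector<std::pair<std::string, std::string>>;

// Copies each named variable that is set in the current process into a
// name/value list. Unset variables are skipped rather than passed as empty.
inline EnvOverrides collect_env_overrides(
    const std::vector<std::string>& names) {
    EnvOverrides overrides;
    for (const auto& name : names) {
        if (const char* value = std::getenv(name.c_str())) {
            overrides.emplace_back(name, value);
        }
    }
    return overrides;
}
```

A caller would pass something like `collect_env_overrides({"LD_LIBRARY_PATH", "ROCm_DIR"})` in the environment-override position, not the working-directory position.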

if (process_handle_.pid > 0) {
#endif
ProcessManager::stop_process(process_handle_);
process_handle_ = {nullptr, 0};

P2: Use WrappedServer handle helpers during MLX unload

MLX directly reads and writes process_handle_ while the backend watchdog/status paths access the same field through process_mutex_, and unload() does not stop the watchdog before killing the process. Once wait_for_ready() has started the watchdog, unloading an MLX model can race with get_process_handle_snapshot()/watchdog reset logic and expose a stale PID; use set_process_handle()/consume_process_handle_for_cleanup() and stop the watchdog like the other subprocess backends.
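
The pattern behind those helpers can be sketched with a mutex-guarded handle. `GuardedProcessHandle` is a hypothetical class written for illustration, not WrappedServer's implementation, and the two-field `ProcessHandle` layout is inferred from the `{nullptr, 0}` reset in the diff.

```cpp
#include <mutex>
#include <utility>

// Two-field handle, inferred from `process_handle_ = {nullptr, 0}`.
struct ProcessHandle {
    void* handle;
    int pid;
};

// Every read and write goes through one mutex, so a watchdog thread never
// observes a half-updated or stale handle.
class GuardedProcessHandle {
public:
    void set(ProcessHandle h) {
        std::lock_guard<std::mutex> lock(mutex_);
        handle_ = h;
    }
    ProcessHandle snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return handle_;
    }
    // Atomically takes ownership of the handle and leaves an empty one
    // behind, so only one caller can ever stop the process.
    ProcessHandle consume_for_cleanup() {
        std::lock_guard<std::mutex> lock(mutex_);
        return std::exchange(handle_, ProcessHandle{nullptr, 0});
    }

private:
    mutable std::mutex mutex_;
    ProcessHandle handle_{nullptr, 0};
};
```

An unload path built on this would stop the watchdog first, then call `consume_for_cleanup()` and stop the process only if the returned pid is non-zero.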

@jeremyfowers

Member

keep in mind #2287 is coming
