Add MLX backend adapter for lemon-mlx-engine inference #2419
bong-water-water-bong wants to merge 4 commits into
…de-sdk#2402)

The macOS Metal asset name embeds the macOS runner version (e.g. sd-...-bin-Darwin-macOS-15.7.7-arm64.zip), which changes on every upstream build. PR lemonade-sdk#2102 replaced the hardcoded version with a `*` wildcard but never added code to resolve it, so the literal `*` went into the download URL and the request returned a 404.

Resolve the wildcard against the GitHub Releases-by-tag API before building the download URL. This is a no-op (zero network cost) for any asset name without a wildcard.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
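The wildcard resolution described in this commit message could look roughly like the sketch below. It assumes the asset names of the release have already been fetched and are passed in as a list; `wildcard_match`, `resolve_asset_name`, and any asset names used with them are hypothetical and not taken from the actual change.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: returns true if `name` matches `pattern`, where `*`
// matches any run of characters (including an empty one).
bool wildcard_match(const std::string& pattern, const std::string& name) {
    std::size_t p = 0, n = 0;
    std::size_t star = std::string::npos;  // position of the last `*` seen
    std::size_t mark = 0;                  // name position that `*` currently covers up to
    while (n < name.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            star = p++;
            mark = n;
        } else if (p < pattern.size() && pattern[p] == name[n]) {
            ++p;
            ++n;
        } else if (star != std::string::npos) {
            // Backtrack: let the last `*` swallow one more character.
            p = star + 1;
            n = ++mark;
        } else {
            return false;
        }
    }
    while (p < pattern.size() && pattern[p] == '*') ++p;
    return p == pattern.size();
}

// Hypothetical helper: resolves `pattern` against the asset names of a release.
// Names without a wildcard are returned unchanged, so no lookup is needed for them.
std::optional<std::string> resolve_asset_name(
        const std::string& pattern,
        const std::vector<std::string>& release_assets) {
    if (pattern.find('*') == std::string::npos) {
        return pattern;
    }
    for (const auto& asset : release_assets) {
        if (wildcard_match(pattern, asset)) {
            return asset;
        }
    }
    return std::nullopt;  // caller should report an error instead of downloading
}
```

Returning an empty optional when nothing matches lets the caller fail with a clear message rather than building a URL that still contains the literal `*`.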
Adds an mlx_server backend that spawns lemon-mlx-engine's OpenAI-compatible HTTP inference server as a lemond subprocess backend.

Files:
- src/cpp/server/backends/mlx_server.cpp: backend adapter
- src/cpp/include/lemon/backends/mlx_server.h: header
- Modified CMakeLists.txt: add source file
- Modified backend_utils.cpp: register 'mlx' recipe
- Modified system_info.cpp: recipe defs (Linux ROCm + macOS Metal)

Recipe name: 'mlx'
Binary: 'mlx-server' (built from source, system package install)
Protocol: OpenAI-compatible HTTP on localhost (health at /health)
Supports: AMD GPU (gfx1150/gfx1151/gfx110X/gfx120X) on Linux, Metal on macOS
- Add MLX case in router.cpp create_backend_server()
- Include mlx_server.h in router.cpp
- Fix mlx_server.cpp to remove unused options parsing
- Add test model entry in server_models.json
- Update backend_versions.json with mlx entry

Part of integrating lemon-mlx-engine as a lemond backend.
🔴 CRITICAL fixes:
- Store model_path_ and rewrite the model field in forwarded requests (mlx-server requires a filesystem path, not the public model name)
- Map responses() to /v1/chat/completions (mlx-server lacks /v1/responses)

🟡 WARNING fixes:
- Pass --no-download flag to prevent unintended HF downloads
- Propagate LD_LIBRARY_PATH and ROCm_DIR to subprocess environment
- Always inherit stderr for debugging
- Add fs::is_directory() validation for model path

Verified: lemond spawns mlx-server, health check passes (908ms startup)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0609837537
            "size": 19.3
        },
        "MLX-Llama-3.2-1B-Instruct": {
            "checkpoint": "/home/bcloud/models/llama-1b",
Replace the machine-local MLX checkpoint path
This built-in suggested model points at /home/bcloud/models/llama-1b, which only exists on the author's machine. In Lemonade's normal load path, the `downloaded` field from server_models.json is not trusted; the model manager recomputes download status and will try to treat this checkpoint as a Hugging Face repo/cache entry before failing on other users' systems. Use a portable checkpoint, or leave local MLX models to user/extra-model registration.
    std::string model_path = model_info.checkpoint();
    if (model_path.empty()) {
        model_path = model_info.resolved_path();
Prefer resolved MLX paths after downloads
For a normal registered MLX model with a Hugging Face checkpoint, checkpoint() stays as the repo id while resolved_path() is the local cache directory populated by the model manager. Because this code always uses the non-empty checkpoint first, a successfully pulled MLX model still checks fs::exists("org/repo") and fails to load. Prefer resolved_path() when it is populated, and only fall back to checkpoint() for explicit local-path registrations.
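A minimal sketch of the suggested ordering follows. `ModelPaths` and `pick_mlx_model_path` are hypothetical stand-ins for the model info object and the selection logic; only the `checkpoint` and `resolved_path` concepts come from the review comment.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the two values the review discusses.
struct ModelPaths {
    std::string checkpoint;     // repo id such as "org/repo", or an explicit local path
    std::string resolved_path;  // local cache directory, empty if nothing was resolved
};

// Prefer the resolved local directory; fall back to the checkpoint only when
// no resolved path is available (e.g. an explicit local-path registration).
std::string pick_mlx_model_path(const ModelPaths& model) {
    if (!model.resolved_path.empty()) {
        return model.resolved_path;
    }
    return model.checkpoint;
}
```

With this ordering, a pulled model whose checkpoint is still a repo id loads from its cache directory, while a locally registered model keeps using the path it was registered with.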
    if (!model_path_.empty()) {
        modified["model"] = model_path_;
    }
    return forward_request("/v1/chat/completions", modified);
Rewrite MLX model fields for streaming requests
This model-path rewrite only runs for non-streaming calls. With stream: true, the server uses Router::chat_completion_stream/completion_stream, which forwards the original request body directly through forward_streaming_request; mlx-server then receives the public Lemonade model name instead of the filesystem path it requires. Override forward_streaming_request for MLX and apply the same model_path_ rewrite before forwarding.
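One way to keep the two paths in sync is to route both through a single rewrite helper. The sketch below assumes a plain string map as a stand-in for the JSON request body; `Body`, `rewrite_model_field`, and `MlxAdapterSketch` are hypothetical names and do not reflect the real WrappedServer interface.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for the JSON request body (assumption: the real code uses a JSON type).
using Body = std::map<std::string, std::string>;

// Replace the public model name with the filesystem path, if one is known.
Body rewrite_model_field(Body request, const std::string& model_path) {
    if (!model_path.empty()) {
        request["model"] = model_path;
    }
    return request;
}

// Hypothetical adapter: both entry points apply the same rewrite before forwarding.
struct MlxAdapterSketch {
    std::string model_path;
    Body last_forwarded;  // records what the backend would have received

    void chat_completion(const Body& request) {
        last_forwarded = rewrite_model_field(request, model_path);
    }

    void chat_completion_stream(const Body& request) {
        last_forwarded = rewrite_model_field(request, model_path);
    }
};
```

Sharing the helper means a future change to the rewrite rule cannot silently apply to only one of the two request paths.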
    if (!model_path_.empty()) {
        modified["model"] = model_path_;
    }
    return forward_request("/v1/chat/completions", modified);
Translate or reject MLX Responses API requests
A /v1/responses request uses the Responses schema (input, max_output_tokens, response-shaped output), not the Chat Completions schema (messages, chat-shaped output). Forwarding it unchanged to /v1/chat/completions makes OpenAI responses.create(...) calls fail or return the wrong response shape for MLX models; the streaming path also still targets /v1/responses, which this backend says mlx-server lacks. Either perform a full request/response translation or return an unsupported-operation error.
    }
    "clear_bin_if_lemonade_below": "9.4.0",
    "mlx": {
        "system": "26aad7e"
Avoid pinning a version for the system MLX backend
For a system backend, get_recipe_version() returns unknown because there is no Lemonade-managed version.txt, but this new expected version makes system_info mark an otherwise available mlx-server as update_required. The advertised action is lemonade backends install mlx:system, which BackendManager::install_backend() immediately no-ops for system, so users are left with a permanently update-required backend even when the binary is in PATH.
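The interaction can be illustrated with a small sketch. It assumes the status check compares the installed version string against the pinned one and treats an empty pin as "no expectation"; `BackendState` and `classify_backend` are hypothetical names, not Lemonade's actual API.

```cpp
#include <cassert>
#include <string>

enum class BackendState { Installed, UpdateRequired };

// Assumed comparison rule: an empty expected version means nothing is pinned,
// otherwise the installed version must match the pin exactly.
BackendState classify_backend(const std::string& installed_version,
                              const std::string& expected_version) {
    if (expected_version.empty()) {
        return BackendState::Installed;
    }
    return installed_version == expected_version ? BackendState::Installed
                                                 : BackendState::UpdateRequired;
}
```

Under this assumption, a system backend that always reports `unknown` can never satisfy a pin such as `26aad7e`, which is why dropping the pin is the simpler fix.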
    #endif

    // Start process (always inherit stderr for debugging)
    process_handle_ = ProcessManager::start_process(executable, args, env_vars, true, true);
Pass MLX environment overrides via env_vars
Here env_vars is passed as the third argument, but ProcessManager::start_process() treats that parameter as working_dir; the actual environment override parameter is the sixth vector<pair<...>>. On Linux installs that rely on LD_LIBRARY_PATH or ROCm_DIR for ROCm libraries, the spawned mlx-server will not receive those values and may fail to start even though lemond's environment is correct.
    if (process_handle_.pid > 0) {
    #endif
        ProcessManager::stop_process(process_handle_);
        process_handle_ = {nullptr, 0};
Use WrappedServer handle helpers during MLX unload
MLX directly reads and writes process_handle_ while the backend watchdog/status paths access the same field through process_mutex_, and unload() does not stop the watchdog before killing the process. Once wait_for_ready() has started the watchdog, unloading an MLX model can race with get_process_handle_snapshot()/watchdog reset logic and expose a stale PID; use set_process_handle()/consume_process_handle_for_cleanup() and stop the watchdog like the other subprocess backends.
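The pattern the comment refers to can be sketched as a handle that is only ever touched under a mutex. The helper names below mirror those mentioned in the review, but their signatures, the `ProcessHandle` layout, and the `HandleOwner` class are assumptions made for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <utility>

// Assumed layout, based on the `{nullptr, 0}` reset in the snippet above.
struct ProcessHandle {
    void* handle;
    int pid;
};

class HandleOwner {
public:
    // Store a new handle under the lock.
    void set_process_handle(ProcessHandle h) {
        std::lock_guard<std::mutex> lock(process_mutex_);
        process_handle_ = h;
    }

    // Return a copy so callers never read the field while it is being changed.
    ProcessHandle get_process_handle_snapshot() const {
        std::lock_guard<std::mutex> lock(process_mutex_);
        return process_handle_;
    }

    // Atomically take the handle and leave an empty one behind, so only one
    // caller can ever stop the process.
    ProcessHandle consume_process_handle_for_cleanup() {
        std::lock_guard<std::mutex> lock(process_mutex_);
        return std::exchange(process_handle_, ProcessHandle{nullptr, 0});
    }

private:
    mutable std::mutex process_mutex_;
    ProcessHandle process_handle_{nullptr, 0};
};
```

In an unload path, the watchdog would be stopped first, then the consumed handle passed to the process-stopping call; a second concurrent unload would receive an empty handle and have nothing to kill.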
Keep in mind #2287 is coming.
Summary
Adds a new `mlx` recipe backend to lemond that spawns lemon-mlx-engine's OpenAI-compatible HTTP inference server as a subprocess.
Architecture
lemond spawns mlx-server as a subprocess, health-checks it, and proxies /v1/chat/completions with automatic model name → filesystem path rewriting.
Files Changed
Verified