Fix - Model Download Race Condition #2305

Open
VultureZZ wants to merge 2 commits into lemonade-sdk:main from VultureZZ:pr/download-sync

Conversation

@VultureZZ

This commit introduces a mechanism to ensure that only one download operation occurs at a time for each model. It adds the `awaitExistingModelDownload` function to wait for any in-progress downloads before initiating a new one, preventing potential conflicts and data corruption. Additionally, a mutex-based locking system is implemented in the `ModelManager` and `HttpClient` to manage concurrent download requests effectively. This enhancement improves the reliability of model downloads across multiple requests.

VultureZZ and others added 2 commits June 18, 2026 09:34
…ace conditions

@github-actions github-actions Bot added the bug Something isn't working label Jun 18, 2026

@fl0rianr fl0rianr left a comment

Collaborator

Thanks for working on this — the direction looks good, especially the path-level serialization in HttpClient and the per-model lock in ModelManager.

I think this still needs changes before merge; I have noted them below.

Suggested fixes:

  • Make /load and collection component auto-downloads cache-first where appropriate, e.g. download_registered_model(info, true).
  • Use the same canonical model identity for server download job keys and frontend active-download matching.
  • Add a regression test for /pull + /load concurrency and, ideally, for alias vs. canonical model names.

std::lock_guard<std::mutex> download_guard(*model_lock);

// Another caller may have finished while we waited for the model lock.
if (do_not_upgrade && is_model_downloaded(info.model_name)) {

This post-lock re-check only catches cache-first callers. /load still calls download_registered_model(info) with the default do_not_upgrade=false, so a /load request queued behind an in-flight /pull can wait on this mutex and then still proceed into the download/update path. Could we either make /load call download_registered_model(info, true) or make this guard explicitly skip when the model became downloaded while waiting?

This matters because handle_load() currently downloads missing models with download_registered_model(info) and no do_not_upgrade=true.
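
A minimal sketch of the second option (skip when the model became downloaded while the caller was waiting), not the PR's actual code. `DownloadGate`, `should_skip_download`, and the `present` flag are hypothetical stand-ins for the `ModelManager` lock and `is_model_downloaded()`; the idea is to snapshot the downloaded state before queueing on the mutex and compare it with the state after acquiring it.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical helper: decide whether a caller that just acquired the model
// lock should skip the download/update path.
inline bool should_skip_download(bool do_not_upgrade,
                                 bool present_before_wait,
                                 bool present_after_wait) {
    // Cache-first callers never re-download a model that is already present.
    if (do_not_upgrade && present_after_wait) return true;
    // The model appeared while we waited: another caller just fetched it.
    if (!present_before_wait && present_after_wait) return true;
    return false;
}

// Hypothetical stand-in for the per-model download path.
struct DownloadGate {
    std::mutex model_lock;
    std::atomic<bool> present{false};        // stand-in for is_model_downloaded()
    std::atomic<int> downloads_started{0};

    // Returns true if this call actually performed a download.
    bool download(bool do_not_upgrade) {
        const bool before = present.load();   // snapshot before queueing
        std::lock_guard<std::mutex> guard(model_lock);
        const bool after = present.load();    // re-check under the lock
        if (should_skip_download(do_not_upgrade, before, after)) return false;
        ++downloads_started;                  // the real download would run here
        present.store(true);
        return true;
    }
};
```

With this shape, a /load queued behind an in-flight /pull would skip the download even with `do_not_upgrade=false`, while an explicit upgrade of a model that was already present before the wait would still proceed.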

Comment thread src/cpp/server/server.cpp
auto operation = [this, model_name, request_json, do_not_upgrade](DownloadProgressCallback progress_cb) {
    model_manager_->download_model(model_name, request_json, do_not_upgrade, progress_cb);
};
auto job = start_download_job("model:" + model_name, "model", model_name, operation);

This job key uses the raw request model_name, while the ModelManager download lock uses resolve_model_name(model_name). Alias vs canonical requests for the same logical model can therefore create separate server job IDs/UI rows even though they serialize lower down. Could we key the download job with the same resolved model name used by get_model_download_lock()?
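
As a rough illustration of keying the job by the resolved name, assuming an alias table is available at the call site: `resolve_model_name_sketch`, `make_download_job_key`, and the model names below are hypothetical, and the real `resolve_model_name()` may work differently.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical alias lookup: returns the canonical name for a known alias,
// or the input unchanged when it is not an alias.
inline std::string resolve_model_name_sketch(
        const std::map<std::string, std::string>& aliases,
        const std::string& requested) {
    auto it = aliases.find(requested);
    return it == aliases.end() ? requested : it->second;
}

// Build the download job key from the canonical name, so alias and canonical
// requests for the same logical model map to a single job.
inline std::string make_download_job_key(
        const std::map<std::string, std::string>& aliases,
        const std::string& requested) {
    return "model:" + resolve_model_name_sketch(aliases, requested);
}
```

Two requests that differ only by alias then produce the same key, which is what lets the job layer dedupe them instead of relying on the lower-level lock alone.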


const serverDownloads = await downloadTracker.hydrateFromServer();
const active = serverDownloads.find(
item => item.model_name === modelName &&

Same alias/canonicalization issue on the client side: this only detects an existing server download if item.model_name exactly equals the caller’s modelName. If another caller started the same model through a different alias, ensureModelReady() can miss the active job and start another /pull. Could we normalize model names here or have the server snapshot expose a canonical model ID to compare against?

if (!isDownloaded) {
await pullModel(modelName, { declaredSizeGB: modelsData[modelName]?.size });
if (downloadTracker.isActive(modelName) ||
await downloadTracker.hasActiveServerDownload(modelName)) {

This pre-flight check has the same exact-match limitation as awaitExistingModelDownload(). If server-side dedupe is intended to be “one download per logical model”, the client check should use the same identity semantics as the server lock, not only the raw UI/request name.

const std::map<std::string, std::string>& headers,
const DownloadOptions& options) {
auto path_lock = g_path_download_locks.acquire(output_path);
std::lock_guard<std::mutex> path_guard(*path_lock);

This path-level lock is a good last line of defense against .partial corruption. Given that the higher-level model/job dedupe can still miss alias/canonical cases, could we add a regression test that races two callers against the same output path and verifies that the partial file is not concurrently written/corrupted?
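
One possible shape for such a regression test, sketched under assumptions: `PathLockRegistry` is a hypothetical stand-in for whatever backs `g_path_download_locks`, and the test only checks mutual exclusion (at most one caller inside the critical section per path) rather than inspecting a real `.partial` file.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical per-path lock registry: one shared mutex per output path.
class PathLockRegistry {
public:
    std::shared_ptr<std::mutex> acquire(const std::string& path) {
        std::lock_guard<std::mutex> guard(registry_mutex_);
        auto& slot = locks_[path];
        if (!slot) slot = std::make_shared<std::mutex>();
        return slot;
    }
private:
    std::mutex registry_mutex_;
    std::unordered_map<std::string, std::shared_ptr<std::mutex>> locks_;
};

// Race `callers` threads against the same path and report the highest number
// of threads observed inside the critical section at the same time.
inline int max_concurrent_writers(PathLockRegistry& registry,
                                  const std::string& path, int callers) {
    std::atomic<int> inside{0};
    std::atomic<int> peak{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < callers; ++i) {
        threads.emplace_back([&] {
            auto path_lock = registry.acquire(path);
            std::lock_guard<std::mutex> path_guard(*path_lock);
            const int now = ++inside;
            int seen = peak.load();
            // Record a new peak if this thread observed more writers.
            while (now > seen && !peak.compare_exchange_weak(seen, now)) {}
            // Simulate time spent writing the partial file.
            std::this_thread::sleep_for(std::chrono::milliseconds(2));
            --inside;
        });
    }
    for (auto& t : threads) t.join();
    return peak.load();
}
```

A peak of 1 means the writers were serialized. A fuller test could have each caller write a distinct byte pattern to the same output path and verify that the final file matches exactly one caller's pattern.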
