Fix - Model Download Race Condition#2305
Conversation
…ace conditions This commit introduces a mechanism to ensure that only one download operation occurs at a time for each model. It adds the `awaitExistingModelDownload` function to wait for any in-progress downloads before initiating a new one, preventing potential conflicts and data corruption. Additionally, a mutex-based locking system is implemented in the `ModelManager` and `HttpClient` to manage concurrent download requests effectively. This enhancement improves the reliability of model downloads across multiple requests.
fl0rianr
left a comment
There was a problem hiding this comment.
Thanks for working on this — the direction looks good, especially the path-level serialization in HttpClient and the per-model lock in ModelManager.
I think this still needs changes before merge, I noted them below.
Suggested fixes:
- Make /load and collection component auto-downloads cache-first where appropriate, e.g. download_registered_model(info, true).
- Use the same canonical model identity for server download job keys and frontend active-download matching.
- Add a regression test for /pull + /load concurrency and, ideally, alias/canonical model names
| std::lock_guard<std::mutex> download_guard(*model_lock); | ||
|
|
||
| // Another caller may have finished while we waited for the model lock. | ||
| if (do_not_upgrade && is_model_downloaded(info.model_name)) { |
There was a problem hiding this comment.
This post-lock re-check only catches cache-first callers. /load still calls download_registered_model(info) with the default do_not_upgrade=false, so a /load request queued behind an in-flight /pull can wait on this mutex and then still proceed into the download/update path. Could we either make /load call download_registered_model(info, true) or make this guard explicitly skip when the model became downloaded while waiting?
This matters because handle_load() currently downloads missing models with download_registered_model(info) and no do_not_upgrade=true.
| auto operation = [this, model_name, request_json, do_not_upgrade](DownloadProgressCallback progress_cb) { | ||
| model_manager_->download_model(model_name, request_json, do_not_upgrade, progress_cb); | ||
| }; | ||
| auto job = start_download_job("model:" + model_name, "model", model_name, operation); |
There was a problem hiding this comment.
This job key uses the raw request model_name, while the ModelManager download lock uses resolve_model_name(model_name). Alias vs canonical requests for the same logical model can therefore create separate server job IDs/UI rows even though they serialize lower down. Could we key the download job with the same resolved model name used by get_model_download_lock()?
|
|
||
| const serverDownloads = await downloadTracker.hydrateFromServer(); | ||
| const active = serverDownloads.find( | ||
| item => item.model_name === modelName && |
There was a problem hiding this comment.
Same alias/canonicalization issue on the client side: this only detects an existing server download if item.model_name exactly equals the caller’s modelName. If another caller started the same model through a different alias, ensureModelReady() can miss the active job and start another /pull. Could we normalize model names here or have the server snapshot expose a canonical model ID to compare against?
| if (!isDownloaded) { | ||
| await pullModel(modelName, { declaredSizeGB: modelsData[modelName]?.size }); | ||
| if (downloadTracker.isActive(modelName) || | ||
| await downloadTracker.hasActiveServerDownload(modelName)) { |
There was a problem hiding this comment.
This pre-flight check has the same exact-match limitation as awaitExistingModelDownload(). If server-side dedupe is intended to be “one download per logical model”, the client check should use the same identity semantics as the server lock, not only the raw UI/request name.
| const std::map<std::string, std::string>& headers, | ||
| const DownloadOptions& options) { | ||
| auto path_lock = g_path_download_locks.acquire(output_path); | ||
| std::lock_guard<std::mutex> path_guard(*path_lock); |
There was a problem hiding this comment.
This path-level lock is a good last line of defense against .partial corruption. Given that the higher-level model/job dedupe can still miss alias/canonical cases, could we add a regression test that races two callers against the same output path and verifies that the partial file is not concurrently written/corrupted?
This commit introduces a mechanism to ensure that only one download operation occurs at a time for each model. It adds the
awaitExistingModelDownloadfunction to wait for any in-progress downloads before initiating a new one, preventing potential conflicts and data corruption. Additionally, a mutex-based locking system is implemented in theModelManagerandHttpClientto manage concurrent download requests effectively. This enhancement improves the reliability of model downloads across multiple requests.