FIX: Parallelize DKI Fit/Predict for Usable Multi-Shell Motion Estimation by oesteban · Pull Request #443 · nipreps/nifreeze

oesteban · 2026-06-09T20:35:28Z

Summary

Used as the head-motion/eddy estimator for dMRI, the DKI model was effectively unusable on real multi-shell data (~8.6 vol/h ⇒ ~32 h/series, appearing hung), while DTI on the same data runs in minutes. This PR makes DKI parallelize correctly, recovering a ~5.8× speedup (more with more workers) with numerically identical results.

Resolves: #442

Root cause

DKI was hard-routed onto the serial full-brain fit path (if n_jobs == 1 or is_dki:), so per-volume cost grew linearly with voxel count while DTI's np.array_split + joblib path stayed flat (measured 6×→40×→119× slower as voxels grew 2k→20k→80k). The serial special-case existed for a real reason: DKI's MultiVoxelFit cannot be pickled across loky's process boundary, so naively reusing DTI's fit/predict-split path crashes (BrokenProcessPool / RecursionError). BLAS thread-capping was ruled out as a latency factor (no effect — the fit is GIL-bound per-voxel Python work). Full diagnosis with timings in #442.

Fix

Add an exact, in-worker parallel path for models whose fitted object cannot cross a process boundary:

New _exec_fit_predict(...) worker: builds the model, fits its voxel chunk, predicts the held-out gradient, and returns only the predicted ndarray — no model instance is ever serialized.
New BaseDWIModel._fit_predict_chunked(index, n_jobs): splits voxels (and the aligned S0) across workers, runs the worker per chunk, and re-assembles the prediction. Because voxel-wise fitting is independent, the result is bit-identical to the serial path.
fit_predict routes to this path when not self._picklable_fit and n_jobs > 1. DKIModel sets _picklable_fit = False; all other models keep the existing behavior.
Removed the unreachable elif is_dki: (dead model.multi_fit branch) and factored the shared data prep into _lovo_data.
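
The chunked path described above can be illustrated with a minimal, standard-library-only sketch. It is not the PR's code: `split_chunks`, `fit_predict_chunk`, and `fit_predict_chunked` are hypothetical names, the per-voxel "model" is a toy mono-exponential fit rather than DKI, and a thread pool stands in for the joblib/loky process workers so the example stays self-contained. It only shows the shape of the idea: split the voxels and the aligned S0, fit and predict inside the worker, return plain predicted values, and reassemble them in order.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def split_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous, nearly equal parts."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        stop = start + base + (1 if i < extra else 0)
        chunks.append(items[start:stop])
        start = stop
    return chunks


def fit_predict_chunk(signals, s0, bvals, index):
    """Toy worker: fit each voxel without volume `index`, then predict it.

    Only plain floats are returned; the fitted parameter stays in the worker.
    """
    train = [i for i in range(len(bvals)) if i != index]
    predicted = []
    for voxel, voxel_s0 in zip(signals, s0):
        # Least-squares slope (through the origin) of -log(S/S0) against b.
        num = sum(-math.log(voxel[i] / voxel_s0) * bvals[i] for i in train)
        den = sum(bvals[i] ** 2 for i in train)
        adc = num / den
        predicted.append(voxel_s0 * math.exp(-bvals[index] * adc))
    return predicted


def fit_predict_chunked(signals, s0, bvals, index, n_jobs):
    """Split voxels and the aligned S0, run the worker per chunk, reassemble."""
    signal_chunks = split_chunks(signals, n_jobs)
    s0_chunks = split_chunks(s0, n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        results = pool.map(
            fit_predict_chunk,
            signal_chunks,
            s0_chunks,
            [bvals] * n_jobs,
            [index] * n_jobs,
        )
        # Executor.map preserves input order, so chunks concatenate correctly.
        return [value for chunk in results for value in chunk]


# Synthetic data (made up for illustration): 10 voxels, 5 diffusion volumes.
bvals = [1000.0, 2000.0, 3000.0, 1000.0, 2000.0]
s0 = [100.0 + v for v in range(10)]
signals = [
    [s * math.exp(-b * (0.0005 + 0.00001 * v)) for b in bvals]
    for v, s in enumerate(s0)
]

serial = fit_predict_chunk(signals, s0, bvals, index=2)
parallel = fit_predict_chunked(signals, s0, bvals, index=2, n_jobs=4)
```

Because every voxel goes through the same arithmetic on the same inputs regardless of which chunk it lands in, `serial` and `parallel` compare equal element by element, which is the property the PR relies on for DKI.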

The DTI fit/predict-split path is untouched. DKI at n_jobs == 1, direct _fit calls, and single_fit are unchanged.

Benchmark (synthetic multi-shell, b0 + 3 shells, n_jobs=8)

voxels	DKI serial (before)	DKI parallel (after)	speedup
20,000	19.3 s/vol	3.3 s/vol	5.8×

Per-volume output matches the serial path exactly (max|diff| = 0).

Tests

New test_dki_parallel_matches_serial (index ∈ {4,9} × use_mask ∈ {False,True}) asserts fit_predict(index, n_jobs=4) (chunked) equals fit_predict(index, n_jobs=1) (serial).
Full test_model_dmri.py DKI/DTI suite green (132 passed).

Notes

single_fit (fit-once, predict-each) could cut runtime further but is approximate and currently buggy for n_jobs > 1; left for a follow-up.
CHANGES.rst intentionally untouched (release-generated).

DKI was forced onto the serial full-brain fit path, making volume-to-volume motion estimation grow linearly with voxel count (~120x slower than DTI on multi-shell data; ~32 h/series, effectively hung). Its fitted MultiVoxelFit object cannot be pickled across loky's process boundary, so the fit/predict-split parallelization DTI uses crashes for DKI.

Add an exact in-worker fit+predict path (_fit_predict_chunked) for models flagged `_picklable_fit = False`: each worker fits its voxel chunk and predicts the held-out gradient, returning only the (picklable) predicted array. Voxel-wise fitting is independent, so results are numerically identical to the serial path (max|diff| = 0) at ~5.8x speedup (n_jobs=8).

Also remove the unreachable `elif is_dki` branch and factor out _lovo_data.

Resolves: nipreps#442
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-09T20:48:22Z

Codecov Report

❌ Patch coverage is 81.08108% with 7 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@85e3a45). Learn more about missing BASE report.
⚠️ Report is 94 commits behind head on main.

Files with missing lines	Patch %	Lines
src/nifreeze/model/dmri.py	81.08%	6 Missing and 1 partial ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #443   +/-   ##
=======================================
  Coverage        ?   84.83%           
=======================================
  Files           ?       37           
  Lines           ?     2183           
  Branches        ?      245           
=======================================
  Hits            ?     1852           
  Misses          ?      295           
  Partials        ?       36


Record (niprepsgh-442) that the in-worker chunking for DKI can be replaced by DIPY's own multi_voxel engine path once a release forwards `engine` through `DKIModel.fit` and stops leaking orchestration kwargs into the per-voxel kernel (both broken in DIPY <= 1.10.0; arokem's suggested route).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

oesteban · 2026-06-09T21:24:16Z

Re: @arokem's suggestion to use DIPY's multi_voxel decorator (Ray/engine) — I tested it against the pinned DIPY (1.10.0) and it isn't usable on our current floor: DKIModel.fit() doesn't accept engine, and the reachable multi_fit(engine=...) path hits a decorator kwargs-leak bug (engine forwarded into ls_fit_dki()); ray also isn't a dependency. Details in #442 (comment). This PR's in-worker chunking is therefore the interim that works on dipy>=1.5; a code comment marks the DIPY-native path as the follow-up once a release supports it.

arokem · 2026-06-09T21:28:44Z

+    The fit happens inside the worker and only the predicted array is returned,
+    so no model instance is ever serialized. See gh-442.
+    """
+    module_name, class_name = model_class.rsplit(".", 1)


Maybe

Suggested change

-    module_name, class_name = model_class.rsplit(".", 1)
+    module_name, class_name = model_class.rsplit(".", -1)

in case of dipy.reconst.dti and similar?

oesteban · 2026-06-10

Since it's right split, should be 1, right?
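
For reference, the difference between the two `maxsplit` values can be checked with the standard library alone. The helper below is a hypothetical sketch (the name `resolve_class` and the follow-up `importlib` lookup are assumptions about how a dotted path is typically resolved, not the PR's code), and `collections.abc.Mapping` stands in for a DIPY class path so the snippet runs without DIPY installed.

```python
import importlib


def resolve_class(dotted_path):
    """Resolve 'package.module.ClassName' to the class object."""
    # maxsplit=1 cuts only at the last dot: everything before it is the module.
    module_name, class_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


path = "collections.abc.Mapping"

last_dot_only = path.rsplit(".", 1)   # ['collections.abc', 'Mapping']
every_dot = path.rsplit(".", -1)      # ['collections', 'abc', 'Mapping']

try:
    # maxsplit=-1 means "no limit", so a nested path yields three parts
    # and two-name unpacking fails.
    module_name, class_name = path.rsplit(".", -1)
    unpack_error = None
except ValueError as exc:
    unpack_error = exc
```

With `maxsplit=1` the module part keeps its inner dots, which is what a nested path such as a `dipy.reconst` submodule needs; with `-1` the two-name unpacking only works when the path contains exactly one dot.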

Co-authored-by: Ariel Rokem <arokem@gmail.com>

oesteban mentioned this pull request Jun 9, 2026

DKI head-motion estimation is impractically slow (~8.6 vol/h, ~32 h/series) #442

Open

arokem approved these changes Jun 9, 2026


Update src/nifreeze/model/dmri.py

88e5ba6

Co-authored-by: Ariel Rokem <arokem@gmail.com>

oesteban mentioned this pull request Jun 10, 2026

BUG: multi_voxel_fit decorator forwards orchestration kwargs into the per-voxel fit function dipy/dipy#4053

Open

oesteban wants to merge 3 commits into nipreps:main from oesteban:fix/dki-speed

2 participants
