ENH: Memory-mapped data management for BaseDataset#422
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 155292cb24
            data[k] = _h5_to_memmap(v, cache_dir / f"{k}.npy")
        else:
            data[k] = np.asanyarray(v)
    return cls(**data, cache_dir=cache_dir, original_h5=Path(filename))
Copy source HDF5 into cache before setting original_h5
from_filename currently stores original_h5=Path(filename), which makes later calls to _load_original_frame() depend on the caller’s original file path instead of a dataset-owned cache. This affects PET flows that now call _load_original_frame (for example PET.set_transform and PETMotionEstimator.run): if the source file is temporary, moved, or overwritten in place, these operations can fail or read non-original frames. Keeping an internal snapshot under cache_dir avoids this fragile external-file coupling.
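The snapshot suggested above could look roughly like the following. This is a minimal sketch, not code from the PR: the helper name `snapshot_source` and the file name `original.h5` are assumptions made for illustration, and only standard-library calls are used.

```python
import shutil
from pathlib import Path
from tempfile import mkdtemp


def snapshot_source(filename, cache_dir=None):
    """Copy the source file into a dataset-owned directory and return the copy's path.

    Hypothetical helper (not part of the PR); "original.h5" is an assumed name.
    """
    # Fall back to a fresh temporary directory when no cache directory is given.
    cache_dir = Path(cache_dir) if cache_dir is not None else Path(mkdtemp())
    cache_dir.mkdir(parents=True, exist_ok=True)

    snapshot = cache_dir / "original.h5"
    # copy2 also preserves file metadata such as modification time.
    shutil.copy2(filename, snapshot)
    return snapshot
```

With a helper like this, `from_filename` could pass the returned path as `original_h5`, so later frame reads would no longer depend on the caller's file staying in place.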
    resampled_path = Path(mkdtemp()) / "resampled.npy"
    resampled = np.lib.format.open_memmap(
Clean up temporary memmap created during to_nifti
When motion_affines is set, to_nifti now creates resampled.npy under Path(mkdtemp()) and never deletes that directory. In repeated resampling/export workflows, this leaves one full 4D temporary file per call and can exhaust disk space over long runs. The temporary memmap should be lifecycle-managed (or avoided) so the cache is removed after writing/returning the image.
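One way to lifecycle-manage the scratch file is to create it inside a `tempfile.TemporaryDirectory` context and write the final output before the context exits. The sketch below illustrates that pattern only; `export_resampled`, `transform`, and `scratch_root` are hypothetical names, and the output is a plain `.npy` file rather than the NIfTI image that `to_nifti` produces.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

import numpy as np


def export_resampled(data, transform, out_path, scratch_root=None):
    """Transform each 3D volume of a 4D array through a scratch memmap, then save it.

    Hypothetical stand-in for the resampling step; `transform` maps a 3D array
    to a 3D array of the same shape.
    """
    with TemporaryDirectory(dir=scratch_root) as tmpdir:
        scratch = np.lib.format.open_memmap(
            Path(tmpdir) / "resampled.npy",
            mode="w+",
            dtype=data.dtype,
            shape=data.shape,
        )
        # Process one volume at a time so only a single frame is held in RAM.
        for t in range(data.shape[-1]):
            scratch[..., t] = transform(data[..., t])
        scratch.flush()

        # Write the final output while the scratch file still exists.
        np.save(out_path, scratch)
        # Release the mapping before the directory is removed (required on Windows).
        del scratch
    return Path(out_path)
```

The scratch directory is deleted when the `with` block exits, including when an exception is raised, so repeated calls do not accumulate temporary 4D files.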
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #422      +/-   ##
==========================================
+ Coverage   83.79%   83.95%   +0.16%
==========================================
  Files          37       37
  Lines        2135     2175      +40
  Branches      235      241       +6
==========================================
+ Hits         1789     1826      +37
- Misses        304      309       +5
+ Partials       42       40       -2
Replace in-memory arrays with numpy memory-mapped files throughout the data layer to reduce peak RAM usage and enable subprocess sharing.

Key changes:
- Add _to_memmap() / _h5_to_memmap() helpers for streaming 4D data to disk-backed numpy memmaps without full materialization
- Replace _filepath field with _cache_dir directory for organizing memmap files and HDF5 caches
- Auto-convert dataobj and brainmask to memmap in __attrs_post_init__
- Cache original (pre-transform) dataset to HDF5 via _original_h5 for later frame-by-frame retrieval
- Add _load_original_frame() / _load_original_field() accessors
- Stream HDF5→memmap in from_filename() instead of materializing
- DWI: memmap-to-memmap b0 filtering (avoids fancy-index copy)
- PET: remove lofo_split(); use __getitem__ + _load_original_frame()
- PET: rewrite set_transform() to use _load_original_frame()
- PET: delegate to_filename()/from_filename() to base class
- Define _array_eq comparator (require_same_type=False) for memmap compatibility with attrs equality checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
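The streaming and memmap-to-memmap filtering ideas in this commit message can be sketched as follows. This is an illustrative approximation, not the PR's `_to_memmap()` or its b0 filter: `stream_to_memmap` and its `keep` argument are hypothetical, and the source is assumed to expose `shape`, `dtype`, and basic slicing (as a NumPy array or memmap does; an h5py dataset should behave similarly).

```python
from pathlib import Path

import numpy as np


def stream_to_memmap(source, path, keep=None):
    """Copy a 4D array-like into a disk-backed .npy memmap, one volume at a time.

    Hypothetical sketch. `keep` is an optional boolean mask over the last axis;
    only volumes where it is True are copied.
    """
    n_vols = source.shape[-1]
    indices = list(range(n_vols)) if keep is None else np.flatnonzero(keep).tolist()

    out = np.lib.format.open_memmap(
        Path(path),
        mode="w+",
        dtype=source.dtype,
        shape=(*source.shape[:-1], len(indices)),
    )
    # Copy volume by volume: a fancy-indexed source[..., keep] would instead
    # materialize every selected volume in RAM at once.
    for dst, src in enumerate(indices):
        out[..., dst] = source[..., src]
    out.flush()
    return out
```

Because the result is an ordinary `.npy` file, another process can reopen it with `np.load(path, mmap_mode="r")` without copying the data.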
Relocate the PET `srtm` simulation function test to a dedicated file. This groups tests more logically and avoids cluttering the PET model testing file if the function needs more tests. Eventually, all simulation function tests would live in this file.
Increase consistency in branch and tag lists in GHA workflow files:
- Use single quotes.
- Quote all branch names, including `main`.
- Adopt flow style (inline) vs block style (multi-line) when listing branch names.

Improves compactness, and reduces cognitive burden as the style expected across files is more consistent.
58b51b2 to
11b889d
Replace in-memory arrays with numpy memory-mapped files throughout the data layer to reduce peak RAM usage and enable subprocess sharing.
Key changes: