Skip to content

ENH: Memory-mapped data management for BaseDataset#422

Draft
oesteban wants to merge 4 commits into
nipreps:mainfrom
oesteban:enh/memory-mapping
Draft

ENH: Memory-mapped data management for BaseDataset#422
oesteban wants to merge 4 commits into
nipreps:mainfrom
oesteban:enh/memory-mapping

Conversation

@oesteban

@oesteban oesteban commented Feb 8, 2026

Copy link
Copy Markdown
Member

Replace in-memory arrays with numpy memory-mapped files throughout the data layer to reduce peak RAM usage and enable subprocess sharing.

Key changes:

  • Add _to_memmap() / _h5_to_memmap() helpers for streaming 4D data to disk-backed numpy memmaps without full materialization
  • Replace _filepath field with _cache_dir directory for organizing memmap files and HDF5 caches
  • Auto-convert dataobj and brainmask to memmap in attrs_post_init
  • Cache original (pre-transform) dataset to HDF5 via _original_h5 for later frame-by-frame retrieval
  • Add _load_original_frame() / _load_original_field() accessors
  • Stream HDF5→memmap in from_filename() instead of materializing
  • DWI: memmap-to-memmap b0 filtering (avoids fancy-index copy)
  • PET: remove lofo_split(); use getitem + _load_original_frame()
  • PET: rewrite set_transform() to use _load_original_frame()
  • PET: delegate to_filename()/from_filename() to base class
  • Define _array_eq comparator (require_same_type=False) for memmap compatibility with attrs equality checks

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 155292cb24

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/nifreeze/data/base.py
data[k] = _h5_to_memmap(v, cache_dir / f"{k}.npy")
else:
data[k] = np.asanyarray(v)
return cls(**data, cache_dir=cache_dir, original_h5=Path(filename))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Copy source HDF5 into cache before setting original_h5

from_filename currently stores original_h5=Path(filename), which makes later calls to _load_original_frame() depend on the caller’s original file path instead of a dataset-owned cache. This affects PET flows that now call _load_original_frame (for example PET.set_transform and PETMotionEstimator.run): if the source file is temporary, moved, or overwritten in place, these operations can fail or read non-original frames. Keeping an internal snapshot under cache_dir avoids this fragile external-file coupling.

Useful? React with 👍 / 👎.

Comment thread src/nifreeze/data/base.py
Comment on lines +688 to +689
resampled_path = Path(mkdtemp()) / "resampled.npy"
resampled = np.lib.format.open_memmap(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clean up temporary memmap created during to_nifti

When motion_affines is set, to_nifti now creates resampled.npy under Path(mkdtemp()) and never deletes that directory. In repeated resampling/export workflows, this leaves one full 4D temporary file per call and can exhaust disk space over long runs. The temporary memmap should be lifecycle-managed (or avoided) so the cache is removed after writing/returning the image.

Useful? React with 👍 / 👎.

@codecov

codecov Bot commented Feb 8, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 91.39785% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.95%. Comparing base (8aea870) to head (58b51b2).

Files with missing lines Patch % Lines
src/nifreeze/data/base.py 87.69% 7 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #422      +/-   ##
==========================================
+ Coverage   83.79%   83.95%   +0.16%     
==========================================
  Files          37       37              
  Lines        2135     2175      +40     
  Branches      235      241       +6     
==========================================
+ Hits         1789     1826      +37     
- Misses        304      309       +5     
+ Partials       42       40       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@oesteban

oesteban commented Feb 9, 2026

Copy link
Copy Markdown
Member Author

Branch Review: enh/memory-mapping

Commit: 155292cb — "enh: memory-mapped data management for BaseDataset"
Files changed: 11 (234 insertions, 105 deletions)


Goal

Replace in-memory np.ndarray storage in BaseDataset (and subclasses DWI, PET) with np.memmap-backed arrays, so that large 4D neuroimaging datasets live on disk and are paged into RAM on demand rather than fully materialized. The original data is also persisted to an HDF5 file to allow subclasses to retrieve unmodified frames (e.g., before motion correction overwrites them).


Architecture of the Change

1. Two new free functions (base.py:45-115):

  • _to_memmap(arr, path) — copies an ndarray into a .npy memmap, streaming volume-by-volume for 4D. Returns the input unchanged if already a memmap.
  • _h5_to_memmap(h5dset, path) — streams an HDF5 dataset into a memmap, using read_direct for <=3D and volume-by-volume for 4D.

2. BaseDataset.__attrs_post_init__ now:

  • Converts dataobj and brainmask to memmaps under _cache_dir.
  • Writes the original dataset to _original_h5 (HDF5) for later frame retrieval.

3. _filepath becomes a property derived from _cache_dir (backward compat).

4. Two new retrieval methods: _load_original_frame(index) and _load_original_field(field, index).

5. Subclass adaptations:

  • DWI.__attrs_post_init__: calls super(), then filters b0 volumes into a new memmap rather than fancy-indexing the ndarray.
  • PET: removes the lofo_split method entirely; set_transform and the estimator now use _load_original_frame instead.
  • PET.to_filename: delegates to super().to_filename() then patches HDF5 attributes (avoids duplicating the serialization loop).
  • PET.from_filename: delegates to super().from_filename().

6. Estimator: LOFO logic for PET inlined into PETMotionEstimator.run, using __getitem__ + _load_original_frame.


Observations and Potential Issues

1. Double serialization on construction

In BaseDataset.__attrs_post_init__ (base.py:403-412):

self.dataobj = _to_memmap(self.dataobj, self._cache_dir / "dataobj.npy")
...
if self._original_h5 is None:
    self._original_h5 = self._cache_dir / "original.h5"
    self.to_filename(self._original_h5)

Every construction writes the full dataset twice: once to a .npy memmap, then again to an HDF5 file. For a 4D dataset of, say, 2 GB, this means 4 GB of disk I/O at instantiation time. The HDF5 write also reads back from the memmap it just created.

Consider whether the HDF5 cache could be written lazily (on first _load_original_frame call) or whether the memmap itself could serve as the "original" store (since it's created before any subclass modifications when super().__attrs_post_init__() is called first).

2. super().__attrs_post_init__() ordering in DWI

DWI.__attrs_post_init__ (dmri/base.py:140-141) calls super().__attrs_post_init__() first, which writes the original HDF5 before b0 volumes are stripped. This is actually the correct behavior (the original data should include b0s), but it's worth noting that the "original" HDF5 preserves the full gradient table and all volumes, while self.dataobj and self.gradients are subsequently modified. If someone later calls _load_original_frame(idx), idx refers to the original indexing (including b0s), which differs from the current self.dataobj indexing. This could be a source of subtle bugs if callers aren't careful about which index space they're in.

3. Train dataset construction in estimator creates another HDF5 cache

In estimator.py:257-264, each LOFO iteration creates a new PET(...) with original_h5=pet_dataset._original_h5. But the PET.__init__BaseDataset.__attrs_post_init__ path still creates a new memmap and writes a new HDF5 (because the sliced train_data differs from the original). Wait — actually original_h5 is passed, so the if self._original_h5 is None check prevents a new write. But _to_memmap still creates a new .npy file for each train subset in each iteration. For N frames, that's N temp memmap files for the training subsets.

4. MEMMAP_CHUNK_VOL = 1 is very conservative

Copying one volume at a time maximizes the number of I/O operations. For typical neuroimaging volumes (e.g., 100x100x100 x float32 = ~4 MB per volume), batching 10-50 volumes per chunk would dramatically reduce overhead while still keeping peak memory well below the full dataset size.

5. Temp files are never cleaned up

_cache_dir is created via mkdtemp() and there's no __del__, context manager, or atexit cleanup. The to_nifti function (base.py:688) also creates a temp memmap file via mkdtemp() that is never cleaned up. Over the lifetime of a processing session with many datasets, this could accumulate significant disk usage in /tmp.

6. Validator relaxation in validate_dataobj

The check changed from isinstance(value, np.ndarray) to hasattr(value, "shape") and hasattr(value, "dtype"). This is needed because np.memmap is technically an ndarray subclass (so the old check would have worked), but the duck-typing approach is more permissive. It now accepts anything with .shape and .dtype — including, e.g., an h5py Dataset or a torch Tensor. Whether this is intentional broadening or accidental over-relaxation is worth considering.

7. _array_eq with require_same_type=False

_array_eq = attrs.cmp_using(eq=_cmp, require_same_type=False)

This allows comparing np.ndarray to np.memmap for equality, which is needed. The underlying _cmp function uses np.allclose, which handles both types. This looks correct.

8. _cmp still exported but also _array_eq now exported

_cmp is still used in _array_eq and is still importable, but subclasses now import _array_eq instead of _cmp. The old _cmp remains in the public API surface of base.py — this is fine for backward compatibility but _cmp could eventually be made truly private if no external consumers use it.

9. PET to_filename opens the file in "r+" mode

super().to_filename(filename, ...)  # creates file in "w" mode
...
with h5py.File(filename, "r+") as out_file:  # re-opens to patch attrs

This works but is fragile: it assumes super().to_filename() has already created the file. If the filename doesn't get the .h5 extension appended by the parent (edge case: it already ends in .h5), the second open should still work. But there's a subtle issue: super().to_filename() may append .h5 to the filename, but then PET.to_filename re-computes the suffixed path independently. If the logic diverges, the r+ open could target a different path. In practice both use the same logic so it's fine today, but it's a coupling worth noting.

10. Unused _load_original_field method

_load_original_field (base.py:432-451) is defined but never called anywhere in the diff. If it's intended for future use, that's fine, but it adds surface area without a current consumer or test.

11. from_filename passes cache_dir and original_h5 as keyword args

return cls(**data, cache_dir=cache_dir, original_h5=Path(filename))

These are private attrs fields (_cache_dir, _original_h5), and attrs with slots=True expects them as cache_dir and original_h5 (without underscore) in the constructor. This relies on attrs's convention of stripping the leading underscore for the init parameter name. It works, but it's an implementation detail of attrs that could surprise readers.

12. Test changes

  • test_data.py: The monkeypatched SimpleBaseDataset now accepts brainmask via kwargs — consistent with the constructor change.
  • test_integration.py: Removes manual _filepath assignments (no longer needed since _cache_dir is auto-created).
  • test_integration_pet.py: test_lofo_split_shapes now does inline LOFO instead of calling the removed lofo_split. The test still verifies the same invariants.

Summary

The change is well-structured and achieves its goal of reducing memory pressure for large datasets. The main concerns are:

  1. Performance: double-write at construction (memmap + HDF5) and chunk size of 1 volume.
  2. Resource management: temp files never cleaned up.
  3. Index-space ambiguity: _load_original_frame operates in original index space, while self.dataobj may have been filtered (b0 removal in DWI).
  4. Unused code: _load_original_field has no callers.

The PET simplification (removing lofo_split, delegating to_filename/from_filename to the base class) is a clear improvement in code deduplication.


Review generated with Claude Code

@oesteban oesteban marked this pull request as draft February 9, 2026 07:49
oesteban and others added 4 commits June 12, 2026 21:41
Replace in-memory arrays with numpy memory-mapped files throughout
the data layer to reduce peak RAM usage and enable subprocess sharing.

Key changes:
- Add _to_memmap() / _h5_to_memmap() helpers for streaming 4D data
  to disk-backed numpy memmaps without full materialization
- Replace _filepath field with _cache_dir directory for organizing
  memmap files and HDF5 caches
- Auto-convert dataobj and brainmask to memmap in __attrs_post_init__
- Cache original (pre-transform) dataset to HDF5 via _original_h5
  for later frame-by-frame retrieval
- Add _load_original_frame() / _load_original_field() accessors
- Stream HDF5→memmap in from_filename() instead of materializing
- DWI: memmap-to-memmap b0 filtering (avoids fancy-index copy)
- PET: remove lofo_split(); use __getitem__ + _load_original_frame()
- PET: rewrite set_transform() to use _load_original_frame()
- PET: delegate to_filename()/from_filename() to base class
- Define _array_eq comparator (require_same_type=False) for memmap
  compatibility with attrs equality checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Relocate PET `srtm` simulation function test to a dedicated file,
following a more logical testing rationale. Also, it avoids cluttering
the PET model testing file if more testing for the function at issue is
needed.

Eventually all simulation function tests would dwell in this file.
Increase consistency in branch and tag lists in GHA workflow files:
- Use single quotes.
- Quote all branch names, including `main`.
- Adopt flow style (inline) vs block style (multi-line) when listing
  branch names.

Improves compactness, and reduces cognitive burden as the style expected
across files is more consistent.
@oesteban oesteban force-pushed the enh/memory-mapping branch from 58b51b2 to 11b889d Compare June 12, 2026 19:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants