feat: add chunked histograms #685
Assisted-by: Kimi-K2.6
Signed-off-by: Henry Schreiner <henryfs@princeton.edu>
Pull request overview
This PR introduces a new ChunkedHist implementation to store histograms with categorical (chunk) axes as a dict of dense backing arrays keyed by categorical values, avoiding repeated dense reallocations when categories grow (as in issue #684).
Changes:
- Added `src/hist/chunked.py` implementing `ChunkedHist`, chunk-key selection (including wildcard support for `StrCategory`), materialization via `to_hist()`, and merging via `+`/`+=`.
- Added `tests/test_chunked.py` covering construction, filling, selection, merging, materialization, and basic utility behaviors.
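
To make the storage idea in the overview concrete, here is a minimal sketch of keeping one dense array per categorical key and materializing only at the end. It is not the PR's `ChunkedHist`: the class name `ChunkedCounts`, its methods, and the use of `np.histogram` are assumptions made purely for illustration.

```python
import numpy as np


class ChunkedCounts:
    """Toy stand-in (hypothetical, not the PR's ChunkedHist)."""

    def __init__(self, edges):
        self.edges = np.asarray(edges, dtype=float)
        self.chunks = {}  # category key -> dense counts array

    def fill(self, key, values):
        # A chunk is allocated only the first time its key is seen;
        # chunks that already exist are never reallocated.
        counts = self.chunks.setdefault(key, np.zeros(len(self.edges) - 1))
        counts += np.histogram(values, bins=self.edges)[0]

    def to_dense(self):
        # Materialize once: one row per category, in sorted key order.
        keys = sorted(self.chunks)
        return keys, np.stack([self.chunks[k] for k in keys])


h = ChunkedCounts(edges=[0.0, 1.0, 2.0, 3.0])
h.fill("data", [0.5, 1.5, 1.7])
h.fill("mc", [2.5])
h.fill("data", [0.1])
keys, dense = h.to_dense()
```

Adding a new category here costs one small allocation, whereas a single dense array with a growing categorical axis would have to be reallocated and copied each time.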
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/hist/chunked.py | Implements the ChunkedHist data structure, fill/materialize/select/merge logic, and helper utilities for chunk-key normalization and dense-view accumulation. |
| tests/test_chunked.py | Adds a comprehensive test suite for the new ChunkedHist API and expected behaviors. |

        np.ascontiguousarray(chunk_view),
        shape=self.dense_view_shape,
        dtype=self.dense_view_dtype,
    )
    if not array.flags.writeable:
        array = array.copy()

    values = (raw_value,) if isinstance(raw_value, str | int) else tuple(raw_value)
    if not values:
        raise ValueError(f"slice for axis {axis_name!r} must be non-empty")
    normalized[axis_name] = tuple(
        _normalize_chunk_scalar(value) for value in values

    axes = list(self.axes)
    keys_by_axis = self._keys_by_axis(self._chunks)
    for spec in self.chunk_axes:
        keys = keys_by_axis[spec.name]
        if issubclass(spec.axis_type, bh.axis.IntCategory):

    if not matches:
        raise ValueError(f"No matches found for {pattern!r}")

    axes_repr = ",\n ".join(
        # ChunkedHist categorical axes are always growable in practice
        f"{type(axis).__name__}(..., growth=True, name={axis.name!r})"
        if isinstance(axis, bh.axis.IntCategory | bh.axis.StrCategory)
        else repr(axis)
        for axis in self.axes
    )
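
The first snippet above guards on `array.flags.writeable` before accumulating. A small standalone sketch of why such a guard matters: arrays that borrow an immutable buffer are read-only, so in-place updates need a copy first. The helper name `as_writeable` is hypothetical and does not appear in the PR.

```python
import numpy as np


def as_writeable(array):
    """Return `array` itself if it is writeable, otherwise a writeable copy."""
    return array if array.flags.writeable else array.copy()


raw = bytes(8 * 3)  # immutable buffer holding three zeroed float64 values
view = np.frombuffer(raw, dtype=np.float64)  # read-only: it borrows the bytes object
safe = as_writeable(view)
safe += 1.0  # in-place accumulation works on the copy; `view` is untouched
```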
pfackeldey left a comment
Looks good to me!
There are a few small things where I'm not sure what your preferences are, @henryiii, e.g., allowing wildcard matching of categories in `__getitem__`, or the fix in `_save_chunk_view` (which sounds like something that should be fixed...).
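
For reference, wildcard selection over string categories could look roughly like the sketch below. The page does not show which pattern syntax the PR uses, so shell-style globbing via the standard-library `fnmatch` module is an assumption, and `match_categories` is a hypothetical helper; only the error message mirrors the snippet quoted in the review.

```python
import fnmatch


def match_categories(pattern, categories):
    """Return the categories matching a shell-style pattern, in input order."""
    matches = [c for c in categories if fnmatch.fnmatchcase(c, pattern)]
    if not matches:
        raise ValueError(f"No matches found for {pattern!r}")
    return matches


names = ["ttbar_2017", "ttbar_2018", "wjets_2018"]
selected = match_categories("ttbar_*", names)
```

`fnmatchcase` is used rather than `fnmatch` so that matching stays case-sensitive on every platform.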
Close #684.
🤖 Suggested followups
Here are the natural followups, roughly ordered by impact vs. effort:
1. Top-level export (`from hist import ChunkedHist`)

   Currently only `from hist.chunked import ChunkedHist` works. Adding it to `hist/__init__.py` is trivial and would make it discoverable.

2. Array-valued chunk axes in `fill()`

   Right now `fill()` requires scalar chunk-axis values. Supporting array-valued chunk axes would group by chunk key and dispatch to multiple chunks in one call. This is a common user expectation.

   Tradeoff: More complex because you need to group the dense-axis data by chunk key and call `dense_hist.fill()` per group.

3. Native chunked UHI serialization

   Right now round-tripping through JSON/bytes requires `to_hist()` first, which is expensive for large histograms. A native format that serializes chunk metadata + individual chunk arrays would avoid materialization.

4. Custom `__getstate__`/`__setstate__`

   For pickle/dill interop. Without this, pickling a `ChunkedHist` won't work correctly (it has unpicklable internal state like the scratch hist reference).

5. Reporter-style operations: `*`, `-`, `/`, `**`

   Only `+`/`+=` are implemented. Multiplication, subtraction, and division could be useful for e.g. weighted subtraction of backgrounds.

6. Relax `Mean`/`WeightedMean` storage restriction

   The scratch-histogram reuse trick is trickier with structured storages, but it's solvable with per-field accumulation.

7. Support transformed `Regular` axes

   Currently `Regular(..., transform=...)` is rejected. This is just a validation gate that can be lifted once tested.

8. Thread-safe filling

   The current `fill()` reuses a single scratch `Hist` per instance. Parallel filling from multiple threads would race on that scratch buffer. Options:
   - ... `fill()`
   - ... `Hist` per fill (slower but simpler)

9. `chunk_view()` on missing keys returns zeros instead of raising

   Currently missing chunks raise `KeyError`. Some workflows might prefer getting a zeroed view for missing chunks (like `dict.get()`).

10. Documentation page

    A short user-guide section explaining when to reach for `ChunkedHist` vs. plain `Hist` vs. daskhist.

My suggestion for priority order
What resonates with you?
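
As a rough illustration of followup 2 (array-valued chunk axes in `fill()`), the grouping step might be sketched as below. This is only an assumption about how it could be done with `np.unique`; `group_by_chunk` is a hypothetical helper and nothing here is code from the PR.

```python
import numpy as np


def group_by_chunk(keys, values):
    """Split `values` into one array per distinct chunk key."""
    keys = np.asarray(keys)
    values = np.asarray(values)
    # `inverse[i]` is the index into `unique_keys` of the key for entry i.
    unique_keys, inverse = np.unique(keys, return_inverse=True)
    return {key.item(): values[inverse == i] for i, key in enumerate(unique_keys)}


groups = group_by_chunk(
    ["a", "b", "a", "c", "b"],
    [0.1, 0.2, 0.3, 0.4, 0.5],
)
# Each group would then be handed to a single per-chunk dense fill.
```

The boolean-mask loop costs one pass over the data per distinct key, which is fine for a handful of categories; a sort-based split would scale better when there are many.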
🤖 Assisted-by: Kimi-K2.6