Skip to content

feat: add chunked histograms#685

Draft
henryiii wants to merge 1 commit into
mainfrom
henryiii/feat/chunked
Draft

feat: add chunked histograms#685
henryiii wants to merge 1 commit into
mainfrom
henryiii/feat/chunked

Conversation

@henryiii
Copy link
Copy Markdown
Member

@henryiii henryiii commented May 6, 2026

Close #684.

🤖 Suggested followups

Here are the natural followups, roughly ordered by impact vs. effort:


1. Top-level export (from hist import ChunkedHist)

Currently only from hist.chunked import ChunkedHist works. Adding it to hist/__init__.py is trivial and would make it discoverable.


2. Array-valued chunk axes in fill()

Right now fill() requires scalar chunk-axis values:

h.fill(x=[0.1, 0.2], cat="a")   # OK
h.fill(x=[0.1, 0.2], cat=["a", "b"])  # raises

Supporting array-valued chunk axes would group by chunk key and dispatch to multiple chunks in one call. This is a common user expectation.

Tradeoff: More complex because you need to group the dense-axis data by chunk key and call dense_hist.fill() per group.


3. Native chunked UHI serialization

Right now round-tripping through JSON/bytes requires to_hist() first, which is expensive for large histograms. A native format that serializes chunk metadata + individual chunk arrays would avoid materialization.


4. Custom __getstate__ / __setstate__

For pickle/dill interop. Without this, pickling a ChunkedHist won't work correctly (it has unpicklable internal state like the scratch hist reference).


5. Reporter-style operations: *, -, /, **

Only + / += are implemented. Multiplication, subtraction, division could be useful for e.g. weighted subtraction of backgrounds.


6. Relax Mean / WeightedMean storage restriction

The scratch-histogram reuse trick is trickier with structured storages, but it's solvable with per-field accumulation.


7. Support transformed Regular axes

Currently Regular(..., transform=...) is rejected. This is just a validation gate that can be lifted once tested.


8. Thread-safe filling

The current fill() reuses a single scratch Hist per instance. Parallel filling from multiple threads would race on that scratch buffer. Options:

  • A thread-local scratch buffer
  • A lock around fill()
  • Remove the scratch buffer entirely and create a temporary Hist per fill (slower but simpler)

9. chunk_view() on missing keys returns zeros instead of raising

Currently missing chunks raise KeyError. Some workflows might prefer getting a zeroed view for missing chunks (like dict.get()).


10. Documentation page

A short user-guide section explaining when to reach for ChunkedHist vs. plain Hist vs. dask hist.


My suggestion for priority order

  1. Top-level export (trivial, high visibility)
  2. Pickle support (needed for real workflows)
  3. Array-valued chunk axes (biggest UX improvement)
  4. Native serialization (needed for large-scale I/O)
  5. Thread safety (needed for parallel analysis)

What resonates with you?

🤖 Assisted-by: Kimi-K2.6

@henryiii henryiii force-pushed the henryiii/feat/chunked branch 2 times, most recently from 92b722a to 0d1fe90 Compare May 6, 2026 21:07
Assisted-by: Kimi-K2.6
Signed-off-by: Henry Schreiner <henryfs@princeton.edu>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new ChunkedHist implementation to store histograms with categorical (chunk) axes as a dict of dense backing arrays keyed by categorical values, avoiding repeated dense reallocations when categories grow (as in issue #684).

Changes:

  • Added src/hist/chunked.py implementing ChunkedHist, chunk-key selection (including wildcard support for StrCategory), materialization via to_hist(), and merging via + / +=.
  • Added tests/test_chunked.py covering construction, filling, selection, merging, materialization, and basic utility behaviors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
src/hist/chunked.py Implements the ChunkedHist data structure, fill/materialize/select/merge logic, and helper utilities for chunk-key normalization and dense-view accumulation.
tests/test_chunked.py Adds a comprehensive test suite for the new ChunkedHist API and expected behaviors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/hist/chunked.py
Comment on lines +286 to +291
np.ascontiguousarray(chunk_view),
shape=self.dense_view_shape,
dtype=self.dense_view_dtype,
)
if not array.flags.writeable:
array = array.copy()
Comment thread src/hist/chunked.py
Comment on lines +112 to +116
values = (raw_value,) if isinstance(raw_value, str | int) else tuple(raw_value)
if not values:
raise ValueError(f"slice for axis {axis_name!r} must be non-empty")
normalized[axis_name] = tuple(
_normalize_chunk_scalar(value) for value in values
Comment thread src/hist/chunked.py
Comment on lines +381 to +385
axes = list(self.axes)
keys_by_axis = self._keys_by_axis(self._chunks)
for spec in self.chunk_axes:
keys = keys_by_axis[spec.name]
if issubclass(spec.axis_type, bh.axis.IntCategory):
Comment thread src/hist/chunked.py
Comment on lines +548 to +549
if not matches:
raise ValueError(f"No matches found for {pattern!r}")
Comment thread src/hist/chunked.py
Comment on lines +594 to +600
axes_repr = ",\n ".join(
# ChunkedHist categorical axes are always growable in practice
f"{type(axis).__name__}(..., growth=True, name={axis.name!r})"
if isinstance(axis, bh.axis.IntCategory | bh.axis.StrCategory)
else repr(axis)
for axis in self.axes
)
Comment thread tests/test_chunked.py
Comment on lines +266 to +267


@henryiii henryiii requested a review from pfackeldey May 7, 2026 02:49
Copy link
Copy Markdown

@pfackeldey pfackeldey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

There are a few small things, I'm not sure what your preference are here @henryiii, e.g., allowing wildcard matching of categories in getitem, or the fix in _save_chunk_view (which sounds like something that should be fixed...).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] ChunkedHist

3 participants