Migrate to polars by d33bs · Pull Request #214 · cytomining/CytoDataFrame

d33bs · 2026-06-19T16:57:52Z

Description

What kind of change(s) are included?

Documentation (changes docs or other related content)
Bug fix (fixes an issue).
Enhancement (adds functionality).
Breaking change (these changes would cause existing functionality to not work as expected).

Checklist

Please ensure that all boxes are checked before indicating that this pull request is ready for review.

I have read and followed the CONTRIBUTING.md guidelines.
I have searched for existing content to ensure this is not a duplicate.
I have performed a self-review of these additions (including spelling, grammar, and related).
These changes pass all pre-commit checks.
I have added comments to my code to help provide understanding
I have added a test which covers the code changes found within this PR
I have deleted all non-relevant text in this pull request template.

Summary by CodeRabbit

Release Notes

New Features
- Added seamless Polars/Arrow interoperability, including conversion methods across eager and lazy forms.
- Introduced lazy query support for filtering, selecting features, grouping, joining, and collecting results back into the main data frame type.
- Added an Arrow-native schema system for deterministic column classification, validation, and struct shaping helpers.
- Enabled lazy Parquet scanning for pipeline-friendly workflows.
Documentation
- Expanded README with interoperability details and clearer optional installation extras.
Tests
- Added coverage for engine conversions, lazy execution, schema inference/validation, and Parquet pipelines.

coderabbitai · 2026-06-19T16:58:01Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b4172ab5-1607-4496-8d72-f4dc23945e8a

📥 Commits

Reviewing files that changed from the base of the PR and between 16e3977 and adfdbd1.

📒 Files selected for processing (1)

README.md

✅ Files skipped from review due to trivial changes (1)

README.md

📝 Walkthrough

Walkthrough

Adds a multi-backend data engine (engine.py) normalizing pandas, polars, and pyarrow inputs; a CytoLazyFrame lazy query wrapper carrying image/display context through Polars operations; and a CytoSchema column classifier with Arrow struct helpers. CytoDataFrame is extended with new constructor branches and public interchange/lazy APIs. Packaging, docs, and test coverage are updated throughout.

Changes

Polars/Arrow Backend and Lazy Query Layer

| Layer / File(s) | Summary |
| --- | --- |
| Engine abstraction: TabularData type and conversions `src/cytodataframe/engine.py` | Defines `TabularData` union, lazy optional imports, runtime type predicates (`is_polars_dataframe`, `is_arrow_table`, etc.), and all format-conversion functions (`to_pandas`, `to_polars`, `to_lazyframe`, `to_arrow`, `normalize_to_pandas`, `scan_parquet`, `read_parquet`). |
| CytoSchema: column classification and Arrow struct helpers `src/cytodataframe/schema.py` | Adds `CytoSchema` dataclass with regex-based column bucketing into metadata/feature/geometry/image lists, dispatch inference from pandas/polars/arrow, `validate`/`require`/`to_dict` APIs, and `add_bbox_struct`/`add_centroid_struct` nested Arrow struct helpers. |
| CytoLazyFrame: lazy query wrapper with context carry-through `src/cytodataframe/lazy.py` | Adds `CytoLazyGroupBy` and `CytoLazyFrame` wrapping `polars.LazyFrame`, forwarding lazy ops (filter, select, join, group_by, select_features) while preserving image/display context, and materializing via `collect`/`to_polars`/`to_arrow`/`to_pandas`. Adds `build_context`, `scan_parquet`, and `from_sequence_context` helpers. |
| CytoDataFrame constructor normalization and interchange API `src/cytodataframe/frame.py` | Extends `__init__` to detect and normalize polars/arrow inputs via `engine.normalize_to_pandas`; adds public `to_pandas`, `to_polars`, `to_lazy`, `to_arrow`, `cyto_schema`, `from_file`, and `scan_parquet` methods. |
| Package wiring and dependency restructuring `src/cytodataframe/__init__.py`, `pyproject.toml`, `.pre-commit-config.yaml` | Exports `CytoLazyFrame`, `CytoSchema`, and `engine` via `__all__`; moves heavy deps to optional extras, adds `polars` as a base dependency, expands dev group with `hypothesis`/`pyvista`/`trame`, and bumps `pyproject-fmt` and `ruff-pre-commit` versions. |
| Tests and documentation `tests/test_engine.py`, `tests/test_lazy.py`, `tests/test_schema.py`, `README.md` | Adds engine conversion/round-trip tests, lazy pipeline tests (filter, select_features, group_by, join, context carry-through, parquet scan), schema inference/validation/Hypothesis partitioning/struct-helper tests, and README sections on Polars/Arrow interop and optional extras. |
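
To make "regex-based column bucketing" concrete, here is a minimal sketch of the idea using only the standard library. It is not the `CytoSchema` implementation: the patterns, the `ColumnBuckets` dataclass, and the `bucket_columns` function are assumptions made for illustration, since the PR summary does not list the actual regexes or field names.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns: the PR summary does not state the real regexes.
# The first matching role wins; anything unmatched is treated as a feature.
ROLE_PATTERNS = {
    "metadata": re.compile(r"^Metadata_"),
    "geometry": re.compile(r"(BoundingBox|Center_[XYZ])"),
    "image": re.compile(r"^Image_(FileName|PathName)_"),
}


@dataclass
class ColumnBuckets:
    """Illustrative container with one list of column names per role."""

    metadata: list[str] = field(default_factory=list)
    feature: list[str] = field(default_factory=list)
    geometry: list[str] = field(default_factory=list)
    image: list[str] = field(default_factory=list)


def bucket_columns(columns):
    """Assign each column name to exactly one role, preserving input order."""
    buckets = ColumnBuckets()
    for name in columns:
        for role, pattern in ROLE_PATTERNS.items():
            if pattern.search(name):
                getattr(buckets, role).append(name)
                break
        else:
            # No role pattern matched, so fall back to the feature bucket.
            buckets.feature.append(name)
    return buckets
```

Because every column lands in exactly one bucket and the patterns are checked in a fixed order, the classification is deterministic for a given list of column names.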

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant CytoDataFrame
    participant engine
    participant CytoLazyFrame
    participant PolarsLazyFrame
    participant CytoSchema

    Caller->>CytoDataFrame: CytoDataFrame(polars_df / arrow_table)
    CytoDataFrame->>engine: normalize_to_pandas(data)
    engine-->>CytoDataFrame: pandas.DataFrame

    Caller->>CytoDataFrame: to_lazy()
    CytoDataFrame->>CytoLazyFrame: __init__(data, context=build_context(_custom_attrs))
    CytoLazyFrame->>engine: to_lazyframe(data)
    engine-->>CytoLazyFrame: polars.LazyFrame

    Caller->>CytoLazyFrame: filter(...).select_features()
    CytoLazyFrame->>CytoSchema: infer(lazyframe) → feature/metadata columns
    CytoSchema-->>CytoLazyFrame: column buckets
    CytoLazyFrame->>PolarsLazyFrame: .filter(...).select(cols)
    PolarsLazyFrame-->>CytoLazyFrame: new LazyFrame

    Caller->>CytoLazyFrame: collect()
    CytoLazyFrame->>PolarsLazyFrame: .collect()
    PolarsLazyFrame-->>CytoLazyFrame: polars.DataFrame
    CytoLazyFrame->>CytoDataFrame: CytoDataFrame(result, **preserved_context)
    CytoDataFrame-->>Caller: CytoDataFrame
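
The flow in the diagram (record operations lazily, keep the context alongside them, and hand both back on `collect`) can be sketched with a toy wrapper. This is a standard-library illustration of the pattern only, not the `CytoLazyFrame` code: `LazyRows`, its methods, and the `image_dir` context key are hypothetical names, and rows are plain dictionaries rather than a Polars frame.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Row = dict[str, Any]


@dataclass(frozen=True)
class LazyRows:
    """Toy stand-in for a lazy frame: records steps, runs them on collect()."""

    source: list[Row]
    context: dict[str, Any] = field(default_factory=dict)
    steps: tuple[Callable[[list[Row]], list[Row]], ...] = ()

    def _extend(self, step):
        # Each operation returns a new wrapper that reuses the same context.
        return LazyRows(self.source, self.context, (*self.steps, step))

    def filter(self, predicate):
        return self._extend(lambda rows: [r for r in rows if predicate(r)])

    def select(self, columns):
        return self._extend(
            lambda rows: [{c: r[c] for c in columns} for r in rows]
        )

    def collect(self):
        # Nothing runs until here; the context is returned with the result.
        rows = self.source
        for step in self.steps:
            rows = step(rows)
        return rows, self.context


lazy = LazyRows(
    [
        {"Metadata_Well": "A01", "Cells_AreaShape_Area": 120.0},
        {"Metadata_Well": "B02", "Cells_AreaShape_Area": 143.2},
    ],
    context={"image_dir": "images/"},  # hypothetical context key
)
query = lazy.filter(lambda r: r["Metadata_Well"] == "A01").select(
    ["Cells_AreaShape_Area"]
)
rows, context = query.collect()
```

Returning a new wrapper from each operation keeps the original object unchanged, which is the same immutability convention that `polars.LazyFrame` follows.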

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

jenna-tomkinson

Poem

🐇 Hop, hop through frames so lazy and lean,
Arrow and Polars make columns gleam!
CytoSchema sorts each column with care,
Metadata here, features over there.
Collect it all — the rabbit's dream! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 67.29% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Migrate to polars' directly describes the primary objective of the pull request, which involves migrating the CytoDataFrame codebase to use Polars as the execution engine. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

d33bs · 2026-06-19T17:07:33Z

@coderabbitai review

coderabbitai · 2026-06-19T17:07:40Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@README.md`:
- Line 28: The "Polars and Arrow interoperability" section heading uses three
hashes (###) which violates markdown heading hierarchy and triggers the MD001
rule. Change the heading from ### Polars and Arrow interoperability to ## Polars
and Arrow interoperability to maintain proper heading structure and compliance
with markdown linting rules.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bbce1cd8-fbc7-4b71-9822-e59555fe6b2b

📥 Commits

Reviewing files that changed from the base of the PR and between e8ad57b and 16e3977.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (11)

.pre-commit-config.yaml
README.md
pyproject.toml
src/cytodataframe/__init__.py
src/cytodataframe/engine.py
src/cytodataframe/frame.py
src/cytodataframe/lazy.py
src/cytodataframe/schema.py
tests/test_engine.py
tests/test_lazy.py
tests/test_schema.py

d33bs · 2026-06-19T20:20:11Z

@coderabbitai review

coderabbitai · 2026-06-19T20:20:18Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

gwaybio

Very excited to see this development! I made several comments, mostly regarding naming and understandability. Happy Father's Day!

gwaybio · 2026-06-21T13:53:21Z

+    return isinstance(data, pa.Table)
+
+
+def is_supported(data: Any) -> bool:


consider renaming to is_dataframe_engine_supported (or something like that) - specifying exactly what this is testing
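
For illustration, a predicate under the suggested name might look like the sketch below. It is an assumption-laden alternative, not the code in `engine.py`: instead of `isinstance` checks behind optional imports, it compares the package and class name of the input's type, so none of the three libraries has to be imported. The `SUPPORTED_TYPES` set is hypothetical and based only on the input types named in this PR.

```python
from typing import Any

# Hypothetical allow-list of (top-level package, class name) pairs.
SUPPORTED_TYPES = {
    ("pandas", "DataFrame"),
    ("pandas", "Series"),
    ("polars", "DataFrame"),
    ("polars", "LazyFrame"),
    ("pyarrow", "Table"),
}


def is_dataframe_engine_supported(data: Any) -> bool:
    """Return True if data is (a subclass of) a supported tabular type."""
    # Walk the MRO so subclasses of a supported type are also accepted.
    for cls in type(data).__mro__:
        top_level_package = cls.__module__.split(".")[0]
        if (top_level_package, cls.__name__) in SUPPORTED_TYPES:
            return True
    return False
```

The longer name states both the question being asked (is it supported?) and the subject (the dataframe engine behind the input), which is the clarity the comment asks for.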

gwaybio · 2026-06-21T13:58:14Z

+    if isinstance(data, pd.Series):
+        data = data.to_frame()
+    if isinstance(data, pd.DataFrame):
+        # Strip any pandas subclass (e.g. CytoDataFrame) and index before handing


is this comment accurate? I don't see any manipulations below, just an exception?

gwaybio · 2026-06-21T13:59:11Z

+    Convert any supported tabular input to a :class:`pyarrow.Table`.
+
+    Arrow is the canonical schema/serialization contract, so this is the
+    conversion used whenever schema or interchange guarantees matter.


"whenever schema or interchange guarantees matter" is terse - consider defining technical terms or using simpler more direct language

gwaybio · 2026-06-21T14:15:44Z

+    """
+    Lazily scan a Parquet file/dataset into a :class:`polars.LazyFrame`.
+
+    This enables predicate/projection pushdown for large profiling datasets


can you elaborate slightly on what "predicate/projection pushdown" is?

gwaybio · 2026-06-21T14:18:22Z

+        .collect()
+    )
+
+It is intentionally a *separate* type from ``CytoDataFrame`` so that its


thanks for this comment!

gwaybio · 2026-06-21T14:46:43Z

+# Construct from pandas, polars (DataFrame or LazyFrame), or a pyarrow Table.
+cdf = CytoDataFrame("profiles.parquet")
+
+# Convert out to any representation (Pandas stays a boundary layer).


because this is in the main readme, consider defining what you mean by "boundary layer"

gwaybio · 2026-06-21T14:47:00Z

+cdf.to_lazy()     # CytoLazyFrame (lazy, Polars-backed)
+
+# Inspect the inferred schema (metadata / feature / geometry roles).
+cdf.cyto_schema


consider previewing what this output looks like

gwaybio · 2026-06-21T14:47:12Z

+# Inspect the inferred schema (metadata / feature / geometry roles).
+cdf.cyto_schema
+
+# Lazily scan large Parquet datasets with predicate/projection pushdown.


this could be a new code block

gwaybio · 2026-06-21T14:47:34Z

+    CytoDataFrame.scan_parquet("profiles.parquet")
+    .filter(pl.col("Metadata_Well") == "A01")
+    .select_features()
+    .collect()  # -> CytoDataFrame


this comment could probably be clarified - what do you mean?

gwaybio · 2026-06-21T14:49:48Z

+
+```shell
+# interactive 3D volume rendering (trame / pyvista)
+pip install "cytodataframe[viz3d]"


This could be slightly annoying to someone who doesn't understand the full cytodataframe scope. They could, for example, think of cytodataframe as a Jupyter notebook visualization engine and then learn that their install didn't include the functions for this. Is there a way for us to direct someone who makes this mistake? Perhaps we could raise an error or warning, with instructions on how to obtain the functionality, when someone uses the core cytodataframe in a way that requires the optional dependencies? (Perhaps out of scope for this PR.)
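
One way to provide that guidance is a small import helper that turns a missing optional package into an error naming the extra to install. This is a sketch of the general pattern only: `require_optional` is a hypothetical helper, not something in this PR, and the `viz3d` extra name is taken from the README snippet above.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency or explain how to install it.

    Hypothetical helper: the name and message format are assumptions.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as error:
        raise ImportError(
            f"'{module_name}' is required for this feature but is not "
            f'installed. Install it with: pip install "cytodataframe[{extra}]"'
        ) from error
```

A 3D rendering function could then start with `pyvista = require_optional("pyvista", "viz3d")`, so users on the core install see the install command at the point where they first need the feature.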

migrate to polars

1fabd1e

[pre-commit.ci lite] apply automatic fixes

16e3977

coderabbitai Bot reviewed Jun 19, 2026

Comment thread README.md Outdated

d33bs added 2 commits June 19, 2026 12:09

address coderabbit review

f7a15d6

Merge branch 'migrate-to-polars' of https://github.com/d33bs/CytoData…

adfdbd1

…Frame into migrate-to-polars

d33bs marked this pull request as ready for review June 21, 2026 03:06

d33bs requested a review from jenna-tomkinson as a code owner June 21, 2026 03:06

d33bs requested a review from gwaybio June 21, 2026 03:13

gwaybio approved these changes Jun 21, 2026
