
Migrate to polars #214

Open

d33bs wants to merge 4 commits into cytomining:main from d33bs:migrate-to-polars

d33bs commented Jun 19, 2026
Member

Description

What kind of change(s) are included?

  • Documentation (changes docs or other related content)
  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (these changes would cause existing functionality to not work as expected).

Checklist

Please ensure that all boxes are checked before indicating that this pull request is ready for review.

  • I have read and followed the CONTRIBUTING.md guidelines.
  • I have searched for existing content to ensure this is not a duplicate.
  • I have performed a self-review of these additions (including spelling, grammar, and related).
  • These changes pass all pre-commit checks.
  • I have added comments to my code to help provide understanding
  • I have added a test which covers the code changes found within this PR
  • I have deleted all non-relevant text in this pull request template.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added seamless Polars/Arrow interoperability, including conversion methods across eager and lazy forms.
    • Introduced lazy query support for filtering, selecting features, grouping, joining, and collecting results back into the main data frame type.
    • Added an Arrow-native schema system for deterministic column classification, validation, and struct shaping helpers.
    • Enabled lazy Parquet scanning for pipeline-friendly workflows.
  • Documentation
    • Expanded README with interoperability details and clearer optional installation extras.
  • Tests
    • Added coverage for engine conversions, lazy execution, schema inference/validation, and Parquet pipelines.

coderabbitai Bot commented Jun 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b4172ab5-1607-4496-8d72-f4dc23945e8a

📥 Commits

Reviewing files that changed from the base of the PR and between 16e3977 and adfdbd1.

📒 Files selected for processing (1)
  • README.md
✅ Files skipped from review due to trivial changes (1)
  • README.md

📝 Walkthrough


Adds a multi-backend data engine (engine.py) normalizing pandas, polars, and pyarrow inputs; a CytoLazyFrame lazy query wrapper carrying image/display context through Polars operations; and a CytoSchema column classifier with Arrow struct helpers. CytoDataFrame is extended with new constructor branches and public interchange/lazy APIs. Packaging, docs, and test coverage are updated throughout.
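As an illustration of the runtime type predicates the walkthrough mentions, the following is a minimal sketch, not the code in engine.py. It assumes that an object can only be a pandas, Polars, or Arrow frame if its library has already been imported, so it looks the module up in `sys.modules` instead of importing it. The helpers `_loaded`, `is_pandas_dataframe`, and `detect_backend` are hypothetical; `is_polars_dataframe` and `is_arrow_table` borrow names listed in the walkthrough, but their bodies here are assumptions.

```python
import sys
from typing import Any


def _loaded(module_name: str):
    """Return the module only if something has already imported it (hypothetical helper)."""
    return sys.modules.get(module_name)


def is_pandas_dataframe(data: Any) -> bool:
    # Hypothetical predicate: never triggers a pandas import by itself.
    pd = _loaded("pandas")
    return pd is not None and isinstance(data, pd.DataFrame)


def is_polars_dataframe(data: Any) -> bool:
    # If polars was never imported, `data` cannot be a polars.DataFrame.
    pl = _loaded("polars")
    return pl is not None and isinstance(data, pl.DataFrame)


def is_arrow_table(data: Any) -> bool:
    pa = _loaded("pyarrow")
    return pa is not None and isinstance(data, pa.Table)


def detect_backend(data: Any) -> str:
    """Name the backend of `data`, or 'unsupported' (hypothetical helper)."""
    if is_pandas_dataframe(data):
        return "pandas"
    if is_polars_dataframe(data):
        return "polars"
    if is_arrow_table(data):
        return "arrow"
    return "unsupported"
```

Checking `sys.modules` keeps the predicates cheap and avoids importing an optional dependency just to answer "no".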

Changes

Polars/Arrow Backend and Lazy Query Layer

| Layer / File(s) | Summary |
| --- | --- |
| Engine abstraction: TabularData type and conversions (`src/cytodataframe/engine.py`) | Defines TabularData union, lazy optional imports, runtime type predicates (is_polars_dataframe, is_arrow_table, etc.), and all format-conversion functions (to_pandas, to_polars, to_lazyframe, to_arrow, normalize_to_pandas, scan_parquet, read_parquet). |
| CytoSchema: column classification and Arrow struct helpers (`src/cytodataframe/schema.py`) | Adds CytoSchema dataclass with regex-based column bucketing into metadata/feature/geometry/image lists, dispatch inference from pandas/polars/arrow, validate/require/to_dict APIs, and add_bbox_struct/add_centroid_struct nested Arrow struct helpers. |
| CytoLazyFrame: lazy query wrapper with context carry-through (`src/cytodataframe/lazy.py`) | Adds CytoLazyGroupBy and CytoLazyFrame wrapping polars.LazyFrame, forwarding lazy ops (filter, select, join, group_by, select_features) while preserving image/display context, and materializing via collect/to_polars/to_arrow/to_pandas. Adds build_context, scan_parquet, and from_sequence_context helpers. |
| CytoDataFrame constructor normalization and interchange API (`src/cytodataframe/frame.py`) | Extends `__init__` to detect and normalize polars/arrow inputs via engine.normalize_to_pandas; adds public to_pandas, to_polars, to_lazy, to_arrow, cyto_schema, from_file, and scan_parquet methods. |
| Package wiring and dependency restructuring (`src/cytodataframe/__init__.py`, `pyproject.toml`, `.pre-commit-config.yaml`) | Exports CytoLazyFrame, CytoSchema, and engine via `__all__`; moves heavy deps to optional extras, adds polars as a base dependency, expands dev group with hypothesis/pyvista/trame, and bumps pyproject-fmt and ruff-pre-commit versions. |
| Tests and documentation (`tests/test_engine.py`, `tests/test_lazy.py`, `tests/test_schema.py`, `README.md`) | Adds engine conversion/round-trip tests, lazy pipeline tests (filter, select_features, group_by, join, context carry-through, parquet scan), schema inference/validation/Hypothesis partitioning/struct-helper tests, and README sections on Polars/Arrow interop and optional extras. |
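The regex-based column bucketing attributed to CytoSchema could look roughly like the sketch below. This is an assumption-laden toy, not the class in schema.py: the name `ToySchema` and every pattern in `_PATTERNS` are hypothetical. Only the `Metadata_` prefix is suggested by the `Metadata_Well` column that appears elsewhere in this PR. The point is that each column lands in exactly one bucket, in a fixed order, which is what makes the classification deterministic.

```python
import re
from dataclasses import asdict, dataclass, field
from typing import Dict, Iterable, List

# Hypothetical patterns, checked in order; the first match wins.
_PATTERNS = (
    ("metadata", re.compile(r"^Metadata_")),
    ("image", re.compile(r"^Image_")),
    ("geometry", re.compile(r"BoundingBox|_Center_")),
)


@dataclass
class ToySchema:
    """Toy stand-in for a column classifier; not the real CytoSchema."""

    metadata: List[str] = field(default_factory=list)
    feature: List[str] = field(default_factory=list)
    geometry: List[str] = field(default_factory=list)
    image: List[str] = field(default_factory=list)

    @classmethod
    def infer(cls, columns: Iterable[str]) -> "ToySchema":
        schema = cls()
        for name in columns:
            for bucket, pattern in _PATTERNS:
                if pattern.search(name):
                    getattr(schema, bucket).append(name)
                    break
            else:
                # Anything unmatched is treated as a measured feature.
                schema.feature.append(name)
        return schema

    def to_dict(self) -> Dict[str, List[str]]:
        return asdict(self)


example = ToySchema.infer(
    [
        "Metadata_Well",
        "Image_FileName_DNA",
        "Cells_AreaShape_BoundingBoxMinimum_X",
        "Cells_AreaShape_Area",
    ]
)
```

Because only column names are inspected, the same inference could run against pandas, Polars, or Arrow inputs without touching the data.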

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant CytoDataFrame
    participant engine
    participant CytoLazyFrame
    participant PolarsLazyFrame
    participant CytoSchema

    Caller->>CytoDataFrame: CytoDataFrame(polars_df / arrow_table)
    CytoDataFrame->>engine: normalize_to_pandas(data)
    engine-->>CytoDataFrame: pandas.DataFrame

    Caller->>CytoDataFrame: to_lazy()
    CytoDataFrame->>CytoLazyFrame: __init__(data, context=build_context(_custom_attrs))
    CytoLazyFrame->>engine: to_lazyframe(data)
    engine-->>CytoLazyFrame: polars.LazyFrame

    Caller->>CytoLazyFrame: filter(...).select_features()
    CytoLazyFrame->>CytoSchema: infer(lazyframe) → feature/metadata columns
    CytoSchema-->>CytoLazyFrame: column buckets
    CytoLazyFrame->>PolarsLazyFrame: .filter(...).select(cols)
    PolarsLazyFrame-->>CytoLazyFrame: new LazyFrame

    Caller->>CytoLazyFrame: collect()
    CytoLazyFrame->>PolarsLazyFrame: .collect()
    PolarsLazyFrame-->>CytoLazyFrame: polars.DataFrame
    CytoLazyFrame->>CytoDataFrame: CytoDataFrame(result, **preserved_context)
    CytoDataFrame-->>Caller: CytoDataFrame
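
The flow in the diagram (record operations lazily, carry context alongside, materialize on collect) can be sketched with the standard library alone. This is a toy under stated assumptions, not CytoLazyFrame: `ToyLazyFrame`, the list-of-dicts row format, and the `image_dir` context key are all hypothetical, and a real implementation would delegate the operations to polars.LazyFrame.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Step = Callable[[List[Row]], List[Row]]


@dataclass(frozen=True)
class ToyLazyFrame:
    """Toy lazy wrapper: queues steps and carries context until collect()."""

    rows: List[Row]
    context: Dict[str, Any] = field(default_factory=dict)
    steps: Tuple[Step, ...] = ()

    def _with_step(self, step: Step) -> "ToyLazyFrame":
        # Every operation returns a new wrapper that keeps the same context.
        return ToyLazyFrame(self.rows, self.context, self.steps + (step,))

    def filter(self, predicate: Callable[[Row], bool]) -> "ToyLazyFrame":
        return self._with_step(lambda rows: [r for r in rows if predicate(r)])

    def select(self, *columns: str) -> "ToyLazyFrame":
        return self._with_step(
            lambda rows: [{c: r[c] for c in columns} for r in rows]
        )

    def collect(self) -> Tuple[List[Row], Dict[str, Any]]:
        # Nothing runs until here; the context comes back with the result.
        rows = self.rows
        for step in self.steps:
            rows = step(rows)
        return rows, dict(self.context)


lazy = ToyLazyFrame(
    rows=[
        {"Metadata_Well": "A01", "Cells_Area": 10.0},
        {"Metadata_Well": "B02", "Cells_Area": 12.5},
    ],
    context={"image_dir": "images/"},  # hypothetical context key
)
result, context = (
    lazy.filter(lambda r: r["Metadata_Well"] == "A01").select("Cells_Area").collect()
)
```

Keeping the wrapper immutable means the original frame and its context are untouched by any chain built from it.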

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • jenna-tomkinson

Poem

🐇 Hop, hop through frames so lazy and lean,
Arrow and Polars make columns gleam!
CytoSchema sorts each column with care,
Metadata here, features over there.
Collect it all — the rabbit's dream! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 67.29% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Migrate to polars' directly describes the primary objective of the pull request, which involves migrating the CytoDataFrame codebase to use Polars as the execution engine. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


d33bs commented Jun 19, 2026
Member Author

@coderabbitai review

coderabbitai Bot commented Jun 19, 2026
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@README.md`:
- Line 28: The "Polars and Arrow interoperability" section heading uses three
hashes (###) which violates markdown heading hierarchy and triggers the MD001
rule. Change the heading from ### Polars and Arrow interoperability to ## Polars
and Arrow interoperability to maintain proper heading structure and compliance
with markdown linting rules.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bbce1cd8-fbc7-4b71-9822-e59555fe6b2b

📥 Commits

Reviewing files that changed from the base of the PR and between e8ad57b and 16e3977.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • .pre-commit-config.yaml
  • README.md
  • pyproject.toml
  • src/cytodataframe/__init__.py
  • src/cytodataframe/engine.py
  • src/cytodataframe/frame.py
  • src/cytodataframe/lazy.py
  • src/cytodataframe/schema.py
  • tests/test_engine.py
  • tests/test_lazy.py
  • tests/test_schema.py

Comment thread README.md Outdated
d33bs commented Jun 19, 2026
Member Author

@coderabbitai review

coderabbitai Bot commented Jun 19, 2026
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

d33bs marked this pull request as ready for review June 21, 2026 03:06
d33bs requested a review from jenna-tomkinson as a code owner June 21, 2026 03:06
d33bs requested a review from gwaybio June 21, 2026 03:13

gwaybio left a comment

Member

Very excited to see this development! I made several comments, mostly regarding naming and understandability. Happy Father's Day!

return isinstance(data, pa.Table)


def is_supported(data: Any) -> bool:

Member


Consider renaming to is_dataframe_engine_supported (or something like that), so the name specifies exactly what this is testing.

if isinstance(data, pd.Series):
    data = data.to_frame()
if isinstance(data, pd.DataFrame):
    # Strip any pandas subclass (e.g. CytoDataFrame) and index before handing

Member


is this comment accurate? I don't see any manipulations below, just an exception?

Convert any supported tabular input to a :class:`pyarrow.Table`.

Arrow is the canonical schema/serialization contract, so this is the
conversion used whenever schema or interchange guarantees matter.

Member


"whenever schema or interchange guarantees matter" is terse. Consider defining the technical terms or using simpler, more direct language.

"""
Lazily scan a Parquet file/dataset into a :class:`polars.LazyFrame`.

This enables predicate/projection pushdown for large profiling datasets

Member


can you elaborate slightly on what "predicate/projection pushdown" is?
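
For readers following this thread, the two terms have a generally accepted meaning: predicate pushdown applies a filter while the data is being scanned, so non-matching rows are never materialized, and projection pushdown reads only the columns a query needs. The toy below illustrates the idea with the standard library and a hypothetical in-memory column store. It is not how Polars or Parquet implement it, and `ColumnStore` and `scan` are made-up names for illustration.

```python
from typing import Any, Callable, Dict, List, Sequence

# Hypothetical columnar layout: one list of values per column name.
ColumnStore = Dict[str, List[Any]]


def scan(
    store: ColumnStore,
    columns: Sequence[str],
    predicate_column: str,
    predicate: Callable[[Any], bool],
) -> List[Dict[str, Any]]:
    """Return only `columns`, only for rows whose predicate column matches."""
    # Predicate pushdown: decide which row positions survive by reading
    # just the column the filter refers to.
    keep = [i for i, value in enumerate(store[predicate_column]) if predicate(value)]
    # Projection pushdown: touch only the requested columns, and only at
    # the surviving positions. Other columns are never read.
    return [{name: store[name][i] for name in columns} for i in keep]


store: ColumnStore = {
    "Metadata_Well": ["A01", "B02", "A01"],
    "Cells_Area": [10.0, 12.5, 11.0],
    "Nuclei_Area": [3.0, 4.0, 3.5],
}
rows = scan(store, ["Cells_Area"], "Metadata_Well", lambda well: well == "A01")
```

In this toy, `Nuclei_Area` is never accessed and the `B02` row is never built, which is the saving that pushdown aims for on large datasets.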

Comment thread src/cytodataframe/lazy.py
.collect()
)

It is intentionally a *separate* type from ``CytoDataFrame`` so that its

Member


thanks for this comment!

Comment thread README.md
# Construct from pandas, polars (DataFrame or LazyFrame), or a pyarrow Table.
cdf = CytoDataFrame("profiles.parquet")

# Convert out to any representation (Pandas stays a boundary layer).

Member


because this is in the main readme, consider defining what you mean by "boundary layer"

Comment thread README.md
cdf.to_lazy() # CytoLazyFrame (lazy, Polars-backed)

# Inspect the inferred schema (metadata / feature / geometry roles).
cdf.cyto_schema

Member


consider previewing what this output looks like

Comment thread README.md
# Inspect the inferred schema (metadata / feature / geometry roles).
cdf.cyto_schema

# Lazily scan large Parquet datasets with predicate/projection pushdown.

Member


this could be a new code block

Comment thread README.md
CytoDataFrame.scan_parquet("profiles.parquet")
.filter(pl.col("Metadata_Well") == "A01")
.select_features()
.collect() # -> CytoDataFrame

Member


this comment could probably be clarified - what do you mean?

Comment thread README.md

```shell
# interactive 3D volume rendering (trame / pyvista)
pip install "cytodataframe[viz3d]"
```

Member


This could be slightly annoying to someone who doesn't understand the full cytodataframe scope. They could, for example, think of cytodataframe as a Jupyter notebook visualization engine, then learn that their install didn't include the functions for this. Is there a way for us to direct someone who makes this mistake? Perhaps we could raise an error or warning, with instructions on how to obtain this functionality, when they use the core cytodataframe in a way that actually requires the optional dependencies. (Perhaps this is out of scope for this PR.)
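
One possible shape for the error message suggested in this comment is sketched here. It is only an assumption about how such a guard might be written: `require_optional` is a hypothetical helper, not part of cytodataframe. The install hint reuses the `pip install "cytodataframe[viz3d]"` form shown in the README excerpt.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or explain how to install it (hypothetical helper)."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with an actionable message that names the extra to install.
        raise ImportError(
            f"'{module_name}' is required for this feature but is not installed. "
            f'Install it with: pip install "cytodataframe[{extra}]"'
        ) from exc
```

A feature that needs an optional package could call the helper at the point of use, for example `pyvista = require_optional("pyvista", "viz3d")`, so that the core install stays light and the failure message tells the user what to do next.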
