Migrate to polars#214
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughAdds a multi-backend data engine ( ChangesPolars/Arrow Backend and Lazy Query Layer
Sequence Diagram(s)sequenceDiagram
participant Caller
participant CytoDataFrame
participant engine
participant CytoLazyFrame
participant PolarsLazyFrame
participant CytoSchema
Caller->>CytoDataFrame: CytoDataFrame(polars_df / arrow_table)
CytoDataFrame->>engine: normalize_to_pandas(data)
engine-->>CytoDataFrame: pandas.DataFrame
Caller->>CytoDataFrame: to_lazy()
CytoDataFrame->>CytoLazyFrame: __init__(data, context=build_context(_custom_attrs))
CytoLazyFrame->>engine: to_lazyframe(data)
engine-->>CytoLazyFrame: polars.LazyFrame
Caller->>CytoLazyFrame: filter(...).select_features()
CytoLazyFrame->>CytoSchema: infer(lazyframe) → feature/metadata columns
CytoSchema-->>CytoLazyFrame: column buckets
CytoLazyFrame->>PolarsLazyFrame: .filter(...).select(cols)
PolarsLazyFrame-->>CytoLazyFrame: new LazyFrame
Caller->>CytoLazyFrame: collect()
CytoLazyFrame->>PolarsLazyFrame: .collect()
PolarsLazyFrame-->>CytoLazyFrame: polars.DataFrame
CytoLazyFrame->>CytoDataFrame: CytoDataFrame(result, **preserved_context)
CytoDataFrame-->>Caller: CytoDataFrame
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@README.md`:
- Line 28: The "Polars and Arrow interoperability" section heading uses three
hashes (###) which violates markdown heading hierarchy and triggers the MD001
rule. Change the heading from ### Polars and Arrow interoperability to ## Polars
and Arrow interoperability to maintain proper heading structure and compliance
with markdown linting rules.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bbce1cd8-fbc7-4b71-9822-e59555fe6b2b
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (11)
.pre-commit-config.yamlREADME.mdpyproject.tomlsrc/cytodataframe/__init__.pysrc/cytodataframe/engine.pysrc/cytodataframe/frame.pysrc/cytodataframe/lazy.pysrc/cytodataframe/schema.pytests/test_engine.pytests/test_lazy.pytests/test_schema.py
|
@coderabbitai review |
✅ Action performedReview finished.
|
gwaybio
left a comment
There was a problem hiding this comment.
Very excited to see this development! I made several comments, mostly regarding naming and understandability. Happy Fathers Day!
| return isinstance(data, pa.Table) | ||
|
|
||
|
|
||
| def is_supported(data: Any) -> bool: |
There was a problem hiding this comment.
consider renaming to is_dataframe_engine_supported (or something like that) - specifying exactly what this is testing
| if isinstance(data, pd.Series): | ||
| data = data.to_frame() | ||
| if isinstance(data, pd.DataFrame): | ||
| # Strip any pandas subclass (e.g. CytoDataFrame) and index before handing |
There was a problem hiding this comment.
is this comment accurate? I don't see any manipulations below, just an exception?
| Convert any supported tabular input to a :class:`pyarrow.Table`. | ||
|
|
||
| Arrow is the canonical schema/serialization contract, so this is the | ||
| conversion used whenever schema or interchange guarantees matter. |
There was a problem hiding this comment.
"whenever schema or interchange guarantees matter" is terse - consider defining technical terms or using simpler more direct language
| """ | ||
| Lazily scan a Parquet file/dataset into a :class:`polars.LazyFrame`. | ||
|
|
||
| This enables predicate/projection pushdown for large profiling datasets |
There was a problem hiding this comment.
can you elaborate slightly on what "predicate/projection pushdown" is?
| .collect() | ||
| ) | ||
|
|
||
| It is intentionally a *separate* type from ``CytoDataFrame`` so that its |
| # Construct from pandas, polars (DataFrame or LazyFrame), or a pyarrow Table. | ||
| cdf = CytoDataFrame("profiles.parquet") | ||
|
|
||
| # Convert out to any representation (Pandas stays a boundary layer). |
There was a problem hiding this comment.
because this is in the main readme, consider defining what you mean by "boundary layer"
| cdf.to_lazy() # CytoLazyFrame (lazy, Polars-backed) | ||
|
|
||
| # Inspect the inferred schema (metadata / feature / geometry roles). | ||
| cdf.cyto_schema |
There was a problem hiding this comment.
consider previewing what this output looks like
| # Inspect the inferred schema (metadata / feature / geometry roles). | ||
| cdf.cyto_schema | ||
|
|
||
| # Lazily scan large Parquet datasets with predicate/projection pushdown. |
| CytoDataFrame.scan_parquet("profiles.parquet") | ||
| .filter(pl.col("Metadata_Well") == "A01") | ||
| .select_features() | ||
| .collect() # -> CytoDataFrame |
There was a problem hiding this comment.
this comment could probably be clarified - what do you mean?
|
|
||
| ```shell | ||
| # interactive 3D volume rendering (trame / pyvista) | ||
| pip install "cytodataframe[viz3d]" |
There was a problem hiding this comment.
this could be slightly annoying to someone who doesn't understand the full cytodataframe scope - they could, for example, think cytodataframe as a jupyter notebook visualization engine but then learn their install didn't include the functions for this - is there a way for us to direct someone who makes this mistake? Perhaps by including an error message or warning and instructions on how to obtain this functionality if they are using the core cytodataframe in a way that actually requires the specific additional optional dependences? (perhaps scope outside this pr)
Description
What kind of change(s) are included?
Checklist
Please ensure that all boxes are checked before indicating that this pull request is ready for review.
Summary by CodeRabbit
Release Notes