Increase OME-Arrow and iceberg performance and scaling #465
📝 Walkthrough
This PR optimizes Iceberg export in CytoTable by forwarding drop_null parameter configuration through the workflow, streaming profiles in batches, deduplicating images using Arrow compute instead of pandas, and vectorizing image export preparation to reduce iteration overhead.
Changes: Iceberg Export Optimization and drop_null Parameter Support
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 5 passed
@coderabbitai review
✅ Action performed: Review finished.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cytotable/warehouse/iceberg.py`:
- Around line 726-748: The loop-only table creation skips registering an empty
Iceberg table when the staged parquet has 0 rows; detect the empty-parquet case
(e.g. inspect pq_file.metadata.num_rows or check if pq_file.iter_batches yields
no batches) and if no rows were produced and profiles_iceberg_table is still
None, create the empty Iceberg table explicitly by calling
bundle.create_table((default_namespace, profiles_table_name), schema,
properties=_cytotable_iceberg_properties()) where schema is taken from the
parquet file (e.g. pq_file.schema_arrow or an equivalent Arrow schema derived
from the Parquet metadata) so that profiles_iceberg_table is set and
profiles_table_exists becomes True even for zero-row exports.
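The iceberg.py finding above can be sketched without any Iceberg or Parquet dependency. This is a minimal illustration, not CytoTable's implementation: `export_batches`, `create_table`, and `append` are hypothetical stand-ins for the real loop, for `bundle.create_table(...)`, and for the batch write. In the real code the schema would presumably come from the staged Parquet file (for example `pq_file.schema_arrow`, as the comment suggests).

```python
from typing import Any, Callable, Iterable, Optional


def export_batches(
    batches: Iterable[Any],
    schema: Any,
    create_table: Callable[[Any], Any],
    append: Callable[[Any, Any], None],
) -> Any:
    """Create the table lazily on the first batch, with a zero-row fallback.

    All parameters are hypothetical stand-ins used for illustration only.
    """
    table: Optional[Any] = None
    for batch in batches:
        # Loop-only creation: this branch never runs for an empty input.
        if table is None:
            table = create_table(schema)
        append(table, batch)

    # Fallback for zero-row exports: the loop body never executed, so
    # register an empty table explicitly from the known schema.
    if table is None:
        table = create_table(schema)
    return table
```

With this shape the table is created exactly once whether the input yields zero batches or many, so a downstream "table exists" flag can be set in both cases.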
In `@cytotable/warehouse/images.py`:
- Around line 1052-1063: The current logic only generates object IDs when the
Metadata_ObjectID column is missing, leaving existing nulls untouched and
causing merge failures; update the code around the handling of frame and
Metadata_ObjectID so that you compute and fill missing/null Metadata_ObjectID
values (not just when the column is absent) by applying _build_stable_object_id
(using _extract_key_fields and _validated_bbox_values with bbox_columns) only
for rows where frame["Metadata_ObjectID"].isna() (or similar), and then assign
those generated IDs back into frame["Metadata_ObjectID"]; make the same change
at the other occurrence that currently checks column presence (the blocks
referencing Metadata_ObjectID generation and the merge key usage) so no null IDs
remain before the merge.
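The images.py finding above amounts to filling IDs per row rather than per column. The sketch below uses plain dictionaries in place of the DataFrame so that it stays dependency-free. `stable_object_id` is a hypothetical stand-in for `_build_stable_object_id`, whose real signature and hashing scheme are not shown in this review, and the key and bounding-box column names passed in are assumptions.

```python
import hashlib
import math
from typing import Any, Dict, List, Sequence


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing, similar in spirit to isna()."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def stable_object_id(
    row: Dict[str, Any], key_fields: Sequence[str], bbox_columns: Sequence[str]
) -> str:
    """Hypothetical stand-in: derive a deterministic ID from key and bbox values."""
    parts = [str(row.get(name)) for name in (*key_fields, *bbox_columns)]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:16]


def fill_missing_object_ids(
    rows: List[Dict[str, Any]],
    key_fields: Sequence[str],
    bbox_columns: Sequence[str],
    id_column: str = "Metadata_ObjectID",
) -> List[Dict[str, Any]]:
    """Return copies of rows where absent or null IDs are replaced by generated ones."""
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the caller's rows untouched
        # Covers both cases: the column is absent, or present but null.
        if is_missing(new_row.get(id_column)):
            new_row[id_column] = stable_object_id(new_row, key_fields, bbox_columns)
        filled.append(new_row)
    return filled
```

Existing non-null IDs pass through unchanged, and because the generated ID depends only on the key and bounding-box values, repeated runs produce the same merge keys.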
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 091f1a65-d0cb-418f-987b-82570f0e3bb2
⛔ Files ignored due to path filters (1)
- uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
- cytotable/convert.py
- cytotable/warehouse/iceberg.py
- cytotable/warehouse/images.py
- tests/test_convert.py
…into ome-arrow-update
Thanks @gwaybio!
Description
The previous iteration of warehouse exports from CytoTable is not memory-efficient and sometimes fails due to excessive memory use. This PR follows existing CytoTable conventions for warehouse exports, paginating over batches and using Parsl for multithreaded execution.
What is the nature of your change?
Checklist
Please ensure that all boxes are checked before indicating that a pull request is ready for review.
Summary by CodeRabbit
New Features
The drop_null option is now available for Iceberg backend exports, providing consistent null-value filtering behavior across all supported export formats.
Performance