Increase OME-Arrow and iceberg performance and scaling by d33bs · Pull Request #465 · cytomining/CytoTable

d33bs · 2026-06-10T19:43:59Z

Description

The previous iteration of warehouse exports from CytoTable is not memory-efficient and sometimes fails from excessive memory use. This PR seeks to follow existing CytoTable conventions for warehouse exports, using pagination over batches and Parsl for multithreaded behavior.

What is the nature of your change?

Bug fix (fixes an issue).
Enhancement (adds functionality).
Breaking change (fix or feature that would cause existing functionality to not work as expected).
This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

I have read the CONTRIBUTING.md guidelines.
My code follows the style guidelines of this project.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation.
My changes generate no new warnings.
New and existing unit tests pass locally with my changes.
I have added tests that prove my fix is effective or that my feature works.
I have deleted all non-relevant text in this pull request template.

Summary by CodeRabbit

New Features
- The drop_null option is now available for Iceberg backend exports, providing consistent null-value filtering behavior across all supported export formats.
Performance
- Optimized Iceberg export operations for improved efficiency.
- Enhanced image processing and deduplication performance.

coderabbitai · 2026-06-10T19:44:07Z

Warning

Review limit reached

@d33bs, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 2 minutes and 19 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e19e0879-57e9-4137-a1ad-8b254a558b3b

📥 Commits

Reviewing files that changed from the base of the PR and between 679dedf and 6b91fa6.

📒 Files selected for processing (4)

cytotable/utils.py
cytotable/warehouse/iceberg.py
cytotable/warehouse/images.py
tests/test_iceberg.py

📝 Walkthrough

This PR optimizes Iceberg export in CytoTable by forwarding drop_null parameter configuration through the workflow, streaming profiles in batches, deduplicating images using Arrow compute instead of pandas, and vectorizing image export preparation to reduce iteration overhead.

Changes

Iceberg Export Optimization and drop_null Parameter Support

| Layer / File(s) | Summary |
|---|---|
| drop_null Parameter Forwarding `cytotable/convert.py`, `cytotable/warehouse/iceberg.py`, `tests/test_convert.py` | `write_iceberg_warehouse` now accepts `drop_null` parameter and forwards it to both profiles and image-join export workflows, matching parquet workflow behavior. `convert()` function passes the parameter through from its public API, with test verification. |
| SQL Query String Formatting `cytotable/convert.py` | Minor f-string quoting and query construction adjustments in pageset-to-parquet export and DuckDB metadata write paths preserve filtering and ordering semantics. |
| Profiles Batch Streaming `cytotable/warehouse/iceberg.py` | Imports pyarrow.compute and introduces `_PROFILE_WRITE_BATCH_ROWS` constant. Profiles Parquet is now streamed in row-group batches with lazy Iceberg table creation from the first batch schema and sequential appends, replacing whole-file materialization. |
| Image Deduplication with Arrow Compute `cytotable/warehouse/iceberg.py` | Scans only the Metadata_ImageID column from existing source-images table. Filters deduplicated rows using Arrow compute is_in/invert operations on typed Arrow arrays instead of pandas DataFrame.isin, then updates seen_source_image_ids from the filtered result. |
| Image Export Preparation Vectorization `cytotable/warehouse/images.py` | Object ID generation uses DataFrame.apply instead of iterrows loop. Bbox filtering vectorized with numeric coercion and span constraints. Metadata_ObjectID computed only when absent. Image expansion uses DataFrame.melt with source_image_file normalization and null filtering, with early returns if no valid rows remain. |
| View Row Count Optimization `cytotable/warehouse/iceberg.py` | describe_iceberg_warehouse no longer materializes view row counts; rows field is set to None for views. |
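The melt-based image expansion described in the Image Export Preparation Vectorization row might look roughly like the sketch below. This is an assumption-laden illustration rather than the PR's implementation: the function name `expand_image_columns`, the `source_image_column` name, and the `Image_FileName_*` columns are hypothetical, while `source_image_file` and `Metadata_ObjectID` come from the summary.

```python
import pandas as pd


def expand_image_columns(frame, id_columns, image_columns):
    """Reshape one row per object into one row per (object, image column).

    Hypothetical helper for illustration; not part of CytoTable.
    """
    long = frame.melt(
        id_vars=id_columns,
        value_vars=image_columns,
        var_name="source_image_column",  # hypothetical column name
        value_name="source_image_file",
    )
    # Normalize file names: cast to string dtype and strip whitespace.
    files = long["source_image_file"].astype("string").str.strip()
    # Treat nulls and empty strings alike as "no image".
    keep = files.fillna("").ne("")
    long = long.loc[keep].assign(source_image_file=files[keep])
    # The result may be empty; callers can return early in that case.
    return long.reset_index(drop=True)


objects = pd.DataFrame(
    {
        "Metadata_ObjectID": ["obj-1", "obj-2"],
        "Image_FileName_DNA": ["a.tif", None],
        "Image_FileName_RNA": [" b.tif ", ""],
    }
)
expanded = expand_image_columns(
    objects,
    id_columns=["Metadata_ObjectID"],
    image_columns=["Image_FileName_DNA", "Image_FileName_RNA"],
)
```

Compared with looping over rows, a single `melt` followed by a boolean filter keeps the work inside pandas, which is the kind of iteration overhead the summary says was reduced.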

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🐰 The warehouse hops with Arrow compute's grace,
Batching profiles at a streaming pace,
With nulls now flowing where they should be,
Images deduplicated, vectorized and free! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main objective: performance and scaling improvements for OME-Arrow and iceberg warehouse exports across multiple files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

d33bs · 2026-06-10T19:50:57Z

@coderabbitai review

coderabbitai · 2026-06-10T19:51:03Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cytotable/warehouse/iceberg.py`:
- Around line 726-748: The loop-only table creation skips registering an empty
Iceberg table when the staged parquet has 0 rows; detect the empty-parquet case
(e.g. inspect pq_file.metadata.num_rows or check if pq_file.iter_batches yields
no batches) and if no rows were produced and profiles_iceberg_table is still
None, create the empty Iceberg table explicitly by calling
bundle.create_table((default_namespace, profiles_table_name), schema,
properties=_cytotable_iceberg_properties()) where schema is taken from the
parquet file (e.g. pq_file.schema_arrow or an equivalent Arrow schema derived
from the Parquet metadata) so that profiles_iceberg_table is set and
profiles_table_exists becomes True even for zero-row exports.

In `@cytotable/warehouse/images.py`:
- Around line 1052-1063: The current logic only generates object IDs when the
Metadata_ObjectID column is missing, leaving existing nulls untouched and
causing merge failures; update the code around the handling of frame and
Metadata_ObjectID so that you compute and fill missing/null Metadata_ObjectID
values (not just when the column is absent) by applying _build_stable_object_id
(using _extract_key_fields and _validated_bbox_values with bbox_columns) only
for rows where frame["Metadata_ObjectID"].isna() (or similar), and then assign
those generated IDs back into frame["Metadata_ObjectID"]; make the same change
at the other occurrence that currently checks column presence (the blocks
referencing Metadata_ObjectID generation and the merge key usage) so no null IDs
remain before the merge.
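A minimal sketch of the null-filling behavior requested here, under stated assumptions: `_stable_id_stand_in` is a hypothetical replacement for the project's `_build_stable_object_id` (whose real signature is not shown in this thread), and the key columns `Metadata_ImageNumber`, `x_min`, and `y_min` are invented for the example.

```python
import hashlib

import pandas as pd

# Hypothetical key columns; the real code derives these from key fields and bbox columns.
KEY_COLUMNS = ("Metadata_ImageNumber", "x_min", "y_min")


def _stable_id_stand_in(row: pd.Series) -> str:
    """Hypothetical stand-in for the project's stable object ID builder."""
    key = "|".join(str(row[column]) for column in KEY_COLUMNS)
    return hashlib.sha256(key.encode()).hexdigest()[:16]


def fill_missing_object_ids(frame: pd.DataFrame) -> pd.DataFrame:
    """Generate IDs for rows where Metadata_ObjectID is absent or null."""
    frame = frame.copy()
    if "Metadata_ObjectID" not in frame.columns:
        frame["Metadata_ObjectID"] = pd.NA
    missing = frame["Metadata_ObjectID"].isna()
    if missing.any():
        # Only rows with a null ID are recomputed; existing IDs are kept.
        generated = frame.loc[missing].apply(_stable_id_stand_in, axis=1)
        frame.loc[missing, "Metadata_ObjectID"] = generated
    return frame


objects = pd.DataFrame(
    {
        "Metadata_ImageNumber": [1, 1, 2],
        "x_min": [0, 5, 0],
        "y_min": [0, 5, 0],
        "Metadata_ObjectID": ["keep-me", None, None],
    }
)
filled = fill_missing_object_ids(objects)
```

Handling the missing-column case and the null-value case through the same `isna()` mask means no null IDs should reach the merge, regardless of which situation the input frame was in.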

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 091f1a65-d0cb-418f-987b-82570f0e3bb2

📥 Commits

Reviewing files that changed from the base of the PR and between 99e8515 and 679dedf.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (4)

cytotable/convert.py
cytotable/warehouse/iceberg.py
cytotable/warehouse/images.py
tests/test_convert.py

d33bs · 2026-06-11T16:10:34Z

Thanks @gwaybio !

ome-arrow and iceberg scaling

aa19280

[pre-commit.ci lite] apply automatic fixes

679dedf

coderabbitai Bot reviewed Jun 10, 2026

Comment thread cytotable/warehouse/iceberg.py

Comment thread cytotable/warehouse/images.py Outdated

d33bs added 2 commits June 10, 2026 14:16

address coderabbit review

047ab36

Merge branch 'ome-arrow-update' of https://github.com/d33bs/CytoTable …

6b91fa6

…into ome-arrow-update

d33bs marked this pull request as ready for review June 10, 2026 20:48

d33bs requested a review from gwaybio as a code owner June 10, 2026 20:48

gwaybio approved these changes Jun 10, 2026

d33bs merged commit b0f3907 into cytomining:main Jun 11, 2026
11 checks passed

d33bs deleted the ome-arrow-update branch June 11, 2026 16:10

d33bs merged 4 commits into cytomining:main from d33bs:ome-arrow-update
