Increase OME-Arrow and iceberg performance and scaling #465

Merged
d33bs merged 4 commits into cytomining:main from d33bs:ome-arrow-update
Jun 11, 2026
Conversation

@d33bs d33bs commented Jun 10, 2026

Member

Description

The previous iteration of CytoTable's warehouse exports was not memory-efficient and sometimes failed from excessive memory use. This PR seeks to follow existing CytoTable conventions for warehouse exports, using pagination over batches and Parsl for multithreaded behavior.

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

Summary by CodeRabbit

  • New Features

    • The drop_null option is now available for Iceberg backend exports, providing consistent null-value filtering behavior across all supported export formats.
  • Performance

    • Optimized Iceberg export operations for improved efficiency.
    • Enhanced image processing and deduplication performance.

coderabbitai Bot commented Jun 10, 2026

Contributor

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e19e0879-57e9-4137-a1ad-8b254a558b3b

📥 Commits

Reviewing files that changed from the base of the PR and between 679dedf and 6b91fa6.

📒 Files selected for processing (4)
  • cytotable/utils.py
  • cytotable/warehouse/iceberg.py
  • cytotable/warehouse/images.py
  • tests/test_iceberg.py
Walkthrough

This PR optimizes Iceberg export in CytoTable by forwarding the drop_null parameter through the workflow, streaming profiles in batches, deduplicating images with Arrow compute instead of pandas, and vectorizing image export preparation to reduce iteration overhead.

Changes

Iceberg Export Optimization and drop_null Parameter Support

| Layer | File(s) | Summary |
| --- | --- | --- |
| drop_null Parameter Forwarding | cytotable/convert.py, cytotable/warehouse/iceberg.py, tests/test_convert.py | write_iceberg_warehouse now accepts drop_null parameter and forwards it to both profiles and image-join export workflows, matching parquet workflow behavior. convert() function passes the parameter through from its public API, with test verification. |
| SQL Query String Formatting | cytotable/convert.py | Minor f-string quoting and query construction adjustments in pageset-to-parquet export and DuckDB metadata write paths preserve filtering and ordering semantics. |
| Profiles Batch Streaming | cytotable/warehouse/iceberg.py | Imports pyarrow.compute and introduces _PROFILE_WRITE_BATCH_ROWS constant. Profiles Parquet is now streamed in row-group batches with lazy Iceberg table creation from the first batch schema and sequential appends, replacing whole-file materialization. |
| Image Deduplication with Arrow Compute | cytotable/warehouse/iceberg.py | Scans only the Metadata_ImageID column from existing source-images table. Filters deduplicated rows using Arrow compute is_in/invert operations on typed Arrow arrays instead of pandas DataFrame.isin, then updates seen_source_image_ids from the filtered result. |
| Image Export Preparation Vectorization | cytotable/warehouse/images.py | Object ID generation uses DataFrame.apply instead of iterrows loop. Bbox filtering vectorized with numeric coercion and span constraints. Metadata_ObjectID computed only when absent. Image expansion uses DataFrame.melt with source_image_file normalization and null filtering, with early returns if no valid rows remain. |
| View Row Count Optimization | cytotable/warehouse/iceberg.py | describe_iceberg_warehouse no longer materializes view row counts; rows field is set to None for views. |

🎯 3 (Moderate) | ⏱️ ~25 minutes


🐰 The warehouse hops with Arrow compute's grace,
Batching profiles at a streaming pace,
With nulls now flowing where they should be,
Images deduplicated, vectorized and free!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main objective: performance and scaling improvements for OME-Arrow and iceberg warehouse exports across multiple files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


d33bs commented Jun 10, 2026

Member Author

@coderabbitai review

coderabbitai Bot commented Jun 10, 2026

Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cytotable/warehouse/iceberg.py`:
- Around line 726-748: The loop-only table creation skips registering an empty
Iceberg table when the staged parquet has 0 rows; detect the empty-parquet case
(e.g. inspect pq_file.metadata.num_rows or check if pq_file.iter_batches yields
no batches) and if no rows were produced and profiles_iceberg_table is still
None, create the empty Iceberg table explicitly by calling
bundle.create_table((default_namespace, profiles_table_name), schema,
properties=_cytotable_iceberg_properties()) where schema is taken from the
parquet file (e.g. pq_file.schema_arrow or an equivalent Arrow schema derived
from the Parquet metadata) so that profiles_iceberg_table is set and
profiles_table_exists becomes True even for zero-row exports.

In `@cytotable/warehouse/images.py`:
- Around line 1052-1063: The current logic only generates object IDs when the
Metadata_ObjectID column is missing, leaving existing nulls untouched and
causing merge failures; update the code around the handling of frame and
Metadata_ObjectID so that you compute and fill missing/null Metadata_ObjectID
values (not just when the column is absent) by applying _build_stable_object_id
(using _extract_key_fields and _validated_bbox_values with bbox_columns) only
for rows where frame["Metadata_ObjectID"].isna() (or similar), and then assign
those generated IDs back into frame["Metadata_ObjectID"]; make the same change
at the other occurrence that currently checks column presence (the blocks
referencing Metadata_ObjectID generation and the merge key usage) so no null IDs
remain before the merge.
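
The second finding above (fill null `Metadata_ObjectID` values, not only a missing column) could look roughly like this in pandas. This is a sketch under assumptions, not the fix that was merged: `make_object_id` is a hypothetical stand-in for `_build_stable_object_id`, whose real inputs are not shown in this thread, and the `Metadata_ImageNumber` and `Metadata_ObjectNumber` columns are chosen only for illustration.

```python
import pandas as pd


def make_object_id(row: pd.Series) -> str:
    """Hypothetical stand-in for the PR's _build_stable_object_id helper."""
    return f"{row['Metadata_ImageNumber']}-{row['Metadata_ObjectNumber']}"


def fill_missing_object_ids(frame: pd.DataFrame) -> pd.DataFrame:
    """Generate IDs for rows where Metadata_ObjectID is absent or null."""
    frame = frame.copy()
    if "Metadata_ObjectID" not in frame.columns:
        # Treat a missing column the same as a column of nulls.
        frame["Metadata_ObjectID"] = None
    missing = frame["Metadata_ObjectID"].isna()
    if missing.any():
        # Only rows lacking an ID are touched; existing IDs are preserved.
        frame.loc[missing, "Metadata_ObjectID"] = frame.loc[missing].apply(
            make_object_id, axis=1
        )
    return frame


# Case 1: the column exists but contains nulls.
partial = pd.DataFrame(
    {
        "Metadata_ImageNumber": [1, 1, 2],
        "Metadata_ObjectNumber": [1, 2, 1],
        "Metadata_ObjectID": ["existing-id", None, None],
    }
)
filled = fill_missing_object_ids(partial)

# Case 2: the column is absent entirely.
created = fill_missing_object_ids(partial.drop(columns=["Metadata_ObjectID"]))
```

Handling both cases through one `isna()` mask means no null IDs remain before a later merge on `Metadata_ObjectID`, which is the failure mode the finding describes.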

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 091f1a65-d0cb-418f-987b-82570f0e3bb2

📥 Commits

Reviewing files that changed from the base of the PR and between 99e8515 and 679dedf.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • cytotable/convert.py
  • cytotable/warehouse/iceberg.py
  • cytotable/warehouse/images.py
  • tests/test_convert.py

Comment thread cytotable/warehouse/iceberg.py
Comment thread cytotable/warehouse/images.py Outdated
@d33bs d33bs marked this pull request as ready for review June 10, 2026 20:48
@d33bs d33bs requested a review from gwaybio as a code owner June 10, 2026 20:48
d33bs commented Jun 11, 2026

Member Author

Thanks @gwaybio !

@d33bs d33bs merged commit b0f3907 into cytomining:main Jun 11, 2026
11 checks passed
@d33bs d33bs deleted the ome-arrow-update branch June 11, 2026 16:10

2 participants