Experimental encoding of provenance multiblock files as parquet #568

Draft
dhirving wants to merge 4 commits into main from u/dhirving/try-multiblock-as-parquet-variant

@dhirving

@dhirving dhirving commented Jun 26, 2026

Contributor

This is a quick experiment to try writing the provenance multiblock files as parquet instead of a custom binary format. It uses the new Variant column type to hold the JSON data previously stored in the multiblock files. (Hand-crafted parquet schemas would be better, but they would take a lot longer to try.)

So far it gives about 3x better compression for the dataset and quanta files. And the query required by DM-55322 becomes a one-liner that executes in 15 seconds against ~23 million rows.

Haven't tested random access yet, but the theory is that we will sort the files and just use the row group indexes to do the lookups by UUID.
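To make the untested lookup idea concrete, here is a rough pure-Python model of it, not the code in this PR and not a parquet API. `RowGroup`, `build_row_groups`, and `lookup` are hypothetical names; the sketch assumes rows are sorted by UUID bytes and that each row group exposes min/max statistics for the key column, so a lookup only has to open one group.

```python
import bisect
import uuid
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for one row group plus its min/max key statistics."""

    min_key: bytes
    max_key: bytes
    rows: list  # list of (uuid_bytes, payload) tuples, sorted by uuid_bytes


def build_row_groups(records, rows_per_group):
    """Sort records by key and split them into fixed-size groups."""
    ordered = sorted(records, key=lambda record: record[0])
    groups = []
    for start in range(0, len(ordered), rows_per_group):
        chunk = ordered[start : start + rows_per_group]
        groups.append(RowGroup(chunk[0][0], chunk[-1][0], chunk))
    return groups


def lookup(groups, key):
    """Return the payload for key, or None, touching at most one group."""
    # The data is globally sorted, so the per-group max keys are increasing
    # and we can binary-search them for the first group that could hold key.
    max_keys = [group.max_key for group in groups]
    index = bisect.bisect_left(max_keys, key)
    if index == len(groups) or groups[index].min_key > key:
        return None  # key falls outside every group's [min, max] range
    rows = groups[index].rows
    keys = [row_key for row_key, _ in rows]
    position = bisect.bisect_left(keys, key)
    if position < len(rows) and keys[position] == key:
        return rows[position][1]
    return None


# Deterministic example data: 1000 records split into groups of 100 rows.
records = [
    (uuid.uuid5(uuid.NAMESPACE_DNS, str(n)).bytes, f"quantum-{n}")
    for n in range(1000)
]
groups = build_row_groups(records, rows_per_group=100)
missing_key = uuid.uuid5(uuid.NAMESPACE_DNS, "not-present").bytes
```

In a real file the statistics would come from the parquet footer rather than an in-memory list, and whether the reader can prune to a single row group efficiently is exactly what still needs testing.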

Doesn't work at all for metadata and logs: those are large blobs that don't really make sense in parquet. The files get larger, and it's hard to keep the parquet encoder from running out of memory when writing them.

Checklist

  • ran Jenkins
  • ran and inspected package-docs build
  • added a release note for user-visible changes to doc/changes

@codecov

codecov Bot commented Jun 26, 2026


Codecov Report

❌ Patch coverage is 0% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.60%. Comparing base (9315b27) to head (dad4ab2).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| python/lsst/pipe/base/quantum_graph/_convert.py | 0.00% | 37 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #568      +/-   ##
==========================================
- Coverage   88.81%   88.60%   -0.21%     
==========================================
  Files         160      161       +1     
  Lines       22357    22394      +37     
  Branches     2657     2663       +6     
==========================================
- Hits        19856    19842      -14     
- Misses       1861     1906      +45     
- Partials      640      646       +6     

