Experimental encoding of provenance multiblock files as parquet by dhirving · Pull Request #568 · lsst/pipe_base

dhirving · 2026-06-26T23:16:58Z

This is a quick experiment to try writing the provenance multiblock files as parquet instead of a custom binary format. It uses the new Variant column type to hold the JSON data previously stored in the multiblock files. (Hand-crafted parquet schemas would be better, but they would take a lot longer to try.)

So far it gives about 3x better compression for the dataset and quanta files. And the query required by DM-55322 becomes a one-liner that executes in 15 seconds against ~23 million rows.

Haven't tested random access yet, but the theory is that we will sort the files by UUID and use the row group indexes to do the lookups by UUID.
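To illustrate the random-access idea, here is a minimal pure-Python model of it, not code from this PR and not the parquet API. It assumes rows are sorted by UUID bytes and split into fixed-size groups that each record their minimum and maximum key, which is roughly what parquet row group statistics provide. `RowGroup`, `build_row_groups`, and `lookup` are hypothetical names used only for this sketch.

```python
import bisect
import uuid
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Stand-in for one parquet row group: rows plus min/max key statistics."""

    min_key: bytes
    max_key: bytes
    rows: list  # (key_bytes, payload) tuples, sorted by key


def build_row_groups(records, rows_per_group):
    """Sort (key, payload) records by key and split them into fixed-size groups."""
    ordered = sorted(records, key=lambda r: r[0])
    groups = []
    for start in range(0, len(ordered), rows_per_group):
        chunk = ordered[start:start + rows_per_group]
        groups.append(RowGroup(chunk[0][0], chunk[-1][0], chunk))
    return groups


def lookup(groups, key):
    """Return the payload for key, or None, reading at most one row group."""
    # Binary search over the per-group maxima to find the only candidate group.
    max_keys = [g.max_key for g in groups]
    i = bisect.bisect_left(max_keys, key)
    if i == len(groups) or key < groups[i].min_key:
        return None  # key falls outside every group's [min, max] range
    # Binary search inside the single candidate group.
    keys = [r[0] for r in groups[i].rows]
    j = bisect.bisect_left(keys, key)
    if j < len(keys) and keys[j] == key:
        return groups[i].rows[j][1]
    return None


if __name__ == "__main__":
    # Deterministic demo data: 1000 rows keyed by 16-byte UUIDs.
    records = [(uuid.UUID(int=i * 2654435761).bytes, f"row-{i}") for i in range(1000)]
    groups = build_row_groups(records, rows_per_group=64)
    print(len(groups), lookup(groups, records[42][0]))
```

Because the data is sorted, the per-group ranges do not overlap, so a lookup only needs the statistics plus one group's rows. Whether real parquet readers get the same benefit depends on the row group size and on the statistics being written, which is what still needs testing.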

Doesn't work at all for metadata and logs -- those are large blobs that don't really make sense in parquet. The files get larger and it's hard to keep the parquet encoder from running out of memory when writing them.

Checklist

ran Jenkins
ran and inspected package-docs build
added a release note for user-visible changes to doc/changes

codecov · 2026-06-26T23:20:38Z

Codecov Report

❌ Patch coverage is 0% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.60%. Comparing base (9315b27) to head (dad4ab2).
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
python/lsst/pipe/base/quantum_graph/_convert.py	0.00%	37 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #568      +/-   ##
==========================================
- Coverage   88.81%   88.60%   -0.21%     
==========================================
  Files         160      161       +1     
  Lines       22357    22394      +37     
  Branches     2657     2663       +6     
==========================================
- Hits        19856    19842      -14     
- Misses       1861     1906      +45     
- Partials      640      646       +6


dhirving added 4 commits June 26, 2026 11:59

Add conversion of multiblock to parquet

2e8a29a

Augment output with UUID column and sort by it

4ef1fa1

add runner script for conversion

ffb4243

prevent OOM with larger files

dad4ab2

dhirving wants to merge 4 commits into main from u/dhirving/try-multiblock-as-parquet-variant
