fix: count distinct top-level fields against dataSkippingNumIndexedCols #4466

Draft
rtyler wants to merge 3 commits into delta-io:main from rtyler:fix/data-skipping-3172

rtyler commented May 19, 2026

The per-file stats budget (`dataSkippingNumIndexedCols`, default 32) was being
consumed one Parquet leaf at a time, so a single top-level column with many
nested fields exhausted the budget and starved later top-level columns of stats.

This sloptastic 🦜 PR attempts to match the Delta/Spark
implementation by counting the budget against N distinct top-level fields
and collecting stats for every leaf under them; partition columns no
longer consume a slot.
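
To make the difference concrete, here is a minimal sketch of the two budgeting strategies. It is not the code in `crates/core/src/writer/stats.rs`: `TopLevelField`, `leaves_by_leaf_budget`, `leaves_by_top_level_budget`, and `demo_schema` are hypothetical names, and the model of the old behavior (partition columns use up a slot but yield no stats) is an assumption based on the description above.

```rust
/// Hypothetical, simplified model of a top-level column and its leaf paths.
struct TopLevelField {
    name: &'static str,
    leaves: Vec<String>,
}

/// Assumed old behavior: every leaf (including a partition column) consumes
/// one slot of the budget; partition columns then yield no stats.
fn leaves_by_leaf_budget(
    schema: &[TopLevelField],
    partition_columns: &[&str],
    budget: usize,
) -> Vec<String> {
    schema
        .iter()
        .flat_map(|f| f.leaves.iter().map(move |leaf| (f.name, leaf)))
        .take(budget)
        .filter(|(name, _)| !partition_columns.contains(name))
        .map(|(_, leaf)| leaf.clone())
        .collect()
}

/// Described new behavior: partition columns are skipped before counting,
/// the budget counts distinct top-level fields, and every leaf under a
/// selected field is kept.
fn leaves_by_top_level_budget(
    schema: &[TopLevelField],
    partition_columns: &[&str],
    budget: usize,
) -> Vec<String> {
    schema
        .iter()
        .filter(|f| !partition_columns.contains(&f.name))
        .take(budget)
        .flat_map(|f| f.leaves.iter().cloned())
        .collect()
}

/// Illustrative schema: one partition column, one wide struct with 40
/// leaves, then two plain columns.
fn demo_schema() -> Vec<TopLevelField> {
    vec![
        TopLevelField { name: "date", leaves: vec!["date".to_string()] },
        TopLevelField {
            name: "payload",
            leaves: (0..40).map(|i| format!("payload.f{i}")).collect(),
        },
        TopLevelField { name: "id", leaves: vec!["id".to_string()] },
        TopLevelField { name: "ts", leaves: vec!["ts".to_string()] },
    ]
}
```

With `demo_schema()`, `date` as the partition column, and a budget of 32, the leaf-counting strategy spends one slot on `date` and the remaining 31 on `payload`, so `id` and `ts` get no stats. The top-level strategy covers all 42 non-partition leaves.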

The negative test case was added first, and I then validated that the
tests fail in line with the suggestions made by posters in the
original issue.

Fixes #3172

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rtyler added the binding/rust and ai labels May 19, 2026
rtyler force-pushed the fix/data-skipping-3172 branch from 38bf806 to d1de4ae May 19, 2026 22:19
codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 93.82716% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.91%. Comparing base (36bf487) to head (756c999).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/core/src/writer/stats.rs | 93.82% | 5 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4466      +/-   ##
==========================================
+ Coverage   78.89%   78.91%   +0.01%     
==========================================
  Files         173      173              
  Lines       59697    59777      +80     
  Branches    59697    59777      +80     
==========================================
+ Hits        47096    47171      +75     
- Misses       9983     9988       +5     
  Partials     2618     2618              


rtyler and others added 2 commits May 19, 2026 16:43
Signed-off-by: R Tyler Croy <rtyler@brokenco.de>
Fixes delta-io#3172

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: R Tyler Croy <rtyler@brokenco.de>
rtyler force-pushed the fix/data-skipping-3172 branch from d1de4ae to c93681d May 20, 2026 13:10
github-actions Bot added the binding/python label May 20, 2026
See delta-io#3172

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>

Labels

ai: The issue or task is AI generated or contributed.
binding/python: Issues for the Python package
binding/rust: Issues for the Rust crate


Development

Successfully merging this pull request may close these issues.

nested fields count towards limit of stats calculation of first 32 columns

1 participant