
[Bug] Parquet reader can fail seeking past known repdefs for repeated columns #624

Description

@oarap

Component Selection

  • Core Engine (Expression eval, Memory, Vector)
  • Connectors / File Formats (Hive, Parquet, etc.)
  • API / Bindings (Python, etc.)
  • Build
  • Other

Describe the Bug

The native Parquet reader can fail when reading non-top-level repeated columns with chunk-level repetition/definition levels.

The failure happens when the value reader advances to a later physical data page before the corresponding repetition/definition-level page metadata has been materialized into numLeavesInPage_.

In this state, PageReader::setPageRowInfo() increments pageIndex_ and checks that the page index is already covered by numLeavesInPage_:

BOLT_CHECK_LT(
    pageIndex_,
    numLeavesInPage_.size(),
    "Seeking past known repdefs for non top level column page {}",
    pageIndex_);

However, this state can be recoverable. More rep/def metadata may already be staged in preloadedRepDefs_, but not yet decoded into numLeavesInPage_.

As a result, sparse/selective reads over repeated Parquet columns can throw:

Seeking past known repdefs for non top level column page N

even though the reader already has pending rep/def batches available and could continue by materializing them.

Reproduction Steps

A focused unit test can reproduce the failure by constructing the relevant PageReader state directly:

  1. Create a non-top-level repeated leaf PageReader.
  2. Set chunk-level rep/defs state:
    • hasChunkRepDefs_ = true
    • pageIndex_ = 0
    • numLeavesInPage_ = {1}
  3. Add one pending rep/def batch to preloadedRepDefs_.
  4. Call setPageRowInfo(false).

Without the fix, setPageRowInfo(false) increments pageIndex_ to 1, sees that numLeavesInPage_.size() is still 1, and fails with:

(1 vs. 1) Seeking past known repdefs for non top level column page 1
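
The reproduction state above can be modeled outside Bolt with a small stand-in. This is a toy sketch under stated assumptions: StrictPageReaderModel, its members, and reproduceFailure are hypothetical names and are not the real PageReader API. It only mirrors the increment-then-check order described in the steps.

```cpp
#include <cstdint>
#include <deque>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the PageReader state named in the steps above.
// This is NOT the real Bolt class; it models only the fields involved.
struct StrictPageReaderModel {
  bool hasChunkRepDefs = true;
  int32_t pageIndex = 0;
  // Leaf counts for the pages whose rep/def metadata is already decoded.
  std::vector<int32_t> numLeavesInPage;
  // Pending batches; each batch lists leaf counts for the pages it covers.
  std::deque<std::vector<int32_t>> preloadedRepDefs;

  // Models the behavior without the fix: increment, then check immediately,
  // ignoring anything still pending in preloadedRepDefs.
  void setPageRowInfo() {
    ++pageIndex;
    if (pageIndex >= static_cast<int32_t>(numLeavesInPage.size())) {
      throw std::runtime_error(
          "(" + std::to_string(pageIndex) + " vs. " +
          std::to_string(numLeavesInPage.size()) +
          ") Seeking past known repdefs for non top level column page " +
          std::to_string(pageIndex));
    }
  }
};

// Builds the state from steps 1-3 and performs step 4.
// Returns the error message, or an empty string if nothing was thrown.
std::string reproduceFailure() {
  StrictPageReaderModel reader;
  reader.numLeavesInPage = {1};
  reader.preloadedRepDefs.push_back(std::vector<int32_t>{1});
  try {
    reader.setPageRowInfo();
  } catch (const std::runtime_error& e) {
    return e.what();
  }
  return "";
}
```

In this model the pending batch is never consulted, so the check fails even though decoding it would have covered page 1.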

A regression test that could be added for this case:

TEST_F(ParquetPageReaderTest, loadsPendingRepDefsBeforePageRowInfoCheck)

The issue can also be triggered by sparse/selective reads over repeated Parquet columns, especially when rep/def decoding is batched and the value reader reaches a later physical page while more rep/def metadata remains pending in preloadedRepDefs_.

Targeted verification command:

cmake --build _build/Release --target bolt_dwio_parquet_reader_test

_build/Release/bolt/dwio/parquet/tests/reader/bolt_dwio_parquet_reader_test --gtest_filter=ParquetPageReaderTest.loadsPendingRepDefsBeforePageRowInfoCheck

Observed result without the fix:

[ FAILED ] ParquetPageReaderTest.loadsPendingRepDefsBeforePageRowInfoCheck

Reason: (1 vs. 1) Seeking past known repdefs for non top level column page 1
Expression: pageIndex_ < numLeavesInPage_.size()
Function: setPageRowInfo
File: bolt/dwio/parquet/reader/PageReader.cpp

Bolt Version / Commit ID

main

System Configuration

- **OS**: (e.g. Ubuntu 22.04, CentOS 7)
- **Compiler**: (e.g. GCC 11, Clang 14)
- **Build Type**: (Debug / Release / RelWithDebInfo)
- **CPU Arch**: (e.g. x86_64 AVX2, ARM64)
- **Framework**: (e.g. Spark 3.3, PrestoDB)

Logs / Stack Trace

Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (5 vs. 5) Seeking past known repdefs for non top level column page 5
Retriable: False
Expression: pageIndex_ < numLeavesInPage_.size()
Additional Context: Operator: TableScan[0] 0
Function: setPageRowInfo
File: bolt/dwio/parquet/reader/PageReader.cpp
Line: 265

Expected Behavior

When PageReader::setPageRowInfo() advances to a non-top-level repeated-column page whose metadata is not yet present in numLeavesInPage_, it should first check whether additional rep/def batches are pending in preloadedRepDefs_.

If pending batches exist, the reader should materialize them by calling loadMoreRepDefs() before enforcing the bounds check.

The existing safety check should remain in place. If no pending rep/def batches exist and pageIndex_ is still beyond numLeavesInPage_, the reader should continue to fail as before, because that indicates a truly invalid seek or a corrupted/inconsistent state.
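
The expected behavior can be illustrated with a toy model that covers both outcomes: recovery when batches are pending, and failure when none remain. This is a sketch under stated assumptions: LazyPageReaderModel, its members, and tryAdvance are hypothetical names, and the loadMoreRepDefs() body here is a stand-in that appends one batch of leaf counts rather than Bolt's actual decoding.

```cpp
#include <cstdint>
#include <deque>
#include <stdexcept>
#include <vector>

// Hypothetical model of the expected behavior; not the real Bolt PageReader.
struct LazyPageReaderModel {
  int32_t pageIndex = 0;
  std::vector<int32_t> numLeavesInPage;
  std::deque<std::vector<int32_t>> preloadedRepDefs;

  // Stand-in for decoding one pending batch: append its per-page leaf counts.
  void loadMoreRepDefs() {
    const std::vector<int32_t>& batch = preloadedRepDefs.front();
    numLeavesInPage.insert(numLeavesInPage.end(), batch.begin(), batch.end());
    preloadedRepDefs.pop_front();
  }

  void setPageRowInfo() {
    ++pageIndex;
    // Materialize pending batches until the page is covered or none remain.
    while (pageIndex >= static_cast<int32_t>(numLeavesInPage.size()) &&
           !preloadedRepDefs.empty()) {
      loadMoreRepDefs();
    }
    // The original bounds check still applies afterwards.
    if (pageIndex >= static_cast<int32_t>(numLeavesInPage.size())) {
      throw std::runtime_error("Seeking past known repdefs");
    }
  }
};

// Returns true if advancing one page succeeds, false if the check fires.
bool tryAdvance(LazyPageReaderModel& reader) {
  try {
    reader.setPageRowInfo();
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

With numLeavesInPage = {1} and one pending batch, tryAdvance succeeds in this model and the batch is consumed. With the same vector and no pending batches, it still reports failure, which matches the requirement that the bounds check is not suppressed.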

Additional context

Root cause hypothesis:

numLeavesInPage_ tracks the number of leaf values decoded from rep/def metadata for each data page. With batched rep/def decoding, this vector may lag behind the physical data page reached by the value reader.

Sparse/selective scans can advance the value path across page boundaries while only a subset of rep/def page metadata has been decoded into numLeavesInPage_.

The important detail is that the missing metadata may already be available in preloadedRepDefs_. In that case, failing immediately is too strict. The reader should lazily decode pending rep/def batches until either:

  1. pageIndex_ < numLeavesInPage_.size(), or
  2. there are no pending rep/def batches left.

The fix is small and preserves the original invariant:

while (pageIndex_ >= static_cast<int32_t>(numLeavesInPage_.size()) &&
       !preloadedRepDefs_.empty()) {
  loadMoreRepDefs();
}

BOLT_CHECK_LT(
    pageIndex_,
    numLeavesInPage_.size(),
    "Seeking past known repdefs for non top level column page {}",
    pageIndex_);

Validation:

• The focused regression test fails without the fix with the expected Seeking past known repdefs error.
• The same test passes with the fix.
• The fix only materializes already-preloaded rep/def batches and does not suppress the existing bounds check.
