[Bug] Parquet reader can fail seeking past known repdefs for repeated columns #624

### Component Selection

- [ ] Core Engine (Expression eval, Memory, Vector)
- [x] Connectors / File Formats (Hive, Parquet, etc.)
- [ ] API / Bindings (Python, etc.)
- [ ] Build
- [ ] Other

### Describe the Bug

The native Parquet reader can fail when reading non-top-level repeated columns with chunk-level repetition/definition levels.

The failure happens when the value reader advances to a later physical data page before the corresponding repetition/definition-level (rep/def) page metadata has been materialized into `numLeavesInPage_`.

In this state, `PageReader::setPageRowInfo()` increments `pageIndex_` and checks that the page index is already covered by `numLeavesInPage_`:

  ```cpp
  BOLT_CHECK_LT(
      pageIndex_,
      numLeavesInPage_.size(),
      "Seeking past known repdefs for non top level column page {}",
      pageIndex_);
```

However, this state can be recoverable. More rep/def metadata may already be staged in `preloadedRepDefs_`, but not yet decoded into `numLeavesInPage_`.

As a result, sparse/selective reads over repeated Parquet columns can throw `Seeking past known repdefs for non top level column page N` even though the reader already has pending rep/def batches available and could continue by materializing them.

### Reproduction Steps

A focused unit test can reproduce the failure by constructing the relevant `PageReader` state directly:

  1. Create a non-top-level repeated leaf `PageReader`.
  2. Set the chunk-level rep/def state:
     - `hasChunkRepDefs_ = true`
     - `pageIndex_ = 0`
     - `numLeavesInPage_ = {1}`
  3. Add one pending rep/def batch to `preloadedRepDefs_`.
  4. Call `setPageRowInfo(false)`.

  Without the fix, `setPageRowInfo(false)` increments `pageIndex_` to `1`, sees that `numLeavesInPage_.size()` is still `1`, and fails with:


  (1 vs. 1) Seeking past known repdefs for non top level column page 1
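
  The state above can be modeled without the real reader. The sketch below is a toy stand-in, not Bolt's `PageReader`: `ToyPageReader`, `setPageRowInfoStrict`, and `strictCheckThrowsWithPendingBatch` are hypothetical names, and a pending batch is assumed to be representable as a list of per-page leaf counts. It only illustrates why the pre-fix bounds check trips while a batch is still pending.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for PageReader; only the member names mirror the issue.
struct ToyPageReader {
  // Recorded for completeness; the toy check does not branch on it.
  bool hasChunkRepDefs_ = true;
  int32_t pageIndex_ = 0;
  std::vector<int32_t> numLeavesInPage_{1};
  // Assumption: each pending batch is modeled as the leaf counts it would add.
  std::deque<std::vector<int32_t>> preloadedRepDefs_;

  // Models the pre-fix behavior: advance, then check bounds immediately,
  // without looking at preloadedRepDefs_.
  void setPageRowInfoStrict() {
    ++pageIndex_;
    if (static_cast<std::size_t>(pageIndex_) >= numLeavesInPage_.size()) {
      throw std::runtime_error(
          "Seeking past known repdefs for non top level column page " +
          std::to_string(pageIndex_));
    }
  }
};

// Builds the state from the steps above and reports whether the strict
// check throws even though one rep/def batch is still pending.
bool strictCheckThrowsWithPendingBatch() {
  ToyPageReader reader;
  reader.preloadedRepDefs_.push_back({1}); // one pending rep/def batch
  try {
    reader.setPageRowInfoStrict();
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}
```

  In this toy model the pending batch is never consulted, so the check fails in the same way as the reported error.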

  A regression test that could be added for this case:

  `TEST_F(ParquetPageReaderTest, loadsPendingRepDefsBeforePageRowInfoCheck)`

  The issue can also be triggered by sparse/selective reads over repeated Parquet columns, especially when rep/def decoding is batched and the value reader reaches a later physical page while more rep/def metadata remains pending in `preloadedRepDefs_`.

  Targeted verification command:

  cmake --build _build/Release --target bolt_dwio_parquet_reader_test

  _build/Release/bolt/dwio/parquet/tests/reader/bolt_dwio_parquet_reader_test \
    --gtest_filter=ParquetPageReaderTest.loadsPendingRepDefsBeforePageRowInfoCheck

  Observed result without the fix:

  [  FAILED  ] ParquetPageReaderTest.loadsPendingRepDefsBeforePageRowInfoCheck

  Reason: (1 vs. 1) Seeking past known repdefs for non top level column page 1
  Expression: pageIndex_ < numLeavesInPage_.size()
  Function: setPageRowInfo
  File: bolt/dwio/parquet/reader/PageReader.cpp

### Bolt Version / Commit ID

main

### System Configuration

```markdown
- **OS**: (e.g. Ubuntu 22.04, CentOS 7)
- **Compiler**: (e.g. GCC 11, Clang 14)
- **Build Type**: (Debug / Release / RelWithDebInfo)
- **CPU Arch**: (e.g. x86_64 AVX2, ARM64)
- **Framework**: (e.g. Spark 3.3, PrestoDB)
```

### Logs / Stack Trace

```shell
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (5 vs. 5) Seeking past known repdefs for non top level column page 5
Retriable: False
Expression: pageIndex_ < numLeavesInPage_.size()
Additional Context: Operator: TableScan[0] 0
Function: setPageRowInfo
File: bolt/dwio/parquet/reader/PageReader.cpp
Line: 265
```

### Expected Behavior

  When `PageReader::setPageRowInfo()` advances to a non-top-level repeated-column page whose metadata is not yet present in `numLeavesInPage_`, it should first check whether additional rep/def batches are pending in `preloadedRepDefs_`.

  If pending batches exist, the reader should materialize them by calling `loadMoreRepDefs()` before enforcing the bounds check.

  The existing safety check should remain in place. If no pending rep/def batches exist and `pageIndex_` is still beyond `numLeavesInPage_`, the reader should continue to fail as before, because that indicates a true invalid seek or a corrupted/inconsistent state.

### Additional context

  Root cause hypothesis:

  `numLeavesInPage_` tracks the number of leaf values decoded from rep/def metadata for each data page. With batched rep/def decoding, this vector may lag behind the physical data page reached by the value reader.

  Sparse/selective scans can advance the value path across page boundaries while only a subset of rep/def page metadata has been decoded into `numLeavesInPage_`.

  The important detail is that the missing metadata may already be available in `preloadedRepDefs_`. In that case, failing immediately is too strict. The reader should lazily decode pending rep/def batches until either:

  1. `pageIndex_ < numLeavesInPage_.size()`, or
  2. there are no pending rep/def batches left.

  The fix is small and preserves the original invariant:

  ```cpp
  while (pageIndex_ >= static_cast<int32_t>(numLeavesInPage_.size()) &&
         !preloadedRepDefs_.empty()) {
    loadMoreRepDefs();
  }

  BOLT_CHECK_LT(
      pageIndex_,
      numLeavesInPage_.size(),
      "Seeking past known repdefs for non top level column page {}",
      pageIndex_);
```
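
  To see the loop-then-check behavior in isolation, here is a minimal sketch on a toy model rather than the real `PageReader`. `ToyPageReaderWithFix` and `advanceSucceeds` are hypothetical names, and the body of `loadMoreRepDefs()` is an assumption (it simply appends one pending batch of leaf counts); the real function decodes rep/def levels.

```cpp
#include <cstdint>
#include <deque>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for PageReader; not the real Bolt class.
struct ToyPageReaderWithFix {
  int32_t pageIndex_ = 0;
  std::vector<int32_t> numLeavesInPage_{1};
  std::deque<std::vector<int32_t>> preloadedRepDefs_;

  // Assumed behavior: move the oldest pending batch into numLeavesInPage_.
  void loadMoreRepDefs() {
    const std::vector<int32_t>& batch = preloadedRepDefs_.front();
    numLeavesInPage_.insert(numLeavesInPage_.end(), batch.begin(), batch.end());
    preloadedRepDefs_.pop_front();
  }

  void setPageRowInfo() {
    ++pageIndex_;
    // Drain pending batches until the page is covered or nothing is left.
    while (pageIndex_ >= static_cast<int32_t>(numLeavesInPage_.size()) &&
           !preloadedRepDefs_.empty()) {
      loadMoreRepDefs();
    }
    // The original invariant is still enforced after draining.
    if (pageIndex_ >= static_cast<int32_t>(numLeavesInPage_.size())) {
      throw std::runtime_error(
          "Seeking past known repdefs for non top level column page " +
          std::to_string(pageIndex_));
    }
  }
};

// Returns true if advancing one page succeeds with the given number of
// single-page pending batches staged in preloadedRepDefs_.
bool advanceSucceeds(int pendingBatches) {
  ToyPageReaderWithFix reader;
  for (int i = 0; i < pendingBatches; ++i) {
    reader.preloadedRepDefs_.push_back({1});
  }
  try {
    reader.setPageRowInfo();
  } catch (const std::runtime_error&) {
    return false;
  }
  return true;
}
```

  In this model, `advanceSucceeds(1)` covers the recoverable case and `advanceSucceeds(0)` covers the case that should keep failing. The loop terminates because each iteration removes one batch from `preloadedRepDefs_`.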

  Validation:

  - The focused regression test fails without the fix with the expected `Seeking past known repdefs` error.
  - The same test passes with the fix.
  - The fix only materializes already-preloaded rep/def batches and does not suppress the existing bounds check.
