feat(selective): SIMD filterCache evaluation for dictionary index reading with filters (#17954) #1433

Open
prestodb-ci wants to merge 25 commits into oss-baseline from staging-1777c7523-pr

@prestodb-ci prestodb-ci commented Nov 25, 2025

Test PR for branch staging-rebase-pr with head e06dd2b

@prestodb-ci prestodb-ci deleted the staging-1777c7523-pr branch November 26, 2025 01:29
Signed-off-by: Yuan <yuanzhou@apache.org>

Set ccache maximum size to 1G

Remove sed command from Gluten workflow

Removed a sed command that replaces 'oap-project' with 'IBM' in the get-velox.sh script.

Modify get-velox.sh to change 'ibm' to 'ibm-xxx'

Update the get-velox.sh script to replace 'ibm' with 'ibm-xxx'.

Update sed command to be case-insensitive

Update gluten.yml

fix iceberg unit test

Signed-off-by: Yuan <yuanzhou@apache.org>

Update gluten.yml

Enable enhanced features in gluten build script

Update cache keys for Gluten workflow
@FelixYBW FelixYBW restored the staging-1777c7523-pr branch November 27, 2025 09:14
@FelixYBW FelixYBW reopened this Nov 27, 2025
@FelixYBW FelixYBW changed the title fix: Avoid TSAN data race during cache entry initialization (#15623) refactor: Extract common BaseSerializedPage API (#15626) Dec 3, 2025
@prestodb-ci prestodb-ci changed the title refactor: Extract common BaseSerializedPage API (#15626) fix(build): Ambiguity caused by long literal (#15670) Dec 3, 2025
Huameng (Michael) Jiang and others added 2 commits June 27, 2026 01:05
…ubator#17940)

Summary:
Pull Request resolved: facebookincubator#17940

X-link: facebookincubator/nimble#911

Hardens the dictionary visitor read path against cross-chunk null-handling bugs surfaced by multi-chunk fuzzing, plus a clarity rename of one helper and an aarch64 build fix. Sits on top of D108709293 (RLEFix) in the dictionary vector stack.

## Bug fix: stale result-null bits leak across buffer-reused reads

When a chunk is nullable and the output row set is sparse/compacted (`returnReaderNulls_ == false`), the dictionary-index read path cleared `resultNulls_` only lazily -- it touched the buffer only when it actually emitted a null. A read whose surviving rows are all non-null therefore never cleared `resultNulls_`, so a stale null bit left over from a prior (buffer-reused) read leaked into the result vector and corrupted the output.

Fix: prepare the result-null buffer eagerly in the sparse dictionary-index reader instead of on the first emitted null. `readSparseMaterializedIndices` now clears `resultNulls_` up front whenever the read range carries nulls (`rawNulls != nullptr`), so an all-non-null read still starts from a clean buffer.

The prep lives in the reader that sees `rawNulls`, not in a single `NullableEncoding` chokepoint, because output nulls have two sources: the encoding's own nulls (via `NullableEncoding`) and incoming/inMap nulls on a non-nullable column nested under a nullable parent (e.g. a flat-map dict child with inMap nulls). The latter has no `NullableEncoding` in its read path, so a chokepoint there would never run for it; `rawNulls` is set for both, matching how the dense and filtered index readers already prepare their buffers.
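
The failure mode above can be reproduced in a small standalone sketch. Everything here is hypothetical (`SparseReaderSketch` and its `read` method are not Velox or Nimble APIs); it only models a null buffer that is reused across reads and is either cleared lazily on the first emitted null or eagerly up front.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a reader that reuses one result-null buffer
// across reads. In this sketch, true means "null".
struct SparseReaderSketch {
  std::vector<bool> resultNulls; // Reused between reads, like resultNulls_.

  // Reads the selected `rows` of `inputNulls` into compacted output slots and
  // returns the null flags of those slots. With eagerClear == false the buffer
  // is prepared only when the first null is emitted (the buggy behavior).
  std::vector<bool> read(
      const std::vector<bool>& inputNulls,
      const std::vector<int32_t>& rows,
      bool eagerClear) {
    const size_t n = rows.size();
    if (resultNulls.size() < n) {
      resultNulls.resize(n, false);
    }
    bool prepared = false;
    auto prepare = [&]() {
      std::fill(resultNulls.begin(), resultNulls.begin() + n, false);
      prepared = true;
    };
    if (eagerClear) {
      prepare();
    }
    for (size_t out = 0; out < n; ++out) {
      assert(rows[out] >= 0 &&
             static_cast<size_t>(rows[out]) < inputNulls.size());
      if (inputNulls[rows[out]]) {
        if (!prepared) {
          prepare();
        }
        resultNulls[out] = true;
      }
    }
    return std::vector<bool>(resultNulls.begin(), resultNulls.begin() + n);
  }
};
```

With `eagerClear == false`, a read that emits nulls followed by an all-non-null read on the same object returns the first read's null bits; with `eagerClear == true` the second read comes back clean.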

## Rename: `initReturnReaderNulls` -> `setReturnNullsMode`

The helper only sets which buffer `resultNulls()` returns for the read -- it sets the `returnReaderNulls_` flag (and `anyNulls_`) and allocates nothing. The old name implied buffer initialization/allocation. Renamed across the Velox `SelectiveColumnReader` API and all Nimble call sites (legacy and non-legacy `NullableEncoding`, `ChunkedDecoder`, the `ReadWithVisitorParams` field in `Encoding.h`, `ReadWithVisitorTest`, `EncodingBench`).

## Build fix: bucket_benchmark on aarch64

`dwio/nimble/encodings/tests:bucket_benchmark` hardcoded x86-only tuning flags (`-mavx`, `-mavx2`, `-mbmi`, `-mbmi2`, `-mclzero`, `-mlzcnt`) in `compiler_flags`, which clang rejects for `aarch64-redhat-linux-gnu`, so the target failed to build on aarch64. The benchmark source uses no raw x86 intrinsics, so the flags are now arch-gated with `select()` (applied only on `ovr_config//cpu:x86_64`) and the portable source builds on aarch64 without them. Pre-existing breakage surfaced by this diff rebuilding the shared encoding headers.

## Test coverage: exercise cross-chunk resume paths

`E2EFilterTest` previously wrote a single chunk per stream, so the cross-chunk reader resume paths (where the above bug lives) were never exercised by standalone tests. Added two helpers and applied them across the dictionary / mainly-constant / RLE / nullable / fuzz cases:
- `applyMultiChunkOptions(VeloxWriterOptions&)` -- enable chunking, zero `minStreamChunkRawSize`, and a flush policy that never flushes the stripe but chunks at every `write()` boundary.
- `writeInChunks(writer, batch, numChunks=3)` -- slice a batch into N row-slices so each emits its own chunk (a single `write()` yields only one chunk).
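
As an illustration of the slicing idea behind `writeInChunks`, the sketch below computes row ranges for N slices of a batch. The names `RowSlice` and `sliceRowsSketch` are hypothetical, and the near-even split is an assumption; the commit message does not say how the real helper divides rows.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one row-slice of a batch.
struct RowSlice {
  int64_t offset;
  int64_t length;
};

// Splits [0, numRows) into up to numChunks contiguous, near-equal slices.
// Empty slices are skipped, so fewer than numChunks may be returned.
std::vector<RowSlice> sliceRowsSketch(int64_t numRows, int32_t numChunks = 3) {
  assert(numRows >= 0 && numChunks > 0);
  std::vector<RowSlice> slices;
  const int64_t base = numRows / numChunks;
  const int64_t extra = numRows % numChunks;
  int64_t offset = 0;
  for (int32_t i = 0; i < numChunks; ++i) {
    // The first `extra` slices take one additional row each.
    const int64_t length = base + (i < extra ? 1 : 0);
    if (length == 0) {
      continue;
    }
    slices.push_back({offset, length});
    offset += length;
  }
  return slices;
}
```

Each returned slice would then be passed to its own `write()` call so that every slice emits a separate chunk.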

Note: touches Velox-owned `velox/dwio/common/SelectiveColumnReader.{h,cpp}` (rename only) -- needs Velox-owner sign-off before landing.

Reviewed By: xiaoxmeng

Differential Revision: D108873522

fbshipit-source-id: 2c05c2f2d4f64b8ea20142ac99dc4b60b7014cec
… breakdown (facebookincubator#17943)

Summary:
Pull Request resolved: facebookincubator#17943

Adds 11 new runtime stats to RPCOperator, recorded at `close()` for per-operator visibility into RPC transport behavior.

**Congestion window:**
- `rpcCongestionWindowFinal`: final window limit at close
- `rpcCongestionShrinks`: total shrink events (onError halving + gradient shrinks)
- `rpcBaselineRttNanos`: learned baseline RTT (unloaded)
- `rpcPeakInFlight`: high-water mark of concurrent in-flight units

**Transport RTT:**
- `rpcRttMinWallNanos` / `rpcRttMaxWallNanos`: min/max round-trip across all completed units
- `rpcRttCount`: number of RTT samples

**Error-kind breakdown:**
- `rpcErrorKindRateLimited` / `rpcErrorKindTimeout` / `rpcErrorKindBackendError`: per-kind error counts (complement the existing aggregate `rpcErrorCount`)

**Mode:**
- `rpcStreamingMode`: 0 = PER_ROW, 1 = BATCH

**Aggregation fix:** registers 6 non-additive stats (window, peak, baseline, min/max RTT, mode) in `shouldAggregateRuntimeMetric` so they get `.aggregate()` (count/min/max) instead of being summed across drivers.
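
To see why summing is wrong for these stats, consider a minimal sketch of a metric that tracks count/min/max alongside the sum. `AggregatedMetricSketch` is a hypothetical type for illustration, not the Velox runtime-metric class.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical aggregate that keeps both the additive view (sum) and the
// non-additive view (count/min/max) of per-driver samples.
struct AggregatedMetricSketch {
  int64_t sum{0};
  int64_t count{0};
  int64_t minValue{std::numeric_limits<int64_t>::max()};
  int64_t maxValue{std::numeric_limits<int64_t>::min()};

  void add(int64_t value) {
    sum += value;
    ++count;
    minValue = std::min(minValue, value);
    maxValue = std::max(maxValue, value);
  }
};
```

If four drivers each report a final congestion window of 8, the sum is 32, which describes no real window, while min and max both stay at 8.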

**CongestionController accessors:** adds `baselineRttNs()`, `numShrinks()` public accessors and `numShrinks_` private member so `RPCState::operatorSnapshot()` can read the controller state under its own mutex.

**V2 review fixes:** renamed `shrinkCount_` to `numShrinks_` and `rttCount_` to `numRttSamples_` (Velox `numXxx` convention); fixed asymmetric shrink counting in `onError()` (only counts when window actually decreases); extracted duplicated error-kind switch into `recordErrorKind()` helper; listed all `RPCErrorKind` enum values explicitly (fixes `clang-diagnostic-switch-enum`); combined `congestionSnapshot()` + `transportSnapshot()` + `streamingMode()` into a single `operatorSnapshot() const` method (one lock acquisition, consistent snapshot); gated `kRpcRttCount` emission on `numRttSamples > 0` for consistency; added trailing comma in initializer list.
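
The shrink-counting fix can be sketched as follows. `CongestionWindowSketch` is hypothetical, and the minimum-window floor is an assumption made so that halving can become a no-op; only the rule "count a shrink only when the window actually decreases" comes from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical congestion window that halves on error, with an assumed floor.
class CongestionWindowSketch {
 public:
  explicit CongestionWindowSketch(int32_t initial, int32_t minWindow = 1)
      : window_(initial), minWindow_(minWindow) {
    assert(minWindow >= 1 && initial >= minWindow);
  }

  void onError() {
    const int32_t next = std::max(minWindow_, window_ / 2);
    // Count the shrink only if the window really got smaller.
    if (next < window_) {
      window_ = next;
      ++numShrinks_;
    }
  }

  int32_t window() const { return window_; }
  int64_t numShrinks() const { return numShrinks_; }

 private:
  int32_t window_;
  int32_t minWindow_;
  int64_t numShrinks_{0};
};
```

Starting from a window of 4, three consecutive errors shrink the window twice (4 to 2, 2 to 1); the third error leaves it at the floor and is not counted.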

Reviewed By: sebastianopeluso

Differential Revision: D109783970

fbshipit-source-id: e4da8765d1d9a32d954e1c60835035ca50bedadf
kKPulla and others added 19 commits June 27, 2026 09:56
…g keys (facebookincubator#17918)

Summary:
Pull Request resolved: facebookincubator#17918

Dynamic filters generated by a hash-join probe are pushed toward the
probe-side source by tracing each filtered column through the identity
projections of intervening operators. The trace stops at the first
operator that does not expose the filtered channel as an identity
projection.

`StreamingAggregation` did not register its grouping keys in
`identityProjections_`, unlike `HashAggregation`, so the trace broke at the
aggregation and the filter never reached the scan. A plan with a streaming
(segmented) aggregation between a hash-join probe and a table scan therefore
missed dynamic filtering on the aggregation's grouping keys.

Grouping keys are identity passthroughs to the output, so register them as
identity projections and let the existing Driver trace push dynamic filters
through to the source. No new operator support is needed.
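
The trace described above can be modeled in a few lines. This is a simplified sketch with hypothetical types (`IdentityProjection`, `OperatorSketch`, `traceToSourceSketch`), not the Velox `Driver` code: each operator lists which input channels pass through unchanged to which output channels, and the trace walks from the operator nearest the probe back toward the scan.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: an input channel passed through unchanged to an
// output channel.
struct IdentityProjection {
  int32_t inputChannel;
  int32_t outputChannel;
};

struct OperatorSketch {
  std::vector<IdentityProjection> identityProjections;
};

constexpr int32_t kNoChannel = -1;

// `operators` holds the operators between the source (front) and the join
// probe (back). Returns the source output channel the filter lands on, or
// kNoChannel if some operator does not expose the channel as an identity.
int32_t traceToSourceSketch(
    const std::vector<OperatorSketch>& operators,
    int32_t channel) {
  for (auto it = operators.rbegin(); it != operators.rend(); ++it) {
    bool found = false;
    for (const auto& projection : it->identityProjections) {
      if (projection.outputChannel == channel) {
        channel = projection.inputChannel;
        found = true;
        break;
      }
    }
    if (!found) {
      return kNoChannel; // The trace breaks here.
    }
  }
  return channel;
}
```

An aggregation with an empty `identityProjections` list breaks the trace for every channel; registering its grouping keys as identity projections lets the same loop continue to the source.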

Fixes facebookincubator#17827

Reviewed By: mbasmanova

Differential Revision: D109527298

fbshipit-source-id: 42ac6d6cd8ee358a3369926991b8d5c54975de8a
…ding with filters (facebookincubator#17954)

Summary:
X-link: facebookincubator/nimble#916

Pull Request resolved: facebookincubator#17954

Dictionary-encoded string columns in the selective Nimble reader emit a `DictionaryVector` backed by a single alphabet merged across all chunks of a read range. A filter pushed down on such a column cannot be evaluated inside the encoding's per-row decode: the merged alphabet (and therefore the per-alphabet-index `filterCache`) is not known until every chunk has been read, and the inner encoding's `bulkScan` would apply a string filter directly to raw `int32` dictionary indices. The filter is therefore suppressed during the bulk `materializeIndices` read and applied post-hoc on the merged alphabet.

This diff replaces the previous in-encoding (`populateScanState` / framework `StringDictionaryColumnVisitor`) filter evaluation with that post-hoc path, makes it SIMD-accelerated, and removes the now-dead in-encoding code:

- Shared SIMD kernel. Add a free function `filterDictionaryRunSimd<kFilterOnly>(...)` in `velox/dwio/common/ColumnVisitors.h` that gathers each index's cached verdict from `filterCache`, resolves cache misses through the filter (recording the result back), and SIMD-compacts the passing rows (and, unless `kFilterOnly`, the passing indices). Nimble's `filterByCache` calls it directly. DWRF's `StringDictionaryColumnVisitor::processRun` is left unchanged; it still holds an equivalent inline loop, marked with a TODO to adopt the shared kernel in a follow-up diff (kept separate to isolate the DWRF change from this Nimble feature).
- Post-hoc filter in `StringColumnReader`. `readWithDictionary` saves and clears the scan-spec filter so the bulk index read runs, then `filterDictionaryIndices` applies the filter on the merged alphabet via `filterByCache`, compacting `rawValues_`/`outputRows_` to the passing rows. `ensureFilterCache` lazily sizes the per-alphabet-index cache, which grows and clears together with the alphabet.
- Output-layout null realignment. Pre-compacting `rawValues_`/`outputRows_` makes the framework's `compactScalarValues` a no-op (`rows.size() == numValues_`), which skips its null move, so `filterDictionaryIndices` realigns the result nulls itself in three cases: (1) no nulls — filter in place; (2) value filter rejects nulls — filter in place and mark the compacted output all-non-null; (3) IS NULL accepts nulls — merge null rows with the passing non-null rows in row order. `ensureWritableResultNulls` mirrors the framework's `shouldMoveNulls` by switching off the dense `returnReaderNulls_` fast path and allocating an output-indexed null bitmap.
- Delete dead code. With `StringColumnReader` the sole owner of dictionary filtering, the per-encoding `kHasFilter` branches and the `prepareResultNullsForDenseFilter` helper in `encodings/common/Encoding.h` (and their call sites in the `Constant`, `Dictionary`, `MainlyConstant`, and `RLE` encodings) are unreachable and removed, along with `StringColumnReader::populateScanState`.
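
The memoization at the heart of the post-hoc filter can be shown with a scalar sketch. This is not the SIMD kernel and not the real `filterDictionaryRunSimd` signature; `Verdict` and `filterByCacheSketch` are hypothetical names, and nulls are ignored. It only shows the per-alphabet-index cache lookup, the miss path through the filter, and the in-place compaction of passing rows and indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical cached filter result for one alphabet entry.
enum class Verdict : uint8_t { kUnknown, kPass, kFail };

// Scalar sketch: evaluates `filter` at most once per distinct alphabet entry
// and compacts `indices` and `rows` in place to the passing positions.
// Returns the number of passing rows.
size_t filterByCacheSketch(
    const std::vector<std::string>& alphabet,
    std::vector<Verdict>& filterCache,
    std::vector<int32_t>& indices,
    std::vector<int32_t>& rows,
    const std::function<bool(const std::string&)>& filter) {
  assert(indices.size() == rows.size());
  assert(filterCache.size() == alphabet.size());
  size_t numPassed = 0;
  for (size_t i = 0; i < indices.size(); ++i) {
    const int32_t index = indices[i];
    assert(index >= 0 && static_cast<size_t>(index) < alphabet.size());
    Verdict& verdict = filterCache[index];
    if (verdict == Verdict::kUnknown) {
      // Cache miss: run the filter once and record the result.
      verdict = filter(alphabet[index]) ? Verdict::kPass : Verdict::kFail;
    }
    if (verdict == Verdict::kPass) {
      indices[numPassed] = index;
      rows[numPassed] = rows[i];
      ++numPassed;
    }
  }
  indices.resize(numPassed);
  rows.resize(numPassed);
  return numPassed;
}
```

For a three-entry alphabet and five rows, the filter runs three times at most regardless of row count, which is the effect the performance note below attributes to `filterCache` memoization.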

Performance. On a simulated filter workload (`parallel_reader` over a dictionary-encoded string column, `l_returnflag='R'`, opt mode, `batch_size=1024`, concurrency 16, `read_count=500`, P50), the SIMD `filterCache` path reads at ~52 ms, versus ~70 ms for the parent diff's scalar per-row filter (−26%) and 130 ms for a no-dictionary flat read (−60%). The gain comes from evaluating the byte filter once per distinct dictionary value (`filterCache` memoization) and SIMD-gathering/compacting survivors, instead of testing the filter per row. (Parent/flat figures are from the 2026-06-05 A/B; the SIMD figure was reconfirmed on the current stack on 2026-06-13: P50 wall 52/52/54 ms across 3 runs.)

No behavior change for non-dictionary columns or for dictionary columns read without a pushed-down filter.

Reviewed By: Yuhta

Differential Revision: D102273283

fbshipit-source-id: 6ff65058a06af4be9f04056e6508062b3297793b
Alchemy-item: (ID = 1691) [OAP] Allow subfield rename and deletion for Parquet format commit 1/1 - 77be661
…ter join

Signed-off-by: Yuan <yuanzhou@apache.org>

Alchemy-item: (ID = 1227) [OAP] [11771] Fix smj result mismatch issue commit 1/1 - 987fd37
Alchemy-item: (ID = 1681) feat: Enable the hash join to accept a pre-built hash table for joining commit 1/1 - b27e71c
Alchemy-item: (ID = 1294) feat: Change SpillPartitionId::kMaxSpillLevel to 7 commit 1/1 - 7280b67
This commit introduces `PartitionedVector` - a low-level execution
abstraction that provides an in-place, partition-aware layout of a
vector based on per-row partition IDs.

1. **In-place rearrangement**: Rearrange vector data in memory without
   creating multiple copies
2. **Buffer reuse**: Allow reuse of temporary buffers across multiple
   partitioning operations
3. **Minimal abstraction**: Similar to `DecodedVector`, focus on
   efficient execution rather than operator semantics
4. **Thread-unsafe by design**: Optimized for single-threaded execution
   contexts

For more information please see #1703
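
One way to picture an in-place, partition-aware layout is a counting pass followed by swaps into each partition's next free slot. The sketch below is an assumption about the general technique, not the `PartitionedVector` implementation; `partitionInPlaceSketch` is a hypothetical name, and the result is not stable within a partition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Rearranges `values` (and `partitionIds`) in place so that rows of the same
// partition are contiguous. Returns offsets of size numPartitions + 1, where
// partition p occupies [offsets[p], offsets[p + 1]).
std::vector<int32_t> partitionInPlaceSketch(
    std::vector<int64_t>& values,
    std::vector<int32_t>& partitionIds,
    int32_t numPartitions) {
  assert(values.size() == partitionIds.size());
  std::vector<int32_t> offsets(numPartitions + 1, 0);
  for (const int32_t id : partitionIds) {
    assert(id >= 0 && id < numPartitions);
    ++offsets[id + 1];
  }
  for (int32_t p = 0; p < numPartitions; ++p) {
    offsets[p + 1] += offsets[p];
  }
  // next[p] is the first slot of partition p not yet known to be in place.
  std::vector<int32_t> next(offsets.begin(), offsets.end() - 1);
  for (int32_t p = 0; p < numPartitions; ++p) {
    while (next[p] < offsets[p + 1]) {
      const int32_t id = partitionIds[next[p]];
      if (id == p) {
        ++next[p];
      } else {
        // Move the row into its own partition and re-examine the swapped-in row.
        const int32_t dst = next[id]++;
        std::swap(values[next[p]], values[dst]);
        std::swap(partitionIds[next[p]], partitionIds[dst]);
      }
    }
  }
  return offsets;
}
```

Only the two small `offsets` and `next` arrays are allocated, so they are the kind of temporary buffer that could be reused across partitioning calls.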

Alchemy-item: (ID = 1150) Introducing PartitionedVector commit 1/1 - 960f41b

Alchemy-item: (ID = 1327) Optimized PartitionedOutput staging hub commit 1/11 - 76dc41a

Alchemy-item: (ID = 1596) Optimized PartitionedOutput staging hub commit 1/11 - 2c59ee6

Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 1/11 - 23e8d1d
Signed-off-by: Xin Zhang <xin-zhang2@ibm.com>

Alchemy-item: (ID = 1167) Add PartitionedRowVector commit 1/1 - f2af427

Alchemy-item: (ID = 1327) Optimized PartitionedOutput staging hub commit 2/11 - 3853bf6

Alchemy-item: (ID = 1596) Optimized PartitionedOutput staging hub commit 2/11 - 71705c7

Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 2/11 - 7df0be4
…dthValuesInPlace

Alchemy-item: (ID = 1327) Optimized PartitionedOutput staging hub commit 3/11 - ff2e34b

Alchemy-item: (ID = 1596) Optimized PartitionedOutput staging hub commit 3/11 - 3d9e709

Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 3/11 - 5719f90
Alchemy-item: (ID = 1327) Optimized PartitionedOutput staging hub commit 4/11 - 875c92c

Alchemy-item: (ID = 1596) Optimized PartitionedOutput staging hub commit 4/11 - d787419

Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 4/11 - fad8064
PartitionedFlatVector::partition() and PartitionedRowVector::partition()
called mutableRawNulls() unconditionally. mutableRawNulls() allocates a
null buffer if one does not exist, causing mayHaveNulls() to return true
for every vector after partitioning, even when the original had no nulls.

Fix both sites to check rawNulls() first and only call mutableRawNulls()
when a null buffer already exists.
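
The pattern behind the fix, reduced to a standalone sketch: `VectorSketch` below is a hypothetical miniature of a vector with an optional null buffer, and `permuteNullsSketch` stands in for the partitioning code. Only the accessor names mirror the ones mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical miniature of a vector whose null buffer is optional.
struct VectorSketch {
  std::vector<int64_t> values;
  std::unique_ptr<std::vector<bool>> nulls; // Absent means "no nulls".

  const std::vector<bool>* rawNulls() const { return nulls.get(); }

  // Allocates the null buffer on demand, like the accessor described above.
  std::vector<bool>* mutableRawNulls() {
    if (!nulls) {
      nulls = std::make_unique<std::vector<bool>>(values.size(), false);
    }
    return nulls.get();
  }

  bool mayHaveNulls() const { return nulls != nullptr; }
};

// Reorders null flags so that output slot i takes the flag of row perm[i].
void permuteNullsSketch(VectorSketch& vector, const std::vector<int32_t>& perm) {
  // The fix: check rawNulls() first so a null-free vector stays null-free.
  if (vector.rawNulls() == nullptr) {
    return;
  }
  std::vector<bool>* nulls = vector.mutableRawNulls();
  assert(perm.size() == nulls->size());
  std::vector<bool> moved(nulls->size());
  for (size_t i = 0; i < perm.size(); ++i) {
    moved[i] = (*nulls)[perm[i]];
  }
  *nulls = moved;
}
```

Without the early return, `mutableRawNulls()` would allocate a buffer for a vector that had none, and `mayHaveNulls()` would flip to true after every partitioning call.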

Add noNullBufferAllocatedForNullFreeFlat and
noNullBufferAllocatedForNullFreeRow tests to PartitionedVectorTest to
cover this case.

# Conflicts:
#	velox/vector/PartitionedVector.cpp

Alchemy-item: (ID = 1327) Optimized PartitionedOutput staging hub commit 5/11 - 281a365

Alchemy-item: (ID = 1596) Optimized PartitionedOutput staging hub commit 5/11 - 652dd0a

Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 5/11 - 04fbad0
Alchemy-item: (ID = 1327) Optimized PartitionedOutput staging hub commit 6/11 - 6519a8f

Alchemy-item: (ID = 1596) Optimized PartitionedOutput staging hub commit 6/11 - 28c45bd

Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 6/11 - c3a52c9
Alchemy-item: (ID = 1327) Optimized PartitionedOutput staging hub commit 7/11 - d8f34b4

Alchemy-item: (ID = 1596) Optimized PartitionedOutput staging hub commit 7/11 - 59b321a

Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 7/11 - 9efe82a
Alchemy-item: (ID = 1327) Optimized PartitionedOutput staging hub commit 8/11 - 9eafc9d

Alchemy-item: (ID = 1596) Optimized PartitionedOutput staging hub commit 8/11 - b27c492

Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 8/11 - ff888e7
This commit introduces PrestoIterativePartitioningSerializer, which
buffers RowVectors across multiple append() calls, partitions rows
in-place using PartitionedVector, and on flush() serializes each
non-empty partition into a Presto wire-format IOBuf. The serializer has
no dependency on velox_exec: it returns raw folly::IOBuf objects,
leaving SerializedPage creation to the caller.

Alchemy-item: (ID = 1327) Optimized PartitionedOutput staging hub commit 9/11 - 6f09ea9

Alchemy-item: (ID = 1596) Optimized PartitionedOutput staging hub commit 9/11 - 48018b3

Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 9/11 - 1511c01
This commit introduces OptimizedPartitionedOutput, a PartitionedOutput
operator backed by PrestoIterativePartitioningSerializer. Enabled via query
config key "optimized_repartitioning" (default off). LocalPlanner
selects it over the standard PartitionedOutput when the flag is set.

TODO: replicateNullsAndAny is not yet supported and raises a user error.

Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 10/11 - e567d50
…geBenchmark

- Added normal vs optimized PartitionedOutput comparison by running each
  exchange case twice with kOptimizedPartitionedOutputEnabled=false/true.
- Added per-mode benchmark names in ExchangeBenchmark.cpp:
  - exchange<Case>_normalPartitionedOutput
  - exchange<Case>_optimizedPartitionedOutput
- Refactored result printing into shared helpers and fixed output
  consistency in ExchangeBenchmark.cpp.

Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 11/11 - ee25fa7
Alchemy-item: (ID = 1673) feat: Implement se/dser method for HashTable commit 1/1 - 5624f6e
Signed-off-by: Linsong Wang <linsong.wang@ibm.com>