feat(selective): SIMD filterCache evaluation for dictionary index reading with filters (#17954)#1433
prestodb-ci wants to merge 25 commits into
|
Test passed for commit 1777c7523, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/394/display/redirect for details |
Signed-off-by: Yuan <yuanzhou@apache.org>

Set ccache maximum size to 1G

Remove sed command from Gluten workflow
Removed a sed command that replaces 'oap-project' with 'IBM' in the get-velox.sh script.

Modify get-velox.sh to change 'ibm' to 'ibm-xxx'
Update the get-velox.sh script to replace 'ibm' with 'ibm-xxx'.

Update sed command to be case-insensitive

Update gluten.yml

fix iceberg unit test
Signed-off-by: Yuan <yuanzhou@apache.org>

Update gluten.yml

Enable enhanced features in gluten build script

Update cache keys for Gluten workflow
|
❌ Test commit 2a50cd131 failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/PR-2057/2/display/redirect for details |
|
❌ Test commit 0007f37c6 failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/PR-2057/4/display/redirect for details |
|
❌ Test commit 0007f37c6 failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/PR-2057/5/display/redirect for details |
28dabdb to
3026e31
|
❌ Test commit 0007f37c6 failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/PR-2057/6/display/redirect for details |
|
❌ Test commit 0007f37c6 failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/PR-2057/7/display/redirect for details |
|
❌ Test commit 0007f37c6 failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/PR-2057/8/display/redirect for details |
|
❌ Test commit 0007f37c6 failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/PR-2057/9/display/redirect for details |
3026e31 to
72886ca
set restore key
72886ca to
4a8bc9b
|
❌ Test commit a017fca78 failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/427/display/redirect for details |
|
❌ Test commit a236cf5b3 failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/428/display/redirect for details |
|
❌ Test commit 0f21ff954 failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/429/display/redirect for details |
|
❌ Test commit e01ca2187 failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/430/display/redirect for details |
|
❌ Test commit 4f39eef54 failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/431/display/redirect for details |
|
❌ Test commit eba05296b failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/437/display/redirect for details |
|
❌ Test commit bf9e9fd6d failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/440/display/redirect for details |
|
❌ Test commit 4ff28c6bf failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/449/display/redirect for details |
|
❌ Test commit 1a4d1241a failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/1814/display/redirect for details |
|
❌ Test commit 39db1a2c5 failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/1815/display/redirect for details |
|
❌ Test commit ae44d34da failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/1819/display/redirect for details |
|
❌ Test commit 3069f55ca failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/1821/display/redirect for details |
|
❌ Test commit 69519c68d failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/1822/display/redirect for details |
|
❌ Test commit f13ddcb8c failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/1827/display/redirect for details |
|
❌ Test commit 9e63ccfe0 failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/1830/display/redirect for details |
…ubator#17940)

Summary:
Pull Request resolved: facebookincubator#17940

X-link: facebookincubator/nimble#911

Hardens the dictionary visitor read path against cross-chunk null-handling bugs surfaced by multi-chunk fuzzing, plus a clarity rename of one helper and an aarch64 build fix. Sits on top of D108709293 (RLEFix) in the dictionary vector stack.

## Bug fix: stale result-null bits leak across buffer-reused reads

When a chunk is nullable and the output row set is sparse/compacted (`returnReaderNulls_ == false`), the dictionary-index read path cleared `resultNulls_` only lazily -- it touched the buffer only when it actually emitted a null. A read whose surviving rows are all non-null therefore never cleared `resultNulls_`, so a stale null bit left over from a prior (buffer-reused) read leaked into the result vector and corrupted the output.

Fix: prepare the result-null buffer eagerly in the sparse dictionary-index reader instead of on the first emitted null. `readSparseMaterializedIndices` now clears `resultNulls_` up front whenever the read range carries nulls (`rawNulls != nullptr`), so an all-non-null read still starts from a clean buffer.

The prep lives in the reader that sees `rawNulls`, not in a single `NullableEncoding` chokepoint, because output nulls have two sources: the encoding's own nulls (via `NullableEncoding`) and incoming/inMap nulls on a non-nullable column nested under a nullable parent (e.g. a flat-map dict child with inMap nulls). The latter has no `NullableEncoding` in its read path, so a chokepoint there would never run for it; `rawNulls` is set for both, matching how the dense and filtered index readers already prepare their buffers.

## Rename: `initReturnReaderNulls` -> `setReturnNullsMode`

The helper only sets which buffer `resultNulls()` returns for the read -- it sets the `returnReaderNulls_` flag (and `anyNulls_`) and allocates nothing. The old name implied buffer initialization/allocation. Renamed across the Velox `SelectiveColumnReader` API and all Nimble call sites (legacy and non-legacy `NullableEncoding`, `ChunkedDecoder`, the `ReadWithVisitorParams` field in `Encoding.h`, `ReadWithVisitorTest`, `EncodingBench`).

## Build fix: bucket_benchmark on aarch64

`dwio/nimble/encodings/tests:bucket_benchmark` hardcoded x86-only tuning flags (`-mavx`, `-mavx2`, `-mbmi`, `-mbmi2`, `-mclzero`, `-mlzcnt`) in `compiler_flags`, which clang rejects for `aarch64-redhat-linux-gnu`, so the target failed to build on aarch64. The benchmark source uses no raw x86 intrinsics, so the flags are now arch-gated with `select()` (applied only on `ovr_config//cpu:x86_64`) and the portable source builds on aarch64 without them. Pre-existing breakage surfaced by this diff rebuilding the shared encoding headers.

## Test coverage: exercise cross-chunk resume paths

`E2EFilterTest` previously wrote a single chunk per stream, so the cross-chunk reader resume paths (where the above bug lives) were never exercised by standalone tests. Added two helpers and applied them across the dictionary / mainly-constant / RLE / nullable / fuzz cases:

- `applyMultiChunkOptions(VeloxWriterOptions&)` -- enable chunking, zero `minStreamChunkRawSize`, and a flush policy that never flushes the stripe but chunks at every `write()` boundary.
- `writeInChunks(writer, batch, numChunks=3)` -- slice a batch into N row-slices so each emits its own chunk (a single `write()` yields only one chunk).

Note: touches Velox-owned `velox/dwio/common/SelectiveColumnReader.{h,cpp}` (rename only) -- needs Velox-owner sign-off before landing.

Reviewed By: xiaoxmeng

Differential Revision: D108873522

fbshipit-source-id: 2c05c2f2d4f64b8ea20142ac99dc4b60b7014cec
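The stale-null bug described in the commit message above can be illustrated with a small sketch. This is not the Nimble or Velox code: `ToyReader`, `readLazy`, and `readEager` are hypothetical names, and the sketch assumes a convention where a `true` entry means "row is null". It only shows why a reused buffer needs to be cleared before a read that emits no nulls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a reader that reuses one result-null buffer
// across reads. In this sketch a true entry means "row is null".
struct ToyReader {
  std::vector<bool> resultNulls;

  // Lazy preparation: the buffer is written only when a null is emitted,
  // so entries left over from an earlier read survive.
  void readLazy(const std::vector<bool>& rowIsNull) {
    resultNulls.resize(rowIsNull.size());
    for (std::size_t i = 0; i < rowIsNull.size(); ++i) {
      if (rowIsNull[i]) {
        resultNulls[i] = true;
      }
    }
  }

  // Eager preparation: clear the buffer up front, then record nulls.
  void readEager(const std::vector<bool>& rowIsNull) {
    resultNulls.assign(rowIsNull.size(), false);
    for (std::size_t i = 0; i < rowIsNull.size(); ++i) {
      if (rowIsNull[i]) {
        resultNulls[i] = true;
      }
    }
  }
};
```

With two back-to-back reads, where only the first contains a null, the lazy variant still reports that null after the second read, while the eager variant does not.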
… breakdown (facebookincubator#17943)

Summary:
Pull Request resolved: facebookincubator#17943

Adds 11 new runtime stats to RPCOperator, recorded at `close()` for per-operator visibility into RPC transport behavior.

**Congestion window:**
- `rpcCongestionWindowFinal`: final window limit at close
- `rpcCongestionShrinks`: total shrink events (onError halving + gradient shrinks)
- `rpcBaselineRttNanos`: learned baseline RTT (unloaded)
- `rpcPeakInFlight`: high-water mark of concurrent in-flight units

**Transport RTT:**
- `rpcRttMinWallNanos` / `rpcRttMaxWallNanos`: min/max round-trip across all completed units
- `rpcRttCount`: number of RTT samples

**Error-kind breakdown:**
- `rpcErrorKindRateLimited` / `rpcErrorKindTimeout` / `rpcErrorKindBackendError`: per-kind error counts (complement the existing aggregate `rpcErrorCount`)

**Mode:**
- `rpcStreamingMode`: 0 = PER_ROW, 1 = BATCH

**Aggregation fix:** registers 6 non-additive stats (window, peak, baseline, min/max RTT, mode) in `shouldAggregateRuntimeMetric` so they get `.aggregate()` (count/min/max) instead of being summed across drivers.

**CongestionController accessors:** adds `baselineRttNs()`, `numShrinks()` public accessors and `numShrinks_` private member so `RPCState::operatorSnapshot()` can read the controller state under its own mutex.

**V2 review fixes:** renamed `shrinkCount_` to `numShrinks_` and `rttCount_` to `numRttSamples_` (Velox `numXxx` convention); fixed asymmetric shrink counting in `onError()` (only counts when window actually decreases); extracted duplicated error-kind switch into `recordErrorKind()` helper; listed all `RPCErrorKind` enum values explicitly (fixes `clang-diagnostic-switch-enum`); combined `congestionSnapshot()` + `transportSnapshot()` + `streamingMode()` into a single `operatorSnapshot() const` method (one lock acquisition, consistent snapshot); gated `kRpcRttCount` emission on `numRttSamples > 0` for consistency; added trailing comma in initializer list.

Reviewed By: sebastianopeluso

Differential Revision: D109783970

fbshipit-source-id: e4da8765d1d9a32d954e1c60835035ca50bedadf
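To see why the non-additive stats in the commit message above should not be summed across drivers, consider a minimal sketch. `ToyAggregate` and `aggregateAcrossDrivers` are hypothetical names, not the Velox runtime-metric API; the sketch only assumes that each driver reports one value (for example its peak in-flight count) and compares a sum with a count/min/max view.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical aggregate: keeps the sum next to count/min/max so the two
// views of the same per-driver values can be compared.
struct ToyAggregate {
  int64_t count = 0;
  int64_t sum = 0;
  int64_t minValue = std::numeric_limits<int64_t>::max();
  int64_t maxValue = std::numeric_limits<int64_t>::min();
};

// Folds one reported value per driver into a single aggregate.
ToyAggregate aggregateAcrossDrivers(const std::vector<int64_t>& perDriver) {
  ToyAggregate result;
  for (int64_t value : perDriver) {
    ++result.count;
    result.sum += value;
    result.minValue = std::min(result.minValue, value);
    result.maxValue = std::max(result.maxValue, value);
  }
  return result;
}
```

For three drivers reporting peaks of 4, 7, and 5, the sum (16) describes no real state of any driver, whereas the min/max pair (4 and 7) bounds what any single driver observed.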
|
❌ Test commit a018cad73 failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/1836/display/redirect for details |
…g keys (facebookincubator#17918)

Summary:
Pull Request resolved: facebookincubator#17918

Dynamic filters generated by a hash-join probe are pushed toward the probe-side source by tracing each filtered column through the identity projections of intervening operators. The trace stops at the first operator that does not expose the filtered channel as an identity projection.

`StreamingAggregation` did not register its grouping keys in `identityProjections_`, unlike `HashAggregation`, so the trace broke at the aggregation and the filter never reached the scan. A plan with a streaming (segmented) aggregation between a hash-join probe and a table scan therefore missed dynamic filtering on the aggregation's grouping keys.

Grouping keys are identity passthroughs to the output, so register them as identity projections and let the existing Driver trace push dynamic filters through to the source. No new operator support is needed.

Fixes facebookincubator#17827

Reviewed By: mbasmanova

Differential Revision: D109527298

fbshipit-source-id: 42ac6d6cd8ee358a3369926991b8d5c54975de8a
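The identity-projection trace described in the commit message above can be sketched as follows. This is a simplified model, not the Velox `Driver` code: `ToyOperator` and `traceToSource` are hypothetical, and each operator is assumed to expose its identity projections as a map from output channel to input channel.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical operator model: identityProjections maps an output channel
// to the input channel it passes through unchanged.
struct ToyOperator {
  std::string name;
  std::map<int, int> identityProjections;
};

// Walks from the join probe back toward the source. `upstream` is ordered
// from the operator nearest the join to the one nearest the scan. Returns
// the source channel the filter lands on, or -1 if some operator does not
// expose the channel as an identity projection and the trace stops early.
int traceToSource(const std::vector<ToyOperator>& upstream, int channel) {
  for (const ToyOperator& op : upstream) {
    auto it = op.identityProjections.find(channel);
    if (it == op.identityProjections.end()) {
      return -1;
    }
    channel = it->second;
  }
  return channel;
}
```

In this model, an aggregation with an empty `identityProjections` map stops the trace, and registering its grouping keys in the map lets the same loop reach the source unchanged.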
…ding with filters (facebookincubator#17954)

Summary:
X-link: facebookincubator/nimble#916

Pull Request resolved: facebookincubator#17954

Dictionary-encoded string columns in the selective Nimble reader emit a `DictionaryVector` backed by a single alphabet merged across all chunks of a read range. A filter pushed down on such a column cannot be evaluated inside the encoding's per-row decode: the merged alphabet (and therefore the per-alphabet-index `filterCache`) is not known until every chunk has been read, and the inner encoding's `bulkScan` would apply a string filter directly to raw `int32` dictionary indices. The filter is therefore suppressed during the bulk `materializeIndices` read and applied post-hoc on the merged alphabet.

This diff replaces the previous in-encoding (`populateScanState` / framework `StringDictionaryColumnVisitor`) filter evaluation with that post-hoc path, makes it SIMD-accelerated, and removes the now-dead in-encoding code:

- Shared SIMD kernel. Add a free function `filterDictionaryRunSimd<kFilterOnly>(...)` in `velox/dwio/common/ColumnVisitors.h` that gathers each index's cached verdict from `filterCache`, resolves cache misses through the filter (recording the result back), and SIMD-compacts the passing rows (and, unless `kFilterOnly`, the passing indices). Nimble's `filterByCache` calls it directly. DWRF's `StringDictionaryColumnVisitor::processRun` is left unchanged; it still holds an equivalent inline loop, marked with a TODO to adopt the shared kernel in a follow-up diff (kept separate to isolate the DWRF change from this Nimble feature).
- Post-hoc filter in `StringColumnReader`. `readWithDictionary` saves and clears the scan-spec filter so the bulk index read runs, then `filterDictionaryIndices` applies the filter on the merged alphabet via `filterByCache`, compacting `rawValues_`/`outputRows_` to the passing rows. `ensureFilterCache` lazily sizes the per-alphabet-index cache, which grows and clears together with the alphabet.
- Output-layout null realignment. Pre-compacting `rawValues_`/`outputRows_` makes the framework's `compactScalarValues` a no-op (`rows.size() == numValues_`), which skips its null move, so `filterDictionaryIndices` realigns the result nulls itself in three cases: (1) no nulls: filter in place; (2) value filter rejects nulls: filter in place and mark the compacted output all-non-null; (3) IS NULL accepts nulls: merge null rows with the passing non-null rows in row order. `ensureWritableResultNulls` mirrors the framework's `shouldMoveNulls` by switching off the dense `returnReaderNulls_` fast path and allocating an output-indexed null bitmap.
- Delete dead code. With `StringColumnReader` the sole owner of dictionary filtering, the per-encoding `kHasFilter` branches and the `prepareResultNullsForDenseFilter` helper in `encodings/common/Encoding.h` (and their call sites in the `Constant`, `Dictionary`, `MainlyConstant`, and `RLE` encodings) are unreachable and removed, along with `StringColumnReader::populateScanState`.

Performance. On a simulated filter workload (`parallel_reader` over a dictionary-encoded string column, `l_returnflag='R'`, opt mode, `batch_size=1024`, concurrency 16, `read_count=500`, P50), the SIMD `filterCache` path reads at ~52 ms, versus ~70 ms for the parent diff's scalar per-row filter (−26%) and 130 ms for a no-dictionary flat read (−60%). The gain comes from evaluating the byte filter once per distinct dictionary value (`filterCache` memoization) and SIMD-gathering/compacting survivors, instead of testing the filter per row. (Parent/flat figures are from the 2026-06-05 A/B; the SIMD figure was reconfirmed on the current stack on 2026-06-13: P50 wall 52/52/54 ms across 3 runs.)

No behavior change for non-dictionary columns or for dictionary columns read without a pushed-down filter.

Reviewed By: Yuhta

Differential Revision: D102273283

fbshipit-source-id: 6ff65058a06af4be9f04056e6508062b3297793b
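The `filterCache` memoization described in the commit message above can be shown with a scalar sketch. This is deliberately not the SIMD kernel and not the signature of `filterDictionaryRunSimd` or `filterByCache`: `Verdict` and `filterByCacheScalar` are hypothetical names, nulls are ignored, and the only point is that the filter runs once per distinct dictionary value rather than once per row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-alphabet-index cache entry.
enum class Verdict : uint8_t { kUnknown, kPass, kFail };

// Scalar sketch: looks up each row's dictionary index in the cache,
// resolves misses through the filter (recording the result back), and
// returns the row numbers that pass. `cache` must have one entry per
// alphabet value, initialized to kUnknown.
template <typename Filter>
std::vector<int32_t> filterByCacheScalar(
    const std::vector<int32_t>& indices,
    const std::vector<std::string>& alphabet,
    std::vector<Verdict>& cache,
    Filter&& filter) {
  std::vector<int32_t> passingRows;
  for (std::size_t row = 0; row < indices.size(); ++row) {
    const int32_t index = indices[row];
    if (cache[index] == Verdict::kUnknown) {
      cache[index] = filter(alphabet[index]) ? Verdict::kPass : Verdict::kFail;
    }
    if (cache[index] == Verdict::kPass) {
      passingRows.push_back(static_cast<int32_t>(row));
    }
  }
  return passingRows;
}
```

With a three-value alphabet and five rows, the filter is invoked at most three times regardless of row count; a SIMD version would replace the per-row branch with a gather of cached verdicts and a compaction of the surviving rows.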
Alchemy-item: (ID = 1691) [OAP] Allow subfield rename and deletion for Parquet format commit 1/1 - 77be661
…ter join Signed-off-by: Yuan <yuanzhou@apache.org> Alchemy-item: (ID = 1227) [OAP] [11771] Fix smj result mismatch issue commit 1/1 - 987fd37
Alchemy-item: (ID = 1681) feat: Enable the hash join to accept a pre-built hash table for joining commit 1/1 - b27e71c
Alchemy-item: (ID = 1294) feat: Change SpillPartitionId::kMaxSpillLevel to 7 commit 1/1 - 7280b67
This commit introduces `PartitionedVector` - a low-level execution abstraction that provides an in-place, partition-aware layout of a vector based on per-row partition IDs.

1. **In-place rearrangement**: Rearrange vector data in memory without creating multiple copies
2. **Buffer reuse**: Allow reuse of temporary buffers across multiple partitioning operations
3. **Minimal abstraction**: Similar to `DecodedVector`, focus on efficient execution rather than operator semantics
4. **Thread-unsafe by design**: Optimized for single-threaded execution contexts

For more information please see #1703

Alchemy-item: (ID = 1150) Introducing PartitionedVector commit 1/1 - 960f41b
Alchemy-item: (ID = 1327) Optimized PartitionedOutput staging hub commit 1/11 - 76dc41a
Alchemy-item: (ID = 1596) Optimized PartitionedOutput staging hub commit 1/11 - 2c59ee6
Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 1/11 - 23e8d1d
Signed-off-by: Xin Zhang <xin-zhang2@ibm.com> Alchemy-item: (ID = 1167) Add PartitionedRowVector commit 1/1 - f2af427 Alchemy-item: (ID = 1327) Optimized PartitionedOutput staging hub commit 2/11 - 3853bf6 Alchemy-item: (ID = 1596) Optimized PartitionedOutput staging hub commit 2/11 - 71705c7 Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 2/11 - 7df0be4
PartitionedFlatVector::partition() and PartitionedRowVector::partition() called mutableRawNulls() unconditionally. mutableRawNulls() allocates a null buffer if one does not exist, causing mayHaveNulls() to return true for every vector after partitioning, even when the original had no nulls.

Fix both sites to check rawNulls() first and only call mutableRawNulls() when a null buffer already exists.

Add noNullBufferAllocatedForNullFreeFlat and noNullBufferAllocatedForNullFreeRow tests to PartitionedVectorTest to cover this case.

# Conflicts:
#	velox/vector/PartitionedVector.cpp

Alchemy-item: (ID = 1327) Optimized PartitionedOutput staging hub commit 5/11 - 281a365
Alchemy-item: (ID = 1596) Optimized PartitionedOutput staging hub commit 5/11 - 652dd0a
Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 5/11 - 04fbad0
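The null-buffer fix described in the commit message above follows a common pattern: test the read-only accessor before calling the allocating one. The sketch below uses a hypothetical `ToyVector` whose method names mirror the ones mentioned in the commit, but it is not the Velox `BaseVector` API, and the partitioning work itself is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical vector with an optional, lazily allocated null buffer.
struct ToyVector {
  std::size_t size = 0;
  std::unique_ptr<std::vector<bool>> nulls;

  // Read-only accessor: never allocates.
  const std::vector<bool>* rawNulls() const { return nulls.get(); }

  // Mutable accessor: allocates the buffer if it does not exist yet.
  std::vector<bool>* mutableRawNulls() {
    if (!nulls) {
      nulls.reset(new std::vector<bool>(size, false));
    }
    return nulls.get();
  }

  bool mayHaveNulls() const { return nulls != nullptr; }
};

// Buggy shape: always asks for the mutable buffer, so one gets allocated.
void partitionUnconditional(ToyVector& vector) {
  std::vector<bool>* buffer = vector.mutableRawNulls();
  (void)buffer; // null rearrangement would happen here
}

// Fixed shape: only touch the mutable buffer when nulls already exist.
void partitionChecked(ToyVector& vector) {
  if (vector.rawNulls() != nullptr) {
    std::vector<bool>* buffer = vector.mutableRawNulls();
    (void)buffer; // null rearrangement would happen here
  }
}
```

A null-free vector keeps `mayHaveNulls() == false` after `partitionChecked`, whereas `partitionUnconditional` flips it to `true` as a side effect of the allocation.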
This commit introduces PrestoIterativePartitioningSerializer, which buffers RowVectors across multiple append() calls, partitions rows in-place using PartitionedVector, and on flush() serializes each non-empty partition into a Presto wire-format IOBuf. The serializer has no dependency on velox_exec: it returns raw folly::IOBuf objects, leaving SerializedPage creation to the caller. Alchemy-item: (ID = 1327) Optimized PartitionedOutput staging hub commit 9/11 - 6f09ea9 Alchemy-item: (ID = 1596) Optimized PartitionedOutput staging hub commit 9/11 - 48018b3 Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 9/11 - 1511c01
This commit introduces OptimizedPartitionedOutput, a PartitionedOutput operator backed by PrestoIterativePartitioningSerializer. Enabled via query config key "optimized_repartitioning" (default off). LocalPlanner selects it over the standard PartitionedOutput when the flag is set. TODO: replicateNullsAndAny is not yet supported and raises a user error. Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 10/11 - e567d50
…geBenchmark

- Added normal vs optimized PartitionedOutput comparison by running each exchange case twice with kOptimizedPartitionedOutputEnabled=false/true.
- Added per-mode benchmark names in ExchangeBenchmark.cpp:
  - exchange<Case>_normalPartitionedOutput
  - exchange<Case>_optimizedPartitionedOutput
- Refactored result printing into shared helpers and fixed output consistency in ExchangeBenchmark.cpp.

Alchemy-item: (ID = 1682) Optimized PartitionedOutput staging hub commit 11/11 - ee25fa7
Alchemy-item: (ID = 1673) feat: Implement se/dser method for HashTable commit 1/1 - 5624f6e
Signed-off-by: Linsong Wang <linsong.wang@ibm.com>
|
❌ Test commit e06dd2b5e failed, open https://ci.ibm.prestodb.dev/job/presto-performance/job/presto-performance/job/pipeline-rebase-ibm-velox/1839/display/redirect for details |
Test PR for branch staging-rebase-pr with head e06dd2b