feat(PartitionedOutput): Add dictionary encoding support by xin-zhang2 · Pull Request #2119 · IBM/velox

xin-zhang2 · 2026-06-09T14:53:20Z

No description provided.

This commit introduces `PartitionedVector` - a low-level execution abstraction that provides an in-place, partition-aware layout of a vector based on per-row partition IDs. 1. **In-place rearrangement**: Rearrange vector data in memory without creating multiple copies 2. **Buffer reuse**: Allow reuse of temporary buffers across multiple partitioning operations 3. **Minimal abstraction**: Similar to `DecodedVector`, focus on efficient execution rather than operator semantics 4. **Thread-unsafe by design**: Optimized for single-threaded execution contexts For more information please see IBM#1703 Alchemy-item: (ID = 1150) Introducing PartitionedVector commit 1/1 - 960f41b

Signed-off-by: Xin Zhang <xin-zhang2@ibm.com> Alchemy-item: (ID = 1167) Add PartitionedRowVector commit 1/1 - f2af427

…dthValuesInPlace

PartitionedFlatVector::partition() and PartitionedRowVector::partition() called mutableRawNulls() unconditionally. mutableRawNulls() allocates a null buffer if one does not exist, causing mayHaveNulls() to return true for every vector after partitioning, even when the original had no nulls. Fix both sites to check rawNulls() first and only call mutableRawNulls() when a null buffer already exists. Add noNullBufferAllocatedForNullFreeFlat and noNullBufferAllocatedForNullFreeRow tests to PartitionedVectorTest to cover this case. # Conflicts: # velox/vector/PartitionedVector.cpp

This commit introduces PrestoIterativePartitioningSerializer, which buffers RowVectors across multiple append() calls, partitions rows in-place using PartitionedVector, and on flush() serializes each non-empty partition into a Presto wire-format IOBuf. The serializer has no dependency on velox_exec: it returns raw folly::IOBuf objects, leaving SerializedPage creation to the caller.

This commit introduces OptimizedPartitionedOutput, a PartitionedOutput operator backed by PrestoIterativePartitioningSerializer. Enabled via query config key "optimized_repartitioning" (default off). LocalPlanner selects it over the standard PartitionedOutput when the flag is set. TODO: replicateNullsAndAny is not yet supported and raises a user error.

…geBenchmark - Added normal vs optimized PartitionedOutput comparison by running each exchange case twice with kOptimizedPartitionedOutputEnabled=false/true. - Added per-mode benchmark names: - exchange<Case>_normalPartitionedOutput - exchange<Case>_optimizedPartitionedOutput in ExchangeBenchmark.cpp. - Refactored result printing into shared helpers and fixed output consistency in ExchangeBenchmark.cpp.

…mark Split the local partition exchange benchmark out of ExchangeBenchmark into its own executable and CMake target, while keeping the local benchmark logic and statistics reporting available in a dedicated binary.

…tioningSerializer

…fferManager listeners Pass an OutputBufferManager-backed listener factory into PrestoIterativePartitioningSerializer so the optimized path uses the same listener source as normal PartitionedOutput. Create per-partition listeners during flush, set the checksum bit only when a listener is present, and compute the page checksum only for PrestoOutputStreamListener instances. Also add tests that verify checksum headers are written and that the serialized pages round-trip through the standard deserializer.

…rting - add explicit simple-schema benchmark cases by type and column count - register normal and optimized runs as separate named benchmark cases - make `dictPct` apply per generated vector and recurse into nested types - generate benchmark input vectors directly with optional nulls - replace ad hoc flat input generation with explicit input specs - return `ExchangeRunStats` from benchmark runs and centralize query config - group printed results by dataset with normal vs. optimized stats

The new OptimizedVectorHasher is up to 2-3x faster than VectorHasher.

BatchMaker always allocates a null buffer even when no rows are null. This commit removes it so benchmarks measure the non-nullable path faithfully. Plus some minor format cleanups.

Currently Velox never passes CMAKE_BUILD_TYPE into Folly's own configure step, while cmake_install only forwards arbitrary caller flags, so Folly was not built in release mode when Velox is built in release mode. This commit adds -DCMAKE_BUILD_TYPE="${CMAKE_BUILD_TYPE}" to FOLLY_FLAGS, so a release Velox dependency setup now builds release Folly on macOS and Linux.

- Remove benchmarks with vector size 1000000 and only keep vector size 10000. - Add tinyint and smallint benchmarks.

Introduce OptimizedHashPartitionFunction as a faster drop-in replacement for HashPartitionFunction, gated behind a new query config flag optimized_hash_partition_function_enabled (default false). partition() is improved from 50% to over 200x. Add HashPartitionFunctionBase as a common base exposing numPartitions(), and createHashPartitionFunction() factories that select the implementation based on the flag. Thread QueryConfig* through PartitionFunctionSpec::create() and update callsites (LocalPartition, PartitionedOutput, MarkDistinct, RowNumber, Window, SubPartitionedSortWindowBuild, HiveConnector) to construct partition functions via the factory. Register CMake targets for the new test and benchmark binaries.

Reimplement PrestoIterativePartitioningSerializer::flushRowColumn around bulk bit operations and fix two coupled null-handling defects in the simple-column path. The three changes are interdependent: correct null serialization requires all of them, so they land together. flushRowColumn: - AND the incoming parentNulls into this level's own nulls in place with bits::andBits, reusing the vector's own buffer (or sharing the parent's when there are no own nulls) instead of allocating a per-batch buffer. - Count parent-live and live rows per partition with bits::countBits to drive child rowCounts/parentNullCounts and the footer numRows. - Compact the live bitmap across batch boundaries with bits::extractBits via a branchless compactBits/appendLowBits helper; offsets become a prefix sum over the compacted bitmap, with a sequential fast path for null-free partitions. flushFlatValues: stop calling mutableRawNulls() (which materialized an all-not-null buffer and masked the no-null fast path, then dereferenced a null parentNulls in andBits), and iterate each partition's own [lastOffset, offset) range. Leaf flushers now take parentNulls directly; flushSingleConstantVector also takes the per-partition parent-null counts. flushNulls: replace the stub that always wrote hasNulls=0 with a real implementation that compacts per-batch validity over parentNulls into a per-partition bitmap sized to context.rowCounts, handling FLAT and CONSTANT (including null constants) for both top-level and nested columns. Removes the now-unused flushSimpleVectorNulls/flushConstantVectorNulls. Add multiLevelNullsMultiAppendMultiPartition covering nulls at multiple nested ROW levels across several appends and partitions. All 59 tests in PrestoIterativePartitioningSerializerTest pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

yingsu00 and others added 26 commits April 17, 2026 13:12

feat: Add PartitionedRowVector implementation

3853bf6

Signed-off-by: Xin Zhang <xin-zhang2@ibm.com> Alchemy-item: (ID = 1167) Add PartitionedRowVector commit 1/1 - f2af427

refactor: Move initializeCursorPartitionOffsets into partitionFixedWi…

ff2e34b

…dthValuesInPlace

fix: Add bool specialization for partitionFixedWidthValues

875c92c

feat: Add ParitionedConstantVector implementation

6519a8f

Add PartitionedVector benchmark

d8f34b4

feat(PartitionedOutput): Add numNullsPerPartition_ to PartitionedVector

9eafc9d

feat(PartitionedOutput): Add constant support in PrestoIterativeParti…

627bf5d

…tioningSerializer

perf: Introduce OptimizedVectorHasher

b38390f

The new OptimizedVectorHasher is up to 2-3x faster than VectorHasher.

feat(PartitionedOutput): fix test failures caused by listenerFactory

fceb8bc

feat(PartitionedOutput): Improve PartitionedVectorBenchmark

1760e33

BatchMaker always allocates a null buffer even when no rows are null. This commit removes it so benchmarks measure the non-nullable path faithfully. Plus some minor format cleanups.

misc: Clean up OptimizedVectorHasherBenchmark.cpp

aabd0b1

- Remove benchmarks with vector size 1000000 and only keep vector size 10000. - Add tinyint and smallint benchmarks.

feat(PartitionedOutput): Add BufferState to track bytesBuffered

86c1e73

feat(PartitionedOutput): Add outputChannels support

38254ae

perf: Add AVX512 support

ecd87e4

feat(PartitionedOutput): Add dictionary encoding support

c05b0fa

xin-zhang2 added the OptimizedPartitioning label Jun 9, 2026

xin-zhang2 self-assigned this Jun 9, 2026

xin-zhang2 requested a review from yingsu00 June 9, 2026 15:06

ethanyzhang force-pushed the optimized_partitionedoutput branch from cff9eff to 6944d36 Compare June 11, 2026 23:04

ethanyzhang requested a review from majetideepak as a code owner June 11, 2026 23:04

ethanyzhang force-pushed the optimized_partitionedoutput branch from 6944d36 to 974bb09 Compare June 11, 2026 23:17

xin-zhang2 force-pushed the optimized_partitionedoutput branch from 974bb09 to ee25fa7 Compare June 25, 2026 20:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(PartitionedOutput): Add dictionary encoding support#2119

feat(PartitionedOutput): Add dictionary encoding support#2119
xin-zhang2 wants to merge 26 commits into
IBM:optimized_partitionedoutputfrom
xin-zhang2:PartitionedOutput-dict

xin-zhang2 commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

xin-zhang2 commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants