feat: optimize row based shuffle with DenseRow #637

Open
zhangxffff wants to merge 3 commits into bytedance:main from zhangxffff:feat/lazy_deserializer

zhangxffff (Collaborator) commented Jun 9, 2026

What problem does this PR solve?

Row-based shuffle serializes each row with CompactRow, which pads
values to fixed widths, carries a per-row null bitmap, and stores integers at
full width. For the common Spark case of small / narrow / nullable integer
columns this wastes a large fraction of shuffle bytes.

This PR adds DenseRow, a no-waste variable-length row format, and wires it
into the Spark row-based shuffle path behind a pluggable RowFormat option.
The existing CompactRow path is unchanged and remains the default.

Issue Number: close #540

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

DenseRow codec (bolt/row/dense/) — a column-batched serializer, sibling to
CompactRow / UnsafeRowFast, that processes all rows at once. The on-wire
layout is the "dense" form:

  1. variable-length (varint/zigzag) integer values,
  2. nulls fused into the structure bytes — no null bitmap,
  3. no alignment padding,
  4. level-hoisted nesting for ARRAY/MAP/ROW.
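
To illustrate the first item, here is a minimal sketch of zigzag plus varint integer encoding. It assumes a plain LEB128-style varint (7 payload bits per byte) as a stand-in; the function names are hypothetical and the actual wire grammar is the one documented in DenseRow.cpp and IntVarint.h. Plain LEB128 needs up to 10 bytes for a full 64-bit value, while the benchmark reports 9 bytes/row for random bigint, so the PR's codec likely differs in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ZigZag maps signed values to unsigned so that small magnitudes get small
// codes: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
inline uint64_t zigzagEncode(int64_t v) {
  return (static_cast<uint64_t>(v) << 1) ^ static_cast<uint64_t>(v >> 63);
}

inline int64_t zigzagDecode(uint64_t u) {
  // (~(u & 1) + 1) is 0 for even codes and all ones for odd codes.
  return static_cast<int64_t>((u >> 1) ^ (~(u & 1) + 1));
}

// Assumed LEB128-style varint: 7 payload bits per byte, and the high bit
// set means "more bytes follow".
inline size_t varintSize(uint64_t u) {
  size_t n = 1;
  while (u >= 0x80) {
    u >>= 7;
    ++n;
  }
  return n;
}

inline void appendVarint(std::vector<uint8_t>& out, uint64_t u) {
  while (u >= 0x80) {
    out.push_back(static_cast<uint8_t>((u & 0x7F) | 0x80));
    u >>= 7;
  }
  out.push_back(static_cast<uint8_t>(u));
}
```

With this scheme a value such as -3 zigzags to 5 and occupies a single byte, which is the effect the size table relies on for small-magnitude columns.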

Implementation is split into:

  • IntVarint.h — varint/zigzag + nullable-int wire codec, with a SIMD
    (xsimd/AVX2) batched size pass and a BMI2 (pdep) varint writer for long
    values; scalar fast path otherwise.
  • DenseRowScalar*.cpp — flat fast path (identity / constant / dictionary
    decoded inputs).
  • DenseRowGeneral*.cpp — general path for arbitrary nested typed data via a two-pass
    (size, then write) slot-tree walk shared by both passes.
  • DenseRow.{h,cpp} — public API mirroring CompactRow (rowSizes() /
    serialize() / deserialize()), grammar documented at the top of the .cpp.
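
The two-pass (size, then write) structure mentioned above can be sketched as follows. This is an illustration only, assuming flat unsigned 64-bit columns and a LEB128-style varint as the per-value codec; `valueSize`, `writeValue`, `SerializedRows`, and `serializeTwoPass` are hypothetical names, not the PR's API, and the real code walks a slot tree for nested types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-value codec (LEB128 varint) standing in for the real one.
inline size_t valueSize(uint64_t u) {
  size_t n = 1;
  while (u >= 0x80) {
    u >>= 7;
    ++n;
  }
  return n;
}

inline uint8_t* writeValue(uint8_t* out, uint64_t u) {
  while (u >= 0x80) {
    *out++ = static_cast<uint8_t>((u & 0x7F) | 0x80);
    u >>= 7;
  }
  *out++ = static_cast<uint8_t>(u);
  return out;
}

struct SerializedRows {
  std::vector<size_t> offsets;  // offsets[r] = start of row r; back() = total
  std::vector<uint8_t> bytes;   // all row bodies, back to back
};

// columns[c][r] is the value of column c at row r; all columns have the
// same length.
inline SerializedRows serializeTwoPass(
    const std::vector<std::vector<uint64_t>>& columns) {
  const size_t numRows = columns.empty() ? 0 : columns[0].size();

  // Pass 1 (size): go column by column and accumulate each row's size.
  std::vector<size_t> rowSizes(numRows, 0);
  for (const auto& col : columns) {
    for (size_t r = 0; r < numRows; ++r) {
      rowSizes[r] += valueSize(col[r]);
    }
  }

  // A prefix sum turns sizes into offsets so the buffer is allocated once.
  SerializedRows result;
  result.offsets.assign(numRows + 1, 0);
  for (size_t r = 0; r < numRows; ++r) {
    result.offsets[r + 1] = result.offsets[r] + rowSizes[r];
  }
  result.bytes.resize(result.offsets.back());

  // Pass 2 (write): same traversal order, each row has its own cursor.
  std::vector<uint8_t*> cursor(numRows);
  for (size_t r = 0; r < numRows; ++r) {
    cursor[r] = result.bytes.data() + result.offsets[r];
  }
  for (const auto& col : columns) {
    for (size_t r = 0; r < numRows; ++r) {
      cursor[r] = writeValue(cursor[r], col[r]);
    }
  }
  return result;
}
```

Because both passes traverse the data in the same order, the write pass never has to grow or reallocate the output buffer.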

Shuffle integration (bolt/shuffle/sparksql/) — new row::RowFormat enum
({DENSE, COMPACT}) added to ShuffleWriterOptions / ShuffleReaderOptions
(default COMPACT). The columnar↔row converters and the reader/writer dispatch
on it; the wire framing (rowSize | rowData) is unchanged, so only the row body
encoding differs. Writer and reader must be configured with the same format.
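
A rough sketch of why the framing can stay unchanged while the body encoding varies. Only the `RowFormat` enum values and the default come from the description above; `ShuffleOptionsSketch`, `appendFrame`, `splitFrames`, and `checkSameFormat` are hypothetical, and the 4-byte host-order size prefix is an assumption, since the PR does not state the width of `rowSize`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Enum values as named in the PR description; the rest is illustrative.
enum class RowFormat { DENSE, COMPACT };

struct ShuffleOptionsSketch {
  RowFormat rowFormat = RowFormat::COMPACT;  // COMPACT stays the default
};

// Frame layout: rowSize | rowData. Assumed: 4-byte prefix, host byte order.
inline void appendFrame(std::vector<uint8_t>& out,
                        const std::vector<uint8_t>& rowBody) {
  const uint32_t size = static_cast<uint32_t>(rowBody.size());
  uint8_t prefix[sizeof(size)];
  std::memcpy(prefix, &size, sizeof(size));
  out.insert(out.end(), prefix, prefix + sizeof(size));
  out.insert(out.end(), rowBody.begin(), rowBody.end());
}

// Splitting frames needs no knowledge of the body encoding, which is why
// only the body decoder has to dispatch on RowFormat.
inline std::vector<std::vector<uint8_t>> splitFrames(
    const std::vector<uint8_t>& in) {
  std::vector<std::vector<uint8_t>> rows;
  size_t pos = 0;
  while (pos < in.size()) {
    uint32_t size = 0;
    if (in.size() - pos < sizeof(size)) {
      throw std::runtime_error("truncated size prefix");
    }
    std::memcpy(&size, in.data() + pos, sizeof(size));
    pos += sizeof(size);
    if (in.size() - pos < size) {
      throw std::runtime_error("truncated row body");
    }
    const auto first = in.begin() + static_cast<std::ptrdiff_t>(pos);
    rows.emplace_back(first, first + static_cast<std::ptrdiff_t>(size));
    pos += size;
  }
  return rows;
}

// The row bytes carry no format tag in this sketch, so a mismatch has to be
// caught from configuration.
inline void checkSameFormat(const ShuffleOptionsSketch& writer,
                            const ShuffleOptionsSketch& reader) {
  if (writer.rowFormat != reader.rowFormat) {
    throw std::invalid_argument("writer and reader RowFormat differ");
  }
}
```

If the real frames likewise carry no format tag, a mismatched reader would misinterpret the body rather than fail cleanly, which is consistent with the requirement that both sides be configured identically.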

Tests: DenseRowTest.cpp (17 cases: scalars, wide/mixed rows, bigint &
hugeint edges, arrays / nested arrays / maps, empty containers, dictionary &
constant encoded inputs, non-contiguous serialize offsets, and golden-byte wire
assertions). The shuffle matrix test is extended to exercise the DENSE format.

Benchmark: DenseRowSerializeBenchmark.cpp compares DenseRow against
UnsafeRow/CompactRow for serialize, deserialize, and serialized size across 20
cases.

Benchmark Result

Setup: Intel i7-12700KF, -O3 with AVX2+BMI2, VectorFuzzer vectorSize=1000,
pinned core. All times are ns/row; sizes are bytes/row. DenseRow is compared
against the two existing Spark row formats (UnsafeRow, CompactRow).

Highlights

  • Size: up to 9× smaller than CompactRow on small ints (bigint < 2⁸: 9→1 bytes/row) and
    about 5× smaller on small-int arrays and nested arrays. On par for full-random int64 and wide strings.
  • Deserialize: 2.7–2.8× faster than CompactRow on multi-column flat rows
    (10 scalars: 89→32 ns), and faster on every flat scalar case except
    bigint random + 40% null (5.4 vs 5.6 ns).
  • Trade-off: container serialize is slower than CompactRow, which uses a
    bulk copy for array/map. DenseRow trades a higher encode cost for a smaller payload.

Serialized size (bytes/row)

| case | UnsafeRow | CompactRow | DenseRow | vs Compact |
|---|---|---|---|---|
| bigint, \|v\| < 2⁸ | 16 | 9 | 1 | 9.0× |
| bigint, \|v\| < 2³² | 16 | 9 | 4 | 2.3× |
| bigint, random | 16 | 9 | 9 | 1.0× |
| bigint, random + 40% null | 16 | 9 | 6 | 1.5× |
| double | 16 | 9 | 8 | 1.1× |
| string(8) | 24 | 13 | 9 | 1.4× |
| string(100) | 120 | 105 | 101 | 1.04× |
| 5 scalars | 48 | 23 | 23 | 1.0× |
| 5 scalars, small int | 48 | 23 | 16 | 1.4× |
| 10 scalars | 88 | 50 | 54 | 0.9× |
| 10 scalars, small int | 88 | 50 | 31 | 1.6× |
| array | 116 | 91 | 101 | 0.9× |
| array⟨bigint < 2⁸⟩ | 81 | 56 | 11 | 5.1× |
| array⟨bigint < 2³²⟩ | 78 | 53 | 30 | 1.8× |
| nested array | 265 | 135 | 63 | 2.1× |
| nested⟨bigint < 2⁸⟩ | 368 | 272 | 53 | 5.1× |
| nested⟨bigint < 2³²⟩ | 351 | 259 | 133 | 1.9× |
| map | 177 | 132 | 136 | 1.0× |
| map⟨bigint < 2⁸⟩ | 127 | 82 | 32 | 2.6× |
| map⟨bigint < 2³²⟩ | 123 | 78 | 50 | 1.6× |

Note

This benchmark uses the columnar CompactRow serde interface. Row-based shuffle still uses the row-based CompactRow serde interface, so the improvement over CompactRow should be even greater in real row-based shuffle scenarios. CompactRow serializes arrays and maps with a bulk copy, so it is faster than DenseRow for these types.

Serialize / deserialize time (ns/row; C = CompactRow, D = DenseRow)

| case | ser C | ser D | deser C | deser D | round-trip |
|---|---|---|---|---|---|
| bigint < 2⁸ | 1.8 | 3.6 | 5.5 | 2.4 | 1.2× faster |
| bigint < 2³² | 1.8 | 4.5 | 4.9 | 4.2 | 0.8× |
| bigint random | 1.8 | 4.1 | 5.0 | 4.1 | 0.8× |
| bigint random + null | 2.4 | 4.4 | 5.4 | 5.6 | 0.8× |
| double | 1.7 | 2.3 | 5.5 | 1.8 | 1.8× faster |
| string(8) | 7.7 | 6.8 | 8.3 | 6.4 | 1.2× faster |
| string(100) | 9.9 | 8.1 | 10.5 | 8.9 | 1.2× faster |
| 5 scalars | 7.1 | 8.3 | 27.4 | 10.2 | 1.9× faster |
| 5 scalars, small | 6.7 | 7.6 | 25.7 | 7.6 | 2.1× faster |
| 10 scalars | 17.8 | 29.0 | 89.2 | 31.5 | 1.8× faster |
| 10 scalars, small | 17.9 | 16.6 | 65.1 | 15.9 | 2.6× faster |
| array | 16.4 | 99.1 | 57.0 | 76.5 | 0.4× |
| array⟨bigint < 2⁸⟩ | 12.3 | 22.2 | 21.5 | 40.4 | 0.5× |
| array⟨bigint < 2³²⟩ | 11.9 | 29.2 | 22.4 | 43.6 | 0.5× |
| nested array | 142.4 | 142.7 | 84.0 | 111.2 | 0.9× |
| nested⟨bigint < 2⁸⟩ | 92.1 | 186.5 | 143.3 | 221.0 | 0.6× |
| nested⟨bigint < 2³²⟩ | 87.4 | 174.5 | 103.0 | 284.2 | 0.4× |
| map | 25.8 | 97.1 | 103.3 | 91.0 | 0.7× |
| map⟨bigint < 2⁸⟩ | 21.8 | 28.2 | 45.8 | 52.5 | 0.8× |
| map⟨bigint < 2³²⟩ | 21.7 | 32.7 | 47.4 | 57.1 | 0.8× |
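
The round-trip column is not defined in the text. The reported values are consistent with the ratio of summed serialize and deserialize times, CompactRow over DenseRow, which is an inference from the numbers rather than something the PR states. A small sketch of that calculation, with a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Assumed definition: (ser C + deser C) / (ser D + deser D), all in ns/row.
// A result above 1 means DenseRow is faster over the full round trip.
inline double roundTripSpeedup(double serCompact, double deserCompact,
                               double serDense, double deserDense) {
  return (serCompact + deserCompact) / (serDense + deserDense);
}
```

For the "10 scalars" row this gives (17.8 + 89.2) / (29.0 + 31.5) ≈ 1.77, which rounds to the reported 1.8×; for "array" it gives 73.4 / 175.6 ≈ 0.42, matching the reported 0.4×.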

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:

- Fixed a crash in `substr` when input is null.
- optimized `group by` performance by 20%.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)


@zhangxffff zhangxffff marked this pull request as draft June 9, 2026 14:20
@zhangxffff zhangxffff marked this pull request as ready for review June 23, 2026 12:24
const uint8_t*& in,
const uint8_t* end,
bool& isNull,
int64_t& value) {


Check the bounds of `in` before accessing `*in`:
if (FOLLY_UNLIKELY(in >= end)) { return false; }

forEachLivePos(out, r, [&](uint32_t p) {
uint64_t v{0};
BOLT_USER_CHECK(
readVarint(c.cur, c.end, v),


Do we need to check c.cur < c.end here?
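
Both review comments ask for a bounds check before each dereference. One way such a check could look is sketched below, assuming a LEB128-style varint; `readVarintChecked` is a hypothetical name, and the PR's own `readVarint` may use a different wire layout and may already perform this check internally.

```cpp
#include <cassert>
#include <cstdint>

// Decodes one varint from [cur, end). Returns false on truncated or
// over-long input and leaves `cur` and `value` untouched in that case.
inline bool readVarintChecked(const uint8_t*& cur, const uint8_t* end,
                              uint64_t& value) {
  uint64_t result = 0;
  const uint8_t* p = cur;
  for (int shift = 0; shift < 64; shift += 7) {
    if (p >= end) {
      return false;  // never dereference at or past `end`
    }
    const uint8_t byte = *p++;
    result |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      cur = p;  // commit the cursor only after a complete value
      value = result;
      return true;
    }
  }
  return false;  // continuation bit still set after 10 bytes: malformed
}
```

Returning false instead of reading past the buffer lets the caller turn corrupt or truncated shuffle data into a user-facing error, for example through the `BOLT_USER_CHECK` already wrapped around the call.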
