Open
98 commits
6094937
wip
May 16, 2026
913688c
wip2
May 16, 2026
dccaad3
wip
May 20, 2026
ac4b582
commit again
Jun 1, 2026
92937d7
extract function-specific logics from aggregator
Jun 1, 2026
ad04ebc
refactor slot group logics
Jun 1, 2026
921e377
push down aggr func ir codegen from hashaggrjit to separate aggr func…
Jun 1, 2026
0aa3cf4
temporarily disable extractValues jit
Jun 3, 2026
3ea7e2b
modify build version
Jun 3, 2026
ecdfa76
add debug logs
Jun 4, 2026
4be72fe
fix for AggregateCompanionAdapter
Jun 4, 2026
aaa2ed6
fix for AggregateCompanionAdapter
Jun 4, 2026
93ea366
add debug logs for llvm ir
Jun 4, 2026
3485ac7
support decimal input for avg_partial and sum_partial
Jun 4, 2026
6c25ffb
push down jit implements from xxxbase to xxx
Jun 5, 2026
c558666
remove config
Jun 6, 2026
b958315
support bool input for max/min/count
Jun 6, 2026
674df41
support decimal input for max/min/count
Jun 6, 2026
ac4026c
split addrawinput and addintermediateresults for hash aggr jit
Jun 6, 2026
8017cc5
fix bugs when extractxxx are executed in non-JIT way and numNulls_ is…
Jun 7, 2026
ca78493
support exactxxx in jit
Jun 7, 2026
741ae93
reset log level for jit
Jun 7, 2026
f947804
remove helper function from jit for dict/flat/const encodings
Jun 8, 2026
e360bb8
add performance analysis after helper function opt
Jun 8, 2026
862e32d
enable jit symbols in perf
Jun 8, 2026
a24ff03
optimize the helper function in extractxxx
Jun 8, 2026
be0e666
update performance report
Jun 9, 2026
dc9687d
optimize performance of partial_avg_extract
Jun 9, 2026
9b766b1
update doc
Jun 10, 2026
9dcdbdd
remove hard-coded offset
Jun 10, 2026
266b38c
remove update numNulls_ in jit execution way
Jun 10, 2026
3a73376
reuse HashAggrJitDescriptor struct in HashAggrJitSlot
Jun 10, 2026
2363a20
refactor to speed up compile
Jun 10, 2026
061ba6d
update doc
Jun 10, 2026
c15f256
remove more helper in ir
Jun 10, 2026
cb8e15a
remove helper function jit_HashAggrSetPartialAvgDouble
Jun 11, 2026
804fac9
minor refactor to improve HashAggrJitChunk
Jun 11, 2026
7ef69cd
cache function name in HashAggrJitChunk
Jun 11, 2026
e693583
remove vlog in hot path
Jun 11, 2026
28535ed
add more benchmark cases
Jun 11, 2026
93ddddd
support decimal_avg final extract
Jun 11, 2026
ffd029c
fix benchmark crash by skip decimal avg final extract
Jun 11, 2026
284f2b6
add more cases in benchmark
Jun 11, 2026
b2fbef3
fix final extract crash in decimal avg
Jun 11, 2026
25d83d6
add more bench cases
Jun 11, 2026
8040e5a
decimal sum merge: fast-path optimization for row-field input
Jun 11, 2026
121d1f7
add refactor plan doc
Jun 11, 2026
c027478
update plan doc
Jun 11, 2026
fc44095
refactor input and output
Jun 11, 2026
921da8b
fix style
Jun 12, 2026
c0619e7
minor refactors
Jun 12, 2026
b73168d
refactor hash aggr jit adapters and remove dead builtins: Replace dec…
Jun 12, 2026
f5bb503
remove useless declarations
Jun 12, 2026
2eab5b8
push down input/output shape from framework to individual aggregate f…
Jun 12, 2026
6ef9318
decouple hash aggr jit row runtime binding from aggregate kinds
Jun 12, 2026
05c4b17
decouple JitDecimalSumState and JitDecimalAvgState
Jun 12, 2026
62fbb70
remove unnecessary code
Jun 12, 2026
0af777a
fix benchmark crash caused by inconsistent accumulator kind and actua…
Jun 13, 2026
10d8047
remove useless checkinputnulls in llvm ir codegen
Jun 13, 2026
b327d62
remove useless code
Jun 13, 2026
b9c3aee
remove useless get null of row field
Jun 13, 2026
d537d39
separate decimal_sum decimal_avg operations into xxxops
Jun 13, 2026
fca0131
remove unnecessary namespace prefix
Jun 13, 2026
5b44abc
reusing sumcount in both non-jit and jit way
Jun 13, 2026
e643c3d
reusing accumulate structures in both non-jit and jit way
Jun 13, 2026
0389f8b
fix code style
Jun 13, 2026
3faff33
rename runHashAggrJitChunks
Jun 13, 2026
e295039
minor refactor: split extract partial output and final output
Jun 13, 2026
d1e436c
refactor: remove canExtract & support bool/int128 for extract
Jun 13, 2026
daae495
add review document
Jun 13, 2026
88dc2bb
fix short/long decimal inconsistency in decimal avg/sum
Jun 13, 2026
671be25
add review report
Jun 13, 2026
34df9b8
fix failed uts
Jun 13, 2026
64ae0c3
part1: refactor HashAggrJitDescriptor
Jun 14, 2026
35f2b4c
part2: refactor HashAggrJitDescriptor
Jun 14, 2026
026d6e3
part3: refine HashAggrJitChunk function naming
Jun 14, 2026
5ad5516
part4: localize HashAggrJit scratch buffers
Jun 14, 2026
d199197
add more benchmark cases
Jun 15, 2026
57bcf43
fix failed uts
Jun 15, 2026
ebf11e7
enable not emit frame pointer to use perf
Jun 15, 2026
a6ff429
enabled profiling jit with intel vtune
Jun 15, 2026
64f7beb
fix diff in decimal sum
Jun 15, 2026
380fd84
fix diff in decimal avg
Jun 15, 2026
5bff5c5
add test config
Jun 15, 2026
89fe122
fix another decimal sum bug
Jun 15, 2026
7fc6b85
change hash aggr jit log level
Jun 16, 2026
c6acb2c
don't rely on numNulls_ when hash aggr jit is enabled
Jun 17, 2026
2828261
add more metrics about hash aggr jit
Jun 17, 2026
34c7bd4
hash aggr jit: compile chunks asynchronously in parallel
Jun 17, 2026
cec3198
remove useless ir dump and function verify which could reduce jit com…
Jun 17, 2026
f02eda5
remove useless docs
Jun 17, 2026
24cc7fe
fix tidy
Jun 21, 2026
3f4a043
fix diff
Jun 22, 2026
aa589fc
fix diff caused by basevector::numNulls_
Jun 22, 2026
62c8f52
remove useless codes
Jun 22, 2026
225055b
fix aggJitCodegenTimeNs zero issue
Jun 22, 2026
4e5e493
remove useless logs
Jun 22, 2026
23390a0
remove useless logs
Jun 22, 2026
10 changes: 10 additions & 0 deletions CMakeLists.txt
@@ -458,6 +458,16 @@ if(${CMAKE_SYSTEM_PROCESSOR} MATCHES "aarch64")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsigned-char")
endif()

# Keep frame pointers and avoid sibling-call optimization so perf can fully
# unwind stacks with the cheap frame-pointer call-graph (no DWARF needed).
# Off by default; enable with -DBOLT_ENABLE_FRAME_POINTER=ON when profiling.
option(BOLT_ENABLE_FRAME_POINTER
"Preserve frame pointers for perf stack unwinding" OFF)
if(BOLT_ENABLE_FRAME_POINTER)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -fno-optimize-sibling-calls")
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -fno-optimize-sibling-calls")
endif()

# Under Ninja, we are able to designate certain targets large enough to require restricted
# parallelism.
if("${MAX_HIGH_MEM_JOBS}")
19 changes: 19 additions & 0 deletions Makefile
@@ -83,6 +83,11 @@ BUILD_TYPE=Release
BOLT_BUILD_BENCHMARKS ?= "OFF"
# Control whether to build tests with coverage instrumentation
BOLT_BUILD_TESTING_WITH_COVERAGE ?= "OFF"
# Control whether to keep frame pointers (for perf stack unwinding)
BOLT_ENABLE_FRAME_POINTER ?= "OFF"
# Control whether to report JIT symbols to Intel VTune (jitprofiling)
BOLT_ENABLE_VTUNE_JIT ?= "OFF"
VTUNE_SDK_DIR ?=
# -----------------------------------------------------------------

# TODO: remove `BUILD_USER` and `BUILD_CHANNEL`
@@ -204,6 +209,9 @@ conan_build: conan_install
NUM_THREADS=$(NUM_THREADS) \
BOLT_BUILD_BENCHMARKS=${BOLT_BUILD_BENCHMARKS} \
BOLT_BUILD_TESTING_WITH_COVERAGE=${BOLT_BUILD_TESTING_WITH_COVERAGE} \
BOLT_ENABLE_FRAME_POINTER=${BOLT_ENABLE_FRAME_POINTER} \
BOLT_ENABLE_VTUNE_JIT=${BOLT_ENABLE_VTUNE_JIT} \
VTUNE_SDK_DIR=${VTUNE_SDK_DIR} \
conan build ../.. --name=bolt --version=${BUILD_VERSION} --user=${BUILD_USER} --channel=${BUILD_CHANNEL} \
-s llvm-core/*:build_type=Release \
-s "&:build_type=${BUILD_TYPE}" \
@@ -306,6 +314,17 @@ benchmarks-build-spark:
benchmarks-build-relwithdebinfo:
$(MAKE) conan_build BUILD_TYPE=RelWithDebInfo BOLT_BUILD_BENCHMARKS="ON" CONAN_CONFIG=" -c bolt/*:tools.build:skip_test=False" CONAN_OPTIONS="-o bolt/*:spark_compatible=False -o bolt/*:enable_testutil=True -o bolt/*:enable_perf=True"

# Same as benchmarks-build-spark but keeps frame pointers so perf can unwind
# stacks with the cheap frame-pointer call-graph (no DWARF needed).
benchmarks-build-spark-profile:
$(MAKE) conan_build BUILD_TYPE=Release BOLT_BUILD_BENCHMARKS="ON" BOLT_ENABLE_FRAME_POINTER="ON" CONAN_CONFIG=" -c bolt/*:tools.build:skip_test=False" CONAN_OPTIONS="-o bolt/*:spark_compatible=True -o bolt/*:enable_testutil=True -o bolt/*:enable_perf=True"

# Same as benchmarks-build-spark-profile but also reports JIT symbols to Intel
# VTune (libjitprofiling). Override the SDK path with VTUNE_SDK_DIR=... if it is
# not under the default /opt/intel/oneapi/vtune/2023.2.0/sdk.
benchmarks-build-spark-vtune:
$(MAKE) conan_build BUILD_TYPE=Release BOLT_BUILD_BENCHMARKS="ON" BOLT_ENABLE_FRAME_POINTER="ON" BOLT_ENABLE_VTUNE_JIT="ON" VTUNE_SDK_DIR="${VTUNE_SDK_DIR}" CONAN_CONFIG=" -c bolt/*:tools.build:skip_test=False" CONAN_OPTIONS="-o bolt/*:spark_compatible=True -o bolt/*:enable_testutil=True -o bolt/*:enable_perf=True"

unittest_debug: unittest
unittest: debug_with_test
ctest --test-dir $(BUILD_BASE_DIR)/Debug --timeout 7200 -j $(NUM_THREADS) --output-on-failure
7 changes: 7 additions & 0 deletions bolt/common/base/AggregationStats.h
@@ -28,5 +28,12 @@ struct AggregationStats {
uint64_t aggOutputTimeNs{0};
uint64_t aggProbeBypassTimeNs{0};
uint64_t aggProbeBypassCount{0};
// Hash aggregation JIT fine-grained timing.
// One-time codegen (LLVM compile) time for the JIT plan.
uint64_t aggJitCodegenTimeNs{0};
// JIT-executed part of the agg function update time.
uint64_t aggFunctionJitTimeNs{0};
// JIT-executed part of the extracting groups time.
uint64_t aggExtractGroupsJitTimeNs{0};
};
} // namespace bytedance::bolt::common
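
The three new counters separate one-time codegen cost from time spent inside JIT-compiled code. The diff does not show how they are populated, so the following is only a minimal sketch of one way to do it, assuming a scoped timer that adds elapsed nanoseconds to a counter when it goes out of scope. `JitAggregationStats`, `ScopedNanoTimer`, and `runUpdateKernel` are hypothetical names for illustration; only the three field names come from the diff.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Mirrors only the three JIT counters added to AggregationStats in this diff.
struct JitAggregationStats {
  uint64_t aggJitCodegenTimeNs{0};
  uint64_t aggFunctionJitTimeNs{0};
  uint64_t aggExtractGroupsJitTimeNs{0};
};

// Hypothetical RAII helper: on destruction, adds the elapsed wall time in
// nanoseconds to the counter it was constructed with.
class ScopedNanoTimer {
 public:
  explicit ScopedNanoTimer(uint64_t& target)
      : target_(target), start_(std::chrono::steady_clock::now()) {}

  ~ScopedNanoTimer() {
    const auto elapsed = std::chrono::steady_clock::now() - start_;
    target_ += static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count());
  }

  ScopedNanoTimer(const ScopedNanoTimer&) = delete;
  ScopedNanoTimer& operator=(const ScopedNanoTimer&) = delete;

 private:
  uint64_t& target_;
  std::chrono::steady_clock::time_point start_;
};

// Stand-in for a JIT-compiled update kernel (assumption: it just sums the
// inputs). Only the time spent in this scope is charged to
// aggFunctionJitTimeNs; the other counters are left untouched.
inline int64_t runUpdateKernel(
    const int64_t* values,
    int32_t size,
    JitAggregationStats& stats) {
  assert(values != nullptr && size >= 0);
  ScopedNanoTimer timer(stats.aggFunctionJitTimeNs);
  int64_t sum = 0;
  for (int32_t i = 0; i < size; ++i) {
    sum += values[i];
  }
  return sum;
}
```

Keeping codegen, update, and extract in separate counters makes it possible to tell whether a slow query is paying for compilation or for the generated code itself.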
18 changes: 18 additions & 0 deletions bolt/core/QueryConfig.h
@@ -652,6 +652,12 @@ class QueryConfig {
*/
static constexpr const char* kJitLevel = "jit.level";

static constexpr const char* kHashAggrJitEnabled = "hashaggr.jit.enabled";
static constexpr const char* kHashAggrJitMinFuseWidth =
"hashaggr.jit.min_fuse_width";
static constexpr const char* kHashAggrJitMaxFuseWidth =
"hashaggr.jit.max_fuse_width";

// expired, to deleted later
static constexpr const char* kBoltJitEnabled = "bolt.jit.enabled";
// For morsel-driven Bolt
@@ -1606,6 +1612,18 @@ class QueryConfig {
return flag & 1;
}

bool enableHashAggrJit() const {
return get<bool>(kHashAggrJitEnabled, true);
}

int32_t hashAggrJitMinFuseWidth() const {
return get<int32_t>(kHashAggrJitMinFuseWidth, 1);
}

int32_t hashAggrJitMaxFuseWidth() const {
return get<int32_t>(kHashAggrJitMaxFuseWidth, 16);
}

int exceptionTraceLevel() const {
return get<int>(kExceptionTraceLevel, 1);
}
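
For reviewers checking the defaults: with no overrides, the getters above return `true`, `1`, and `16`. The sketch below reproduces that lookup-with-default behaviour in a self-contained form. `MiniQueryConfig` and `hasValidFuseWidths` are hypothetical stand-ins (the real `QueryConfig::get<T>` is not shown in this diff), and the validity check is an assumption about what a sensible configuration looks like, not something the PR enforces.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for QueryConfig: string-keyed settings with typed
// getters that fall back to the defaults used in the diff.
class MiniQueryConfig {
 public:
  static constexpr const char* kHashAggrJitEnabled = "hashaggr.jit.enabled";
  static constexpr const char* kHashAggrJitMinFuseWidth =
      "hashaggr.jit.min_fuse_width";
  static constexpr const char* kHashAggrJitMaxFuseWidth =
      "hashaggr.jit.max_fuse_width";

  explicit MiniQueryConfig(
      std::unordered_map<std::string, std::string> values = {})
      : values_(std::move(values)) {}

  bool enableHashAggrJit() const {
    const auto it = values_.find(kHashAggrJitEnabled);
    return it == values_.end() ? true : it->second == "true";
  }

  int32_t hashAggrJitMinFuseWidth() const {
    return getInt(kHashAggrJitMinFuseWidth, 1);
  }

  int32_t hashAggrJitMaxFuseWidth() const {
    return getInt(kHashAggrJitMaxFuseWidth, 16);
  }

 private:
  // Returns the parsed value for `key`, or `defaultValue` when it is unset.
  int32_t getInt(const char* key, int32_t defaultValue) const {
    assert(key != nullptr);
    const auto it = values_.find(key);
    return it == values_.end() ? defaultValue
                               : static_cast<int32_t>(std::stoi(it->second));
  }

  std::unordered_map<std::string, std::string> values_;
};

// Assumed sanity rule: at least one aggregate per fused unit, and min <= max.
inline bool hasValidFuseWidths(const MiniQueryConfig& config) {
  const int32_t minWidth = config.hashAggrJitMinFuseWidth();
  const int32_t maxWidth = config.hashAggrJitMaxFuseWidth();
  return minWidth >= 1 && minWidth <= maxWidth;
}
```

Note that the JIT path is enabled by default here, so a session has to set `hashaggr.jit.enabled` to `false` explicitly to opt out.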
23 changes: 23 additions & 0 deletions bolt/exec/Aggregate.cpp
@@ -321,4 +321,27 @@ void Aggregate::clearInternal() {
numNulls_ = 0;
}

#ifdef ENABLE_BOLT_JIT
bool Aggregate::supportsHashAggrJit(
const jit::HashAggrJitPlanContext& /*context*/) const {
return false;
}

std::optional<jit::HashAggrJitDescriptor> Aggregate::createHashAggrJitDescriptor(
const jit::HashAggrJitPlanContext& /*context*/) const {
return std::nullopt;
}

jit::HashAggrJitSlot Aggregate::createHashAggrJitSlot(
int32_t aggregateIndex,
const jit::HashAggrJitDescriptor& descriptor) const {
return jit::HashAggrJitSlot{
.aggregateIndex = aggregateIndex,
.offset = accumulatorOffset(),
.nullByte = accumulatorNullByte(),
.nullMask = accumulatorNullMask(),
.desc = descriptor};
}
#endif

} // namespace bytedance::bolt::exec
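
The base-class defaults above make JIT support opt-in: an aggregate that overrides nothing reports `false` and `std::nullopt`, so a planner can skip it. A reduced sketch of that pattern follows. `PlanContext`, `Descriptor`, `Slot`, `SumAggregate`, and `planSlot` are hypothetical simplifications of the `jit::` types and callers, which are not shown in this diff; positional aggregate initialization is used instead of designated initializers so the sketch also compiles before C++20.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical, heavily reduced stand-ins for the jit:: types in the diff.
struct PlanContext {
  bool rawInput;
};
struct Descriptor {
  int32_t kind;
};
struct Slot {
  int32_t aggregateIndex;
  int32_t offset;
  int32_t nullByte;
  uint8_t nullMask;
  Descriptor desc;
};

class AggregateBase {
 public:
  AggregateBase(int32_t offset, int32_t nullByte, uint8_t nullMask)
      : offset_(offset), nullByte_(nullByte), nullMask_(nullMask) {}
  virtual ~AggregateBase() = default;

  // Defaults: no JIT support unless a subclass opts in.
  virtual bool supportsJit(const PlanContext& /*context*/) const {
    return false;
  }
  virtual std::optional<Descriptor> createDescriptor(
      const PlanContext& /*context*/) const {
    return std::nullopt;
  }

  // Non-virtual: the slot always carries this aggregate's row layout.
  Slot createSlot(int32_t aggregateIndex, const Descriptor& desc) const {
    return Slot{aggregateIndex, offset_, nullByte_, nullMask_, desc};
  }

 private:
  int32_t offset_;
  int32_t nullByte_;
  uint8_t nullMask_;
};

// Hypothetical aggregate that opts in for raw input only.
class SumAggregate : public AggregateBase {
 public:
  using AggregateBase::AggregateBase;

  bool supportsJit(const PlanContext& context) const override {
    return context.rawInput;
  }
  std::optional<Descriptor> createDescriptor(
      const PlanContext& context) const override {
    if (!supportsJit(context)) {
      return std::nullopt;
    }
    return Descriptor{1};
  }
};

// Hypothetical planner step: a slot exists only for aggregates that opt in.
inline std::optional<Slot> planSlot(
    const AggregateBase& aggregate,
    int32_t aggregateIndex,
    const PlanContext& context) {
  assert(aggregateIndex >= 0);
  const auto desc = aggregate.createDescriptor(context);
  if (!desc.has_value()) {
    return std::nullopt;
  }
  return aggregate.createSlot(aggregateIndex, *desc);
}
```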
74 changes: 62 additions & 12 deletions bolt/exec/Aggregate.h
@@ -33,12 +33,17 @@
#include <folly/CPortability.h>
#include <folly/Synchronized.h>

#include <optional>

#include "bolt/common/memory/HashStringAllocator.h"
#include "bolt/core/PlanNode.h"
#include "bolt/core/QueryConfig.h"
#include "bolt/exec/AggregateUtil.h"
#include "bolt/expression/FunctionSignature.h"
#include "bolt/functions/InlineFlatten.h"
#ifdef ENABLE_BOLT_JIT
#include "bolt/jit/aggregation/HashAggrJitTypes.h"
#endif
#include "bolt/vector/BaseVector.h"
namespace bytedance::bolt::core {
class ExpressionEvaluator;
@@ -66,6 +71,18 @@ class Aggregate {
return resultType_;
}

int32_t accumulatorOffset() const {
return offset_;
}

int32_t accumulatorNullByte() const {
return nullByte_;
}

uint8_t accumulatorNullMask() const {
return nullMask_;
}

// Returns the fixed number of bytes the accumulator takes on a group
// row. Variable width accumulators will reference the variable
// width part of the state from the fixed part.
@@ -100,6 +117,18 @@ class Aggregate {
return false;
}

#ifdef ENABLE_BOLT_JIT
virtual bool supportsHashAggrJit(
const jit::HashAggrJitPlanContext& context) const;

virtual std::optional<jit::HashAggrJitDescriptor> createHashAggrJitDescriptor(
const jit::HashAggrJitPlanContext& context) const;

jit::HashAggrJitSlot createHashAggrJitSlot(
int32_t aggregateIndex,
const jit::HashAggrJitDescriptor& descriptor) const;
#endif

void setAllocator(HashStringAllocator* allocator) {
setAllocatorInternal(allocator);
pool_ = allocator->pool();
@@ -133,6 +162,10 @@ class Aggregate {
setOffsetsInternal(offset, nullByte, nullMask, rowSizeOffset);
}

void markNullCountUnknown() {
numNulls_ = std::nullopt;
}

// Initializes null flags and accumulators for newly encountered groups. This
// function should be called only once for each group.
//
@@ -360,7 +393,15 @@ class Aggregate {
   }

   bool isNull(char* group) const {
-    return numNulls_ && (group[nullByte_] & nullMask_);
+    return mayHaveNulls() && (group[nullByte_] & nullMask_);
+  }
+
+  bool hasNoNulls() const {
+    return numNulls_.has_value() && *numNulls_ == 0;
+  }
+
+  bool mayHaveNulls() const {
+    return !numNulls_.has_value() || *numNulls_ > 0;
   }

   // Sets null flag for all specified groups to true.
@@ -369,26 +410,33 @@ class Aggregate {
     for (auto i : indices) {
       groups[i][nullByte_] |= nullMask_;
     }
-    numNulls_ += indices.size();
+    if (numNulls_.has_value()) {
+      *numNulls_ += indices.size();
+    }
   }

   inline bool setNull(char* group) {
     if (group[nullByte_] & nullMask_) {
       return false;
     }
     group[nullByte_] |= nullMask_;
-    ++numNulls_;
+    if (numNulls_.has_value()) {
+      ++*numNulls_;
+    }
     return true;
   }

   inline bool clearNull(char* group) {
-    if (numNulls_) {
-      uint8_t mask = group[nullByte_];
-      if (mask & nullMask_) {
-        group[nullByte_] = mask & ~nullMask_;
-        --numNulls_;
-        return true;
+    if (!mayHaveNulls()) {
+      return false;
+    }
+    uint8_t mask = group[nullByte_];
+    if (mask & nullMask_) {
+      group[nullByte_] = mask & ~nullMask_;
+      if (numNulls_.has_value()) {
+        --*numNulls_;
       }
+      return true;
     }
     return false;
   }
@@ -449,9 +497,11 @@ class Aggregate {
   int32_t rowSizeOffset_ = 0;

   // Number of null accumulators in the current state of the aggregation
-  // operator for this aggregate. If 0, clearing the null as part of update
-  // is not needed.
-  uint64_t numNulls_ = 0;
+  // operator for this aggregate.
+  // - 0 => known that no group is null
+  // - N > 0 => known exact null count
+  // - nullopt => unknown; must rely on per-group null bit
+  std::optional<uint64_t> numNulls_{0};
   HashStringAllocator* allocator_{nullptr};
   memory::MemoryPool* pool_{nullptr};
   std::shared_ptr<core::ExpressionEvaluator> expressionEvaluator_{nullptr};
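
The move from `uint64_t` to `std::optional<uint64_t>` gives the counter a third state. Judging by the commit messages ("remove update numNulls_ in jit execution way", "don't rely on numNulls_ when hash aggr jit is enabled"), the motivation appears to be that JIT-generated code flips per-group null bits without touching the counter, so a cached `0` could wrongly short-circuit `isNull`. The sketch below illustrates that failure mode and the fix in isolation. `NullFlagTracker` and `jitSetNullBit` are hypothetical names; the method bodies follow the diff, with explicit casts added.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical extraction of the null-tracking members of Aggregate.
class NullFlagTracker {
 public:
  NullFlagTracker(int32_t nullByte, uint8_t nullMask)
      : nullByte_(nullByte), nullMask_(nullMask) {
    assert(nullByte >= 0 && nullMask != 0);
  }

  // After this call the per-group null bit is the only source of truth.
  void markNullCountUnknown() { numNulls_ = std::nullopt; }

  bool hasNoNulls() const { return numNulls_.has_value() && *numNulls_ == 0; }
  bool mayHaveNulls() const { return !numNulls_.has_value() || *numNulls_ > 0; }

  bool isNull(const char* group) const {
    return mayHaveNulls() && (group[nullByte_] & nullMask_) != 0;
  }

  bool setNull(char* group) {
    const uint8_t mask = static_cast<uint8_t>(group[nullByte_]);
    if (mask & nullMask_) {
      return false;
    }
    group[nullByte_] = static_cast<char>(mask | nullMask_);
    if (numNulls_.has_value()) {
      ++*numNulls_;
    }
    return true;
  }

  bool clearNull(char* group) {
    if (!mayHaveNulls()) {
      return false;
    }
    const uint8_t mask = static_cast<uint8_t>(group[nullByte_]);
    if (mask & nullMask_) {
      group[nullByte_] = static_cast<char>(mask & ~nullMask_);
      if (numNulls_.has_value()) {
        --*numNulls_;
      }
      return true;
    }
    return false;
  }

 private:
  int32_t nullByte_;
  uint8_t nullMask_;
  std::optional<uint64_t> numNulls_{0};
};

// Stand-in for JIT-generated code (assumption): it writes the null bit
// directly into the row and never updates the tracker's counter.
inline void jitSetNullBit(char* group, int32_t nullByte, uint8_t nullMask) {
  group[nullByte] =
      static_cast<char>(static_cast<uint8_t>(group[nullByte]) | nullMask);
}
```

With a tracker whose count is still the known value `0`, a bit written by `jitSetNullBit` is invisible to `isNull`. Once `markNullCountUnknown()` has been called, `isNull` and `clearNull` always consult the row, at the cost of losing the cheap "no nulls anywhere" shortcut.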
15 changes: 15 additions & 0 deletions bolt/exec/AggregateCompanionAdapter.cpp
@@ -35,13 +35,15 @@
#include "bolt/exec/RowContainer.h"
#include "bolt/expression/SignatureBinder.h"
#include "bolt/functions/lib/aggregates/AggregateToIntermediate.h"

namespace bytedance::bolt::exec {

void AggregateCompanionFunctionBase::setOffsetsInternal(
int32_t offset,
int32_t nullByte,
uint8_t nullMask,
int32_t rowSizeOffset) {
Aggregate::setOffsetsInternal(offset, nullByte, nullMask, rowSizeOffset);
fn_->setOffsets(offset, nullByte, nullMask, rowSizeOffset);
}

@@ -65,6 +67,19 @@ bool AggregateCompanionFunctionBase::supportsToIntermediate() const {
return fn_->supportsToIntermediate();
}

#ifdef ENABLE_BOLT_JIT
bool AggregateCompanionFunctionBase::supportsHashAggrJit(
const jit::HashAggrJitPlanContext& context) const {
return fn_->supportsHashAggrJit(rewriteHashAggrJitContext(context));
}

std::optional<jit::HashAggrJitDescriptor>
AggregateCompanionFunctionBase::createHashAggrJitDescriptor(
const jit::HashAggrJitPlanContext& context) const {
return fn_->createHashAggrJitDescriptor(rewriteHashAggrJitContext(context));
}
#endif

bool AggregateCompanionFunctionBase::supportAccumulatorSerde() const {
return fn_->supportAccumulatorSerde();
}
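
Two things change in the companion adapter: it now records the offsets on itself before forwarding them to the wrapped function, and it forwards the JIT capability queries after rewriting the plan context. A plausible reading is that the adapter's own offsets must be valid because the non-virtual `createHashAggrJitSlot` reads them from the outer object, but the diff does not state this. The sketch below shows both forwarding steps in reduced form. `Fn`, `RawSum`, `PartialCompanion`, and in particular the body of `rewriteContext` are hypothetical; the real `rewriteHashAggrJitContext` is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

struct PlanContext {
  bool rawInput;
};

// Hypothetical reduced aggregate interface.
class Fn {
 public:
  virtual ~Fn() = default;

  void setOffsets(int32_t offset) { setOffsetsInternal(offset); }
  int32_t accumulatorOffset() const { return offset_; }

  virtual bool supportsJit(const PlanContext& /*context*/) const {
    return false;
  }

 protected:
  virtual void setOffsetsInternal(int32_t offset) { offset_ = offset; }

 private:
  int32_t offset_{-1};
};

// Hypothetical wrapped function: JIT-capable for raw input only.
class RawSum : public Fn {
 public:
  bool supportsJit(const PlanContext& context) const override {
    return context.rawInput;
  }
};

// Hypothetical companion adapter wrapping another aggregate.
class PartialCompanion : public Fn {
 public:
  explicit PartialCompanion(std::unique_ptr<Fn> fn) : fn_(std::move(fn)) {
    assert(fn_ != nullptr);
  }

  // Capability queries are answered by the wrapped function, seen through
  // the rewritten context.
  bool supportsJit(const PlanContext& context) const override {
    return fn_->supportsJit(rewriteContext(context));
  }

  const Fn& wrapped() const { return *fn_; }

 protected:
  void setOffsetsInternal(int32_t offset) override {
    Fn::setOffsetsInternal(offset); // keep the adapter's own copy in sync
    fn_->setOffsets(offset); // and forward to the wrapped function
  }

 private:
  // Assumed rewrite: this companion always presents raw input to fn_.
  static PlanContext rewriteContext(PlanContext context) {
    context.rawInput = true;
    return context;
  }

  std::unique_ptr<Fn> fn_;
};
```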