perf(expression): Eager full-base fill in evalWithMemo for cheap expressions #2172

Open
yingsu00 wants to merge 1 commit into IBM:bolt from yingsu00:cast-perf-03-memo-eager-fill


@yingsu00

Collaborator

Adds a fast path in Expr::evalWithMemo: on the second sighting of a dictionary base, when the expression is cheap to re-evaluate and non-throwing, fill every position of the base into dictionaryCache_ in one shot. Subsequent batches over the same base then hit the cache-covers-base bypass in peelEncodings and return the cached vector directly without per-row work.
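To make the control flow concrete, here is a minimal, self-contained sketch of the "fill the whole base on the second sighting" idea. It is a toy model, not the code in this PR: `MemoSketch`, `Base`, `coversBase()` and `evalCalls()` are hypothetical names, the base is a plain `std::vector<int>` identified by its address, and the peelEncodings bypass itself is not modeled (only the "cache covers base" state it depends on).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy dictionary base; its address stands in for the base vector's identity.
using Base = std::vector<int>;

// Hypothetical memo. Member names do not mirror the real Expr members.
class MemoSketch {
 public:
  MemoSketch(std::function<int(int)> fn, bool cheap)
      : fn_(std::move(fn)), cheap_(cheap) {}

  // Returns fn(base[i]) for each i in indices, reusing cached results.
  std::vector<int> eval(const Base& base,
                        const std::vector<std::size_t>& indices) {
    if (&base != base_) {
      // First sighting of this base: reset the cache.
      base_ = &base;
      sightings_ = 1;
      values_.assign(base.size(), 0);
      cached_.assign(base.size(), false);
      cachedCount_ = 0;
    } else if (++sightings_ == 2 && cheap_) {
      // Second sighting of a cheap expression: fill every base position.
      for (std::size_t i = 0; i < base.size(); ++i) {
        fillOne(base, i);
      }
    }
    std::vector<int> out;
    out.reserve(indices.size());
    for (std::size_t i : indices) {
      assert(i < base.size());
      fillOne(base, i);  // No-op once the position is cached.
      out.push_back(values_[i]);
    }
    return out;
  }

  // True once every position of the current base has a cached result.
  bool coversBase() const {
    return base_ != nullptr && cachedCount_ == cached_.size();
  }

  // Number of times the wrapped function has actually been evaluated.
  std::size_t evalCalls() const { return evalCalls_; }

 private:
  void fillOne(const Base& base, std::size_t i) {
    if (!cached_[i]) {
      values_[i] = fn_(base[i]);
      cached_[i] = true;
      ++cachedCount_;
      ++evalCalls_;
    }
  }

  std::function<int(int)> fn_;
  bool cheap_;
  const Base* base_ = nullptr;
  int sightings_ = 0;
  std::vector<int> values_;
  std::vector<bool> cached_;
  std::size_t cachedCount_ = 0;
  std::size_t evalCalls_ = 0;
};
```

In this sketch a first batch over positions {0, 1} of a four-entry base caches two results; the second batch over the same base fills the remaining two, after which `coversBase()` is true and later batches trigger no further evaluation.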

The classification of "cheap" lives on Expr::isCheapToReevaluate():

  • Expr's base implementation (in Expr.cpp) returns true for function calls whose registered name is in the curated cheapFunctionNames() set. The set is conservative: it includes only functions that are both cheap per row and non-throwing on plausible inputs. Included: date / time accessors, date / time arithmetic and formatting, simple string ops, and math that returns NaN / Inf instead of throwing. Deliberately omitted from the set: casts, arithmetic with divide / mod, parsing functions, regex, json, and crypto. This covers common expressions such as date_format(...), date_trunc(...), substr(...), and length(...) over dictionary-encoded inputs.

  • CastExpr overrides it to return true for fast numeric upcasts, DATE -> TIMESTAMP, and DATE -> VARCHAR.
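The two-level classification above can be sketched as a virtual method with a name-set default and a cast-specific override. Everything here is illustrative and assumed: `ExprSketch`, `CastExprSketch`, `TypeKind` and `cheapFunctionNamesSketch()` are hypothetical stand-ins, the name set holds only the four functions named in this description, and "fast numeric upcast" is modeled as integer widening only.

```cpp
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical type tags; the integer kinds are ordered by width.
enum class TypeKind {
  kTinyint, kSmallint, kInteger, kBigint, kDate, kTimestamp, kVarchar
};

// Stand-in for the curated set; the real cheapFunctionNames() is larger.
const std::unordered_set<std::string>& cheapFunctionNamesSketch() {
  static const std::unordered_set<std::string> kNames = {
      "date_format", "date_trunc", "substr", "length"};
  return kNames;
}

class ExprSketch {
 public:
  explicit ExprSketch(std::string name) : name_(std::move(name)) {}
  virtual ~ExprSketch() = default;

  // Default: cheap only if the registered name is in the curated set.
  virtual bool isCheapToReevaluate() const {
    return cheapFunctionNamesSketch().count(name_) > 0;
  }

 private:
  std::string name_;
};

class CastExprSketch : public ExprSketch {
 public:
  CastExprSketch(TypeKind from, TypeKind to)
      : ExprSketch("cast"), from_(from), to_(to) {}

  // Casts are not in the name set; only selected type pairs count as cheap.
  bool isCheapToReevaluate() const override {
    if (from_ == TypeKind::kDate) {
      return to_ == TypeKind::kTimestamp || to_ == TypeKind::kVarchar;
    }
    // Assumption: "numeric upcast" means widening between integer kinds.
    return isInteger(from_) && isInteger(to_) &&
        static_cast<int>(from_) < static_cast<int>(to_);
  }

 private:
  static bool isInteger(TypeKind kind) {
    return static_cast<int>(kind) <= static_cast<int>(TypeKind::kBigint);
  }

  TypeKind from_;
  TypeKind to_;
};
```

Keeping casts out of the name set and deciding them per type pair means a narrowing or parsing cast, which can throw, is never treated as cheap by default.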

The eager-fill block also has to choose between deselecting already-cached positions and re-evaluating the full base. Deselecting on the SelectivityVector costs O(baseSize / 64), and the resulting sparse toFill makes the subsequent evalWithNulls iterate set bit by set bit instead of running over a dense range. When only a minority of base positions are cached, re-evaluating them costs less than the deselect plus sparse iteration, so full re-eval is faster. When at least half are cached, deselect saves enough work to be worth it. The threshold is 50%: deselect when cachedCount * 2 >= baseSize. The same decision drives both toFill (the rows to evaluate) and writable (the positions ensureWritable must make mutable on dictionaryCache_).
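The 50% rule can be illustrated with a small stand-alone sketch. It uses `std::vector<bool>` in place of SelectivityVector, and `shouldDeselectCached` and `rowsToFill` are hypothetical helper names chosen for this example, not functions from the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Deselect cached rows only when at least half of the base is cached.
bool shouldDeselectCached(std::size_t cachedCount, std::size_t baseSize) {
  return cachedCount * 2 >= baseSize;
}

// Returns the rows to evaluate: either the dense full range, or the full
// range minus the already-cached positions.
std::vector<bool> rowsToFill(const std::vector<bool>& cached) {
  const std::size_t baseSize = cached.size();
  const std::size_t cachedCount = static_cast<std::size_t>(
      std::count(cached.begin(), cached.end(), true));

  std::vector<bool> toFill(baseSize, true);  // Start with every row selected.
  if (shouldDeselectCached(cachedCount, baseSize)) {
    for (std::size_t i = 0; i < baseSize; ++i) {
      if (cached[i]) {
        toFill[i] = false;  // Skip rows that already have a cached result.
      }
    }
  }
  return toFill;
}
```

With one of four positions cached (25%) the sketch keeps all four rows selected and re-evaluates the cached one; with two of four cached (50%) it deselects the two cached rows and evaluates only the rest.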

Why now: a production query with `date_format(CAST(date_trunc(...) AS timestamp), '%Y-%m-%d')` on a hot column was showing ~2% of total process CPU in FlatVector<StringView>::copy -> acquireSharedStringBuffers -> addStringBuffer. The atomic refcount increment in intrusive_ptr<Buffer>::push_back is a full memory barrier; on a Buffer shared across drivers the cache line bounces and each increment stalls hundreds of cycles. The bypass in peelEncodings (separate commit) already sidesteps that entire chain on cache-hit batches, but it only fires after eager-fill has populated the whole base. Without eager-fill for date_format, the cache filled only incrementally and the bypass never reached the "covers base" threshold for many production-sized bases.

1626/1626 velox_expression_test pass.

yingsu00 requested review from rui-mo and xin-zhang2 on June 23, 2026 03:40
yingsu00 self-assigned this on Jun 23, 2026
yingsu00 added the bolt label on Jun 23, 2026