perf(expression): Eager full-base fill in evalWithMemo for cheap expressions #2172
Open
yingsu00 wants to merge 1 commit into
Adds a fast path in Expr::evalWithMemo: on the second sighting of a dictionary base, when the expression is cheap to re-evaluate and non-throwing, fill every position of the base into dictionaryCache_ in one shot. Subsequent batches over the same base then hit the cache-covers-base bypass in peelEncodings and return the cached vector directly without per-row work.
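The "second sighting" behavior can be illustrated with a toy model. This is a minimal sketch, not the Velox code: `ToyMemo`, its fields, and the `cheapAndNonThrowing` flag are hypothetical names, the cache is a plain `std::vector`, and the final loop stands in for the peelEncodings bypass (which, per the description, returns the cached vector directly rather than looking up row by row).

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Toy memo for a single dictionary base. All names are hypothetical.
struct ToyMemo {
  const std::vector<int>* base = nullptr; // last dictionary base seen
  int sightings = 0;                      // how often that base was seen
  std::vector<int> cache;                 // result per base position
  std::vector<bool> cached;               // which positions are filled
  std::size_t evalCalls = 0;              // number of calls made to fn

  // Evaluates fn over dictBase[indices[i]], reusing cached results.
  std::vector<int> eval(
      const std::vector<int>& dictBase,
      const std::vector<std::size_t>& indices,
      const std::function<int(int)>& fn,
      bool cheapAndNonThrowing) {
    if (base != &dictBase) {
      // New base: drop everything remembered about the old one.
      base = &dictBase;
      sightings = 0;
      cache.assign(dictBase.size(), 0);
      cached.assign(dictBase.size(), false);
    }
    ++sightings;

    if (sightings == 2 && cheapAndNonThrowing) {
      // Eager fill: evaluate every base position that is still missing,
      // including positions no batch has referenced yet.
      for (std::size_t i = 0; i < dictBase.size(); ++i) {
        if (!cached[i]) {
          cache[i] = fn(dictBase[i]);
          cached[i] = true;
          ++evalCalls;
        }
      }
    }

    std::vector<int> out;
    out.reserve(indices.size());
    for (std::size_t idx : indices) {
      if (!cached[idx]) { // incremental fill on a cache miss
        cache[idx] = fn(dictBase[idx]);
        cached[idx] = true;
        ++evalCalls;
      }
      out.push_back(cache[idx]);
    }
    return out;
  }
};
```

In this model, once the second batch over the same base has run, later batches never call `fn` again, which is the property the bypass relies on.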
The classification of "cheap" lives on Expr::isCheapToReevaluate():
* Expr's base implementation (in Expr.cpp) returns true for function calls whose registered name is in the curated cheapFunctionNames() set. The set is conservative: only entries that are both cheap per-row AND non-throwing on plausible inputs are included. Casts, arithmetic with divide / mod, parsing functions, regex, json, and crypto are deliberately omitted. Date / time accessors, date / time arithmetic and formatting, simple string ops, and non-throwing math (NaN / Inf instead of exceptions) are included. This covers common expressions like date_format(...), date_trunc(...), substr(...), length(...) over dictionary-encoded inputs.
* CastExpr overrides to return true for fast numeric upcasts, DATE -> TIMESTAMP, and DATE -> VARCHAR.
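The two-part classification above might look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: the `Kind` enum, `isCheapFunction`, and `isCheapCast` are hypothetical free functions standing in for the virtual `isCheapToReevaluate()` overrides, the name set contains only the four functions named in the description, and INTEGER -> BIGINT is an assumed example of a "fast numeric upcast".

```cpp
#include <string>
#include <unordered_set>

// Hypothetical, simplified type tags; not Velox's TypeKind.
enum class Kind { kInteger, kBigint, kDouble, kDate, kTimestamp, kVarchar };

// Curated allow-list. Only the names mentioned in the description are
// listed here; the real set is larger.
const std::unordered_set<std::string>& cheapFunctionNames() {
  static const std::unordered_set<std::string> kNames = {
      "date_format", "date_trunc", "substr", "length"};
  return kNames;
}

// Base rule: a function call is cheap only if its registered name is
// on the allow-list. Anything unknown is treated as not cheap.
bool isCheapFunction(const std::string& name) {
  return cheapFunctionNames().count(name) > 0;
}

// Cast rule: only a few conversions qualify.
bool isCheapCast(Kind from, Kind to) {
  if (from == Kind::kDate) {
    return to == Kind::kTimestamp || to == Kind::kVarchar;
  }
  // Assumed example of a fast numeric upcast.
  return from == Kind::kInteger && to == Kind::kBigint;
}
```

An allow-list fails closed: a newly registered function is not eagerly filled until someone adds it to the set, which matches the conservative stance described above.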
The eager-fill block also has to choose between deselecting already-cached positions and re-evaluating the full base. Deselecting on the SelectivityVector is O(base / 64), and the resulting sparse toFill makes the subsequent evalWithNulls iterate set-bit-by-set-bit instead of running over a dense range. When only a minority of base positions are cached, the extra cost of re-evaluating them is small compared to the deselect + sparse-iteration cost, so full re-eval is faster. When the majority is cached, deselect saves enough work to be worth it. The threshold is 50% (cachedCount * 2 >= baseSize). The same deselect-or-not decision drives both toFill (the rows to evaluate) and writable (the positions ensureWritable must make mutable on dictionaryCache_).
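The threshold rule can be written down in a few lines. The sketch below assumes a `std::vector<bool>` in place of SelectivityVector, and `shouldDeselectCached` and `rowsToFill` are hypothetical helper names; only the comparison `cachedCount * 2 >= baseSize` comes from the description.

```cpp
#include <cstddef>
#include <vector>

// True when at least half of the base is already cached, i.e. when
// skipping cached rows is expected to pay for the deselect cost.
bool shouldDeselectCached(std::size_t cachedCount, std::size_t baseSize) {
  return cachedCount * 2 >= baseSize;
}

// Returns the rows to evaluate during eager fill. cached[i] says
// whether base position i already has a cached result.
std::vector<bool> rowsToFill(const std::vector<bool>& cached) {
  std::size_t cachedCount = 0;
  for (bool isCached : cached) {
    cachedCount += isCached ? 1 : 0;
  }

  // Start dense: every base position selected.
  std::vector<bool> toFill(cached.size(), true);
  if (shouldDeselectCached(cachedCount, cached.size())) {
    // Majority cached: evaluate only the missing positions.
    for (std::size_t i = 0; i < cached.size(); ++i) {
      if (cached[i]) {
        toFill[i] = false;
      }
    }
  }
  // Minority cached: keep the dense range and recompute the few
  // cached rows rather than iterate a sparse selection.
  return toFill;
}
```

With one of four positions cached the result stays dense; with two of four cached the comparison holds and only the two missing positions remain selected.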
Why now: a production query with `date_format(CAST(date_trunc(...) AS timestamp), '%Y-%m-%d')` on a hot column was showing ~2% of total process CPU in FlatVector<StringView>::copy -> acquireSharedStringBuffers -> addStringBuffer. The atomic refcount increment in intrusive_ptr<Buffer>::push_back is a full memory barrier; on a Buffer shared across drivers the cache line bounces and each increment stalls hundreds of cycles. The bypass in peelEncodings (separate commit) already sidesteps that entire chain on cache-hit batches, but it only fires after eager-fill has populated the whole base. Without eager-fill for date_format, the cache filled only incrementally and the bypass never reached the "covers base" threshold for many production-sized bases.
1626/1626 velox_expression_test pass.