perf(expression): Improve cast from date#2173
Open
yingsu00 wants to merge 4 commits into
Open
Conversation
…essions Adds a fast path in Expr::evalWithMemo: on the second sighting of a dictionary base, when the expression is cheap to re-evaluate and non-throwing, fill every position of the base into dictionaryCache_ in one shot. Subsequent batches over the same base then hit the cache-covers-base bypass in peelEncodings and return the cached vector directly without per-row work. The classification of "cheap" lives on Expr::isCheapToReevaluate(): * Expr's base implementation (in Expr.cpp) returns true for function calls whose registered name is in the curated cheapFunctionNames() set. The set is conservative: only entries that are both cheap per-row AND non-throwing on plausible inputs are included. Casts, arithmetic with divide / mod, parsing functions, regex, json, and crypto are deliberately omitted. Date / time accessors, date / time arithmetic and formatting, simple string ops, and non-throwing math (NaN / Inf instead of exceptions) are included. This covers common expressions like date_format(...), date_trunc(...), substr(...), length(...) over dictionary-encoded inputs. * CastExpr overrides to return true for fast numeric upcasts, DATE -> TIMESTAMP, and DATE -> VARCHAR. The eager-fill block also exposes a deselect-vs-full-reeval choice: the SelectivityVector deselect of already-cached positions is O(base / 64) and the resulting sparse toFill makes the subsequent evalWithNulls iterate set-bit-by-set-bit instead of running over a dense range. When only a minority of base positions are cached, the extra eval cost of re-running on cached positions is small compared to the deselect + sparse-iteration cost; full re-eval is faster. When the majority is cached, deselect saves enough work to be worth it. Threshold at 50% (cachedCount * 2 >= baseSize). The same deselect-or-not decision drives both toFill (the rows to evaluate) and writable (the positions ensureWritable must make mutable on dictionaryCache_). Why now: a production query with `date_format(CAST(date_trunc(...)) AS timestamp), '%Y-%m-%d')` on a hot column was showing ~2% of total process CPU in FlatVector<StringView>::copy -> acquireSharedStringBuffers -> addStringBuffer. The atomic refcount increment in intrusive_ptr<Buffer>::push_back is a full memory barrier; on a Buffer shared across drivers the cache line bounces and each increment stalls hundreds of cycles. The bypass in peelEncodings (separate commit) already sidesteps that entire chain on cache-hit batches - but the bypass only fires after eager-fill has populated the whole base. Without eager-fill for date_format, the cache filled only incrementally and the bypass never reached the "covers base" threshold for many production-sized bases. 1626/1626 velox_expression_test pass.
…st loops Several CastExpr kernels evaluated a per-row `if (timeZone)` check inside the per-row loop even though the timezone is fixed for the whole call. Split each affected kernel into two specialized loops chosen once based on whether a timezone is configured. The inner body of each loop is now straight-line: no per-row condition, more inlinable, and more amenable to vectorization across the row range. Sites updated: - castFromDate(TIMESTAMP): one loop applies hooks_->castDateTimestampToGMT unconditionally; the other does pure date->ms arithmetic. - castToTime(VARCHAR): one loop calls timeZone->to_local; the other uses systemTime % kMillisInDay directly. The shared post-processing (sign fixup, valueToString, setNoCopy) is factored into a small inline body. No behavior change. CastExprTest passes unchanged.
The CastExpr DATE->TIMESTAMP kernel used to iterate the selected rows itself and dispatch through a per-row virtual hook for the optional timezone adjustment. Even after caching hooks_.get() and marking PrestoCastHooks final, every row paid one indirect virtual call that the compiler could not see into - inlining and any vectorization of Timestamp::fromMillis + toGMT + flat store stayed off the table. Add a per-batch hook castDateToTimestampVector on CastHooks that owns the row loop. The default implementation in the base class iterates and calls the existing per-row castDateTimestampToGMT, so subclasses that don't override get equivalent behavior. PrestoCastHooks overrides the new method and calls Timestamp::toGMT directly - the loop body is now concrete code in PrestoCastHooks.cpp that the compiler can inline and auto-vectorize the parts of (the multiply, fromMillis, store). CastExpr.cpp's call site collapses to a single per-batch virtual dispatch. Benchmark (release, WideningCastBenchmark on this host): Entry before after delta FLAT_DateToTimestamp(1000_100) 12.57 10.44 -17% FLAT_DateToTimestamp(1000_1000) 2.31 1.73 -25% FLAT_DateToTimestamp(1000_10000) 1.26 0.91 -28% The flat path shows the biggest gains because no memoization or eager-fill is short-circuiting the loop - the per-row virtual call overhead was the dominant non-arithmetic cost. The dict path gains are smaller because the eager-fill path already collapses subsequent batches to a wrap. No behavior change. CastExprTest passes unchanged.
The DATE -> VARCHAR kernel previously did three allocations per row to
write 10 characters:
auto output = DATE()->toString(days); // std::string (sometimes SSO,
// sometimes heap).
auto writer = StringWriter(result, row);
writer.resize(output.size()); // First call allocates a 32KB
// buffer via getBufferWithSpace.
::memcpy(writer.data(), output.data(), output.size());
writer.finalize(); // setNoCopy(StringView(buf, 10))
// - inline because 10 <= 12,
// so the 32KB buffer is held
// by the vector but its bytes
// are never referenced.
The 32KB buffer is pure waste: every successful DATE -> VARCHAR output
is "YYYY-MM-DD" (10 bytes) or signed-year variants up to ~12 bytes, all
of which fit inline in StringView's 12-byte payload and thus copy out
of any source buffer.
Format directly into a 17-byte stack array via Timestamp::tmToStringView
- the same low-level routine DateType::toIso8601 already uses - and
construct the StringView inline. For the common case (year in roughly
[-9999, 999999], i.e. every realistic DATE value), the StringView ctor
copies the 10-12 bytes into its own inline storage and the stack
buffer is unreferenced after the row. Only when the formatted output
exceeds 12 bytes do we fall back to copying into a result-owned
buffer via getRawStringBufferWithSpace, preserving correctness for
edge-case large years.
Folly benchmark (release, WideningCastBenchmark) FLAT path - no
memoization or eager-fill confounders:
Entry before after speedup
FLAT_DateToVarchar(1000_100) 64.45 34.79 1.85x
FLAT_DateToVarchar(1000_1000) 41.24 26.82 1.54x
DICT entries hit the eager-fill path (cache fills once, subsequent
batches are O(1) wraps) so the per-row formatting cost only counts
during the initial fill; their numbers move less.
CastExprTest passes unchanged.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.