perf(expression): Improve cast from date by yingsu00 · Pull Request #2173 · IBM/velox

yingsu00 · 2026-06-23T03:42:40Z

No description provided.

…essions Adds a fast path in Expr::evalWithMemo: on the second sighting of a dictionary base, when the expression is cheap to re-evaluate and non-throwing, fill every position of the base into dictionaryCache_ in one shot. Subsequent batches over the same base then hit the cache-covers-base bypass in peelEncodings and return the cached vector directly without per-row work. The classification of "cheap" lives on Expr::isCheapToReevaluate(): * Expr's base implementation (in Expr.cpp) returns true for function calls whose registered name is in the curated cheapFunctionNames() set. The set is conservative: only entries that are both cheap per-row AND non-throwing on plausible inputs are included. Casts, arithmetic with divide / mod, parsing functions, regex, json, and crypto are deliberately omitted. Date / time accessors, date / time arithmetic and formatting, simple string ops, and non-throwing math (NaN / Inf instead of exceptions) are included. This covers common expressions like date_format(...), date_trunc(...), substr(...), length(...) over dictionary-encoded inputs. * CastExpr overrides to return true for fast numeric upcasts, DATE -> TIMESTAMP, and DATE -> VARCHAR. The eager-fill block also exposes a deselect-vs-full-reeval choice: the SelectivityVector deselect of already-cached positions is O(base / 64) and the resulting sparse toFill makes the subsequent evalWithNulls iterate set-bit-by-set-bit instead of running over a dense range. When only a minority of base positions are cached, the extra eval cost of re-running on cached positions is small compared to the deselect + sparse-iteration cost; full re-eval is faster. When the majority is cached, deselect saves enough work to be worth it. Threshold at 50% (cachedCount * 2 >= baseSize). The same deselect-or-not decision drives both toFill (the rows to evaluate) and writable (the positions ensureWritable must make mutable on dictionaryCache_). Why now: a production query with `date_format(CAST(date_trunc(...)) AS timestamp), '%Y-%m-%d')` on a hot column was showing ~2% of total process CPU in FlatVector<StringView>::copy -> acquireSharedStringBuffers -> addStringBuffer. The atomic refcount increment in intrusive_ptr<Buffer>::push_back is a full memory barrier; on a Buffer shared across drivers the cache line bounces and each increment stalls hundreds of cycles. The bypass in peelEncodings (separate commit) already sidesteps that entire chain on cache-hit batches - but the bypass only fires after eager-fill has populated the whole base. Without eager-fill for date_format, the cache filled only incrementally and the bypass never reached the "covers base" threshold for many production-sized bases. 1626/1626 velox_expression_test pass.

…st loops Several CastExpr kernels evaluated a per-row `if (timeZone)` check inside the per-row loop even though the timezone is fixed for the whole call. Split each affected kernel into two specialized loops chosen once based on whether a timezone is configured. The inner body of each loop is now straight-line: no per-row condition, more inlinable, and more amenable to vectorization across the row range. Sites updated: - castFromDate(TIMESTAMP): one loop applies hooks_->castDateTimestampToGMT unconditionally; the other does pure date->ms arithmetic. - castToTime(VARCHAR): one loop calls timeZone->to_local; the other uses systemTime % kMillisInDay directly. The shared post-processing (sign fixup, valueToString, setNoCopy) is factored into a small inline body. No behavior change. CastExprTest passes unchanged.

The CastExpr DATE->TIMESTAMP kernel used to iterate the selected rows itself and dispatch through a per-row virtual hook for the optional timezone adjustment. Even after caching hooks_.get() and marking PrestoCastHooks final, every row paid one indirect virtual call that the compiler could not see into - inlining and any vectorization of Timestamp::fromMillis + toGMT + flat store stayed off the table. Add a per-batch hook castDateToTimestampVector on CastHooks that owns the row loop. The default implementation in the base class iterates and calls the existing per-row castDateTimestampToGMT, so subclasses that don't override get equivalent behavior. PrestoCastHooks overrides the new method and calls Timestamp::toGMT directly - the loop body is now concrete code in PrestoCastHooks.cpp that the compiler can inline and auto-vectorize the parts of (the multiply, fromMillis, store). CastExpr.cpp's call site collapses to a single per-batch virtual dispatch. Benchmark (release, WideningCastBenchmark on this host): Entry before after delta FLAT_DateToTimestamp(1000_100) 12.57 10.44 -17% FLAT_DateToTimestamp(1000_1000) 2.31 1.73 -25% FLAT_DateToTimestamp(1000_10000) 1.26 0.91 -28% The flat path shows the biggest gains because no memoization or eager-fill is short-circuiting the loop - the per-row virtual call overhead was the dominant non-arithmetic cost. The dict path gains are smaller because the eager-fill path already collapses subsequent batches to a wrap. No behavior change. CastExprTest passes unchanged.

The DATE -> VARCHAR kernel previously did three allocations per row to write 10 characters: auto output = DATE()->toString(days); // std::string (sometimes SSO, // sometimes heap). auto writer = StringWriter(result, row); writer.resize(output.size()); // First call allocates a 32KB // buffer via getBufferWithSpace. ::memcpy(writer.data(), output.data(), output.size()); writer.finalize(); // setNoCopy(StringView(buf, 10)) // - inline because 10 <= 12, // so the 32KB buffer is held // by the vector but its bytes // are never referenced. The 32KB buffer is pure waste: every successful DATE -> VARCHAR output is "YYYY-MM-DD" (10 bytes) or signed-year variants up to ~12 bytes, all of which fit inline in StringView's 12-byte payload and thus copy out of any source buffer. Format directly into a 17-byte stack array via Timestamp::tmToStringView - the same low-level routine DateType::toIso8601 already uses - and construct the StringView inline. For the common case (year in roughly [-9999, 999999], i.e. every realistic DATE value), the StringView ctor copies the 10-12 bytes into its own inline storage and the stack buffer is unreferenced after the row. Only when the formatted output exceeds 12 bytes do we fall back to copying into a result-owned buffer via getRawStringBufferWithSpace, preserving correctness for edge-case large years. Folly benchmark (release, WideningCastBenchmark) FLAT path - no memoization or eager-fill confounders: Entry before after speedup FLAT_DateToVarchar(1000_100) 64.45 34.79 1.85x FLAT_DateToVarchar(1000_1000) 41.24 26.82 1.54x DICT entries hit the eager-fill path (cache fills once, subsequent batches are O(1) wraps) so the per-row formatting cost only counts during the initial fill; their numbers move less. CastExprTest passes unchanged.

yingsu00 added 4 commits June 18, 2026 21:42

yingsu00 requested review from rui-mo and xin-zhang2 June 23, 2026 03:42

yingsu00 self-assigned this Jun 23, 2026

yingsu00 added the bolt label Jun 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(expression): Improve cast from date#2173

perf(expression): Improve cast from date#2173
yingsu00 wants to merge 4 commits into
IBM:boltfrom
yingsu00:cast-perf-04-cast-from-date

yingsu00 commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

yingsu00 commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant