Skip to content

perf(expression): Improve cast from date#2173

Open
yingsu00 wants to merge 4 commits into
IBM:boltfrom
yingsu00:cast-perf-04-cast-from-date
Open

perf(expression): Improve cast from date#2173
yingsu00 wants to merge 4 commits into
IBM:boltfrom
yingsu00:cast-perf-04-cast-from-date

Conversation

@yingsu00

Copy link
Copy Markdown
Collaborator

No description provided.

yingsu00 added 4 commits June 18, 2026 21:42
…essions

Adds a fast path in Expr::evalWithMemo: on the second sighting of a
dictionary base, when the expression is cheap to re-evaluate and
non-throwing, fill every position of the base into dictionaryCache_
in one shot. Subsequent batches over the same base then hit the
cache-covers-base bypass in peelEncodings and return the cached
vector directly without per-row work.

The classification of "cheap" lives on Expr::isCheapToReevaluate():

* Expr's base implementation (in Expr.cpp) returns true for function
  calls whose registered name is in the curated cheapFunctionNames()
  set. The set is conservative: only entries that are both cheap
  per-row AND non-throwing on plausible inputs are included. Casts,
  arithmetic with divide / mod, parsing functions, regex, json, and
  crypto are deliberately omitted. Date / time accessors,
  date / time arithmetic and formatting, simple string ops, and
  non-throwing math (NaN / Inf instead of exceptions) are included.
  This covers common expressions like date_format(...),
  date_trunc(...), substr(...), length(...) over dictionary-encoded
  inputs.

* CastExpr overrides to return true for fast numeric upcasts,
  DATE -> TIMESTAMP, and DATE -> VARCHAR.

The eager-fill block also exposes a deselect-vs-full-reeval choice:
the SelectivityVector deselect of already-cached positions is O(base
/ 64) and the resulting sparse toFill makes the subsequent
evalWithNulls iterate set-bit-by-set-bit instead of running over a
dense range. When only a minority of base positions are cached, the
extra eval cost of re-running on cached positions is small compared
to the deselect + sparse-iteration cost; full re-eval is faster.
When the majority is cached, deselect saves enough work to be worth
it. Threshold at 50% (cachedCount * 2 >= baseSize). The same
deselect-or-not decision drives both toFill (the rows to evaluate)
and writable (the positions ensureWritable must make mutable on
dictionaryCache_).

Why now: a production query with
`date_format(CAST(date_trunc(...)) AS timestamp), '%Y-%m-%d')` on a
hot column was showing ~2% of total process CPU in
FlatVector<StringView>::copy ->
acquireSharedStringBuffers -> addStringBuffer. The atomic refcount
increment in intrusive_ptr<Buffer>::push_back is a full memory
barrier; on a Buffer shared across drivers the cache line bounces
and each increment stalls hundreds of cycles. The bypass in
peelEncodings (separate commit) already sidesteps that entire chain
on cache-hit batches - but the bypass only fires after eager-fill
has populated the whole base. Without eager-fill for date_format, the
cache filled only incrementally and the bypass never reached the
"covers base" threshold for many production-sized bases.

1626/1626 velox_expression_test pass.
…st loops

Several CastExpr kernels evaluated a per-row `if (timeZone)` check inside
the per-row loop even though the timezone is fixed for the whole call.
Split each affected kernel into two specialized loops chosen once based
on whether a timezone is configured. The inner body of each loop is now
straight-line: no per-row condition, more inlinable, and more amenable
to vectorization across the row range.

Sites updated:
- castFromDate(TIMESTAMP): one loop applies hooks_->castDateTimestampToGMT
  unconditionally; the other does pure date->ms arithmetic.
- castToTime(VARCHAR): one loop calls timeZone->to_local; the other uses
  systemTime % kMillisInDay directly. The shared post-processing (sign
  fixup, valueToString, setNoCopy) is factored into a small inline body.

No behavior change. CastExprTest passes unchanged.
The CastExpr DATE->TIMESTAMP kernel used to iterate the selected rows
itself and dispatch through a per-row virtual hook for the optional
timezone adjustment. Even after caching hooks_.get() and marking
PrestoCastHooks final, every row paid one indirect virtual call that
the compiler could not see into - inlining and any vectorization of
Timestamp::fromMillis + toGMT + flat store stayed off the table.

Add a per-batch hook castDateToTimestampVector on CastHooks that owns
the row loop. The default implementation in the base class iterates
and calls the existing per-row castDateTimestampToGMT, so subclasses
that don't override get equivalent behavior. PrestoCastHooks overrides
the new method and calls Timestamp::toGMT directly - the loop body is
now concrete code in PrestoCastHooks.cpp that the compiler can inline
and auto-vectorize the parts of (the multiply, fromMillis, store).
CastExpr.cpp's call site collapses to a single per-batch virtual
dispatch.

Benchmark (release, WideningCastBenchmark on this host):

  Entry                                  before    after   delta
  FLAT_DateToTimestamp(1000_100)         12.57    10.44   -17%
  FLAT_DateToTimestamp(1000_1000)         2.31     1.73   -25%
  FLAT_DateToTimestamp(1000_10000)        1.26     0.91   -28%

The flat path shows the biggest gains because no memoization or
eager-fill is short-circuiting the loop - the per-row virtual call
overhead was the dominant non-arithmetic cost. The dict path gains
are smaller because the eager-fill path already collapses subsequent
batches to a wrap.

No behavior change. CastExprTest passes unchanged.
The DATE -> VARCHAR kernel previously did three allocations per row to
write 10 characters:

  auto output = DATE()->toString(days);   // std::string (sometimes SSO,
                                          //   sometimes heap).
  auto writer = StringWriter(result, row);
  writer.resize(output.size());           // First call allocates a 32KB
                                          //   buffer via getBufferWithSpace.
  ::memcpy(writer.data(), output.data(), output.size());
  writer.finalize();                      // setNoCopy(StringView(buf, 10))
                                          //   - inline because 10 <= 12,
                                          //   so the 32KB buffer is held
                                          //   by the vector but its bytes
                                          //   are never referenced.

The 32KB buffer is pure waste: every successful DATE -> VARCHAR output
is "YYYY-MM-DD" (10 bytes) or signed-year variants up to ~12 bytes, all
of which fit inline in StringView's 12-byte payload and thus copy out
of any source buffer.

Format directly into a 17-byte stack array via Timestamp::tmToStringView
- the same low-level routine DateType::toIso8601 already uses - and
construct the StringView inline. For the common case (year in roughly
[-9999, 999999], i.e. every realistic DATE value), the StringView ctor
copies the 10-12 bytes into its own inline storage and the stack
buffer is unreferenced after the row. Only when the formatted output
exceeds 12 bytes do we fall back to copying into a result-owned
buffer via getRawStringBufferWithSpace, preserving correctness for
edge-case large years.

Folly benchmark (release, WideningCastBenchmark) FLAT path - no
memoization or eager-fill confounders:

  Entry                                  before    after   speedup
  FLAT_DateToVarchar(1000_100)           64.45    34.79    1.85x
  FLAT_DateToVarchar(1000_1000)          41.24    26.82    1.54x

DICT entries hit the eager-fill path (cache fills once, subsequent
batches are O(1) wraps) so the per-row formatting cost only counts
during the initial fill; their numbers move less.

CastExprTest passes unchanged.
@yingsu00 yingsu00 requested review from rui-mo and xin-zhang2 June 23, 2026 03:42
@yingsu00 yingsu00 self-assigned this Jun 23, 2026
@yingsu00 yingsu00 added the bolt label Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant