feat(serde): Coerce narrow encoding to widened type during PrestoSerializer deserialization #2169
Open
yingsu00 wants to merge 1 commit into
When PushDownWidenCast rewrites a consumer fragment's source operator
(TableScan or RemoteSourceNode) to declare a wider type, the producer
fragment still emits narrow-typed bytes on the wire (its outputLayout
stays narrow). The native worker's Exchange operator deserializes those
pages via PrestoVectorSerde and must coerce narrow->wide on receive.
Previously the deserialization path threw on the encoding mismatch:
"Serialized encoding is not compatible with requested type: BIGINT.
Expected LONG_ARRAY. Got INT_ARRAY."
Two call sites of `checkTypeEncoding` are patched, mirroring the two-pass
deserialization in `readTopColumns`:
1. `readColumns` (main read pass): Add `tryReadWidenedColumn` which
reads the size + null header, then dispatches to `readWideningValues
<SourceT, TargetT, Converter>` to read narrow bytes from the stream
and write widened values into the result FlatVector<TargetT>.
2. `readStructNullsColumns` (null-tracking pre-pass, only runs when
`hasNestedStructs(childTypes)` is true): Add
`tryStructNullsSkipWidened` which skips `numValues * sizeof(SourceT)`
bytes. Critical: without this, the pre-pass threw before the main
pass ever ran on queries whose row schema includes nested ROW types
(real-world complex JOIN / aggregate plans hit this constantly).
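The two passes above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: `ByteCursor`, `readWideningValuesSketch`, and `skipNarrowValuesSketch` are hypothetical names, a plain byte vector stands in for the serde input stream, and the size and null header handling is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the serialized input: a byte buffer plus a read offset.
struct ByteCursor {
  explicit ByteCursor(const std::vector<uint8_t>& b) : bytes(b) {}
  const std::vector<uint8_t>& bytes;
  size_t offset = 0;
};

// Main pass: read `numValues` narrow SourceT values from the cursor and
// return them converted to the wider TargetT.
template <typename SourceT, typename TargetT, typename Converter>
std::vector<TargetT> readWideningValuesSketch(
    ByteCursor& in, size_t numValues, Converter convert) {
  std::vector<TargetT> out;
  out.reserve(numValues);
  for (size_t i = 0; i < numValues; ++i) {
    assert(in.offset + sizeof(SourceT) <= in.bytes.size());
    SourceT narrow;
    std::memcpy(&narrow, in.bytes.data() + in.offset, sizeof(SourceT));
    in.offset += sizeof(SourceT);
    out.push_back(convert(narrow));
  }
  return out;
}

// Pre-pass: nothing is decoded, the cursor only steps over the narrow payload.
template <typename SourceT>
void skipNarrowValuesSketch(ByteCursor& in, size_t numValues) {
  assert(in.offset + numValues * sizeof(SourceT) <= in.bytes.size());
  in.offset += numValues * sizeof(SourceT);
}
```

The point of the sketch is that both passes must advance by `sizeof(SourceT)` per value, not `sizeof(TargetT)`, so the pre-pass and the main pass stay aligned on the same narrow payload.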
Supported widening pairs (producer encoding -> consumer kind):
- BYTE_ARRAY (TINYINT) -> SMALLINT, INTEGER, BIGINT
- SHORT_ARRAY (SMALLINT) -> INTEGER, BIGINT
- INT_ARRAY (INTEGER) -> BIGINT
- INT_ARRAY (REAL) -> DOUBLE
- INT_ARRAY (DATE) -> TIMESTAMP (days * kSecondsInDay)
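As a rough illustration of the list above, the pairs can be expressed as a lookup keyed on the wire encoding name and the consumer's type kind. The `Kind` enum, `isListedWideningSketch`, and `dateToEpochSecondsSketch` are hypothetical names for this sketch, and the value 86400 for seconds per day is an assumption about `kSecondsInDay`; no time zone adjustment is modeled.

```cpp
#include <cstdint>
#include <string>

// Hypothetical subset of type kinds, just enough to express the listed pairs.
enum class Kind { kTinyint, kSmallint, kInteger, kBigint, kDouble, kTimestamp };

// Returns true when (producer encoding, consumer kind) is one of the listed pairs.
bool isListedWideningSketch(const std::string& encoding, Kind target) {
  if (encoding == "BYTE_ARRAY") {
    return target == Kind::kSmallint || target == Kind::kInteger ||
        target == Kind::kBigint;
  }
  if (encoding == "SHORT_ARRAY") {
    return target == Kind::kInteger || target == Kind::kBigint;
  }
  if (encoding == "INT_ARRAY") {
    // INT_ARRAY carries INTEGER, REAL, and DATE, so the target kind alone
    // selects which conversion applies.
    return target == Kind::kBigint || target == Kind::kDouble ||
        target == Kind::kTimestamp;
  }
  return false;
}

// DATE -> TIMESTAMP as listed: days since epoch times seconds per day.
// The multiplication is done in 64 bits so large day counts do not overflow.
int64_t dateToEpochSecondsSketch(int32_t days) {
  constexpr int64_t kSecondsInDaySketch = 86400; // assumed value
  return static_cast<int64_t>(days) * kSecondsInDaySketch;
}
```

A narrowing request such as a LONG_ARRAY payload read as INTEGER falls through to `false`, which corresponds to the case that must still throw.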
Ordering matters: the widening check is inserted *before* the existing
`tryReadNullColumn` check in `readColumns`. `tryReadNullColumn` consumes
bytes from the stream as it tries to interpret the column as
UNKNOWN-all-null; if it returns false the stream is past where the
widening reader expects to start. Widening goes first; tryReadNullColumn
keeps its original spot.
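The ordering constraint can be illustrated with a toy stream. This sketch assumes the widening check can decide from information that is already parsed and does not advance the stream when it declines; `Stream`, `consumingProbeSketch`, and `peekingCheckSketch` are hypothetical and only model the "consumes bytes even on failure" behavior described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stream: a byte buffer and a read position.
struct Stream {
  std::vector<uint8_t> bytes;
  size_t pos = 0;
};

// Models a check that reads from the stream while deciding. Even when it
// returns false, the read position has already moved.
bool consumingProbeSketch(Stream& s, uint8_t expectedTag) {
  assert(s.pos < s.bytes.size());
  uint8_t tag = s.bytes[s.pos++];
  return tag == expectedTag;
}

// Models a check that inspects the next byte without moving the position,
// so a later reader still starts where it expects to.
bool peekingCheckSketch(const Stream& s, uint8_t expectedTag) {
  assert(s.pos < s.bytes.size());
  return s.bytes[s.pos] == expectedTag;
}
```

If the consuming probe runs first and fails, any reader that follows starts one byte late. Running the non-consuming check first avoids that without changing where the consuming probe sits relative to the rest of the code.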
Tests
-----
velox_serializer_test_PrestoSerializerTest adds eight new tests covering
each widening pair plus nulls, the nested-struct two-pass path, and a
negative test:
- wideningCoercionTinyintToBigint
- wideningCoercionSmallintToInteger
- wideningCoercionIntegerToBigint
- wideningCoercionRealToDouble
- wideningCoercionDateToTimestamp
- wideningCoercionWithNulls
- wideningCoercionWithNestedStructInRow (exercises the pre-pass)
- wideningCoercionUnsupportedPairStillThrows (BIGINT -> INTEGER is
narrowing, must still throw)
All pass under all 6 compression-kind test parameterizations (48 total
test runs).
xin-zhang2
reviewed
Jun 24, 2026
source, columnType, resultOffset, incomingNulls, numIncomingNulls, pool, result,
[](int32_t days) {
  return Timestamp(
      static_cast<int64_t>(days) * Timestamp::kSecondsInDay, 0);
Member
Should the time zone be considered here?
Collaborator
Author
Good catch! I have fixed it. The DATE→TIMESTAMP converter now calls Timestamp::fromMillis(days * 86'400'000) and then timestamp.toGMT(*zone) if opts.sessionTimezone is non-null, which is exactly what CastExpr::castFromDate does.
xin-zhang2
reviewed
Jun 24, 2026
    return false;
  }
}
if (encoding == kIntArray) {
Member
Could this introduce a risk that a type/encoding mismatch that previously threw an error in checkTypeEncoding is now silently processed incorrectly?
For example, the serialized data could be REAL, which is encoded as kIntArray, while columnType->kind() is BIGINT.
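To make the concern concrete, here is a small sketch of what happens if a REAL payload is widened as though it were INTEGER. It is illustrative only: `widenAsIntegerSketch` is a hypothetical helper, and it assumes `float` is a 32-bit IEEE-754 value.

```cpp
#include <cstdint>
#include <cstring>

// Reinterprets the 4 bytes of a REAL as a 32-bit integer and sign-extends
// to 64 bits, which is what an INTEGER -> BIGINT widening would do if it
// were applied to REAL data by mistake.
int64_t widenAsIntegerSketch(float real) {
  static_assert(sizeof(float) == sizeof(int32_t), "float must be 32 bits");
  int32_t bits;
  std::memcpy(&bits, &real, sizeof(bits));
  return static_cast<int64_t>(bits);
}
```

With IEEE-754 floats, `widenAsIntegerSketch(1.0f)` returns 1065353216 (0x3F800000), a plausible-looking BIGINT that carries no error signal, so the encoding alone cannot tell the two producer types apart.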
xin-zhang2
reviewed
Jun 24, 2026
xin-zhang2
left a comment
Member
@yingsu00 I left a few comments. Please take a look. Thanks.