feat(dwio): Add DATE to TIMESTAMP widening coercion in DWIO readers#2170
feat(dwio): Add DATE to TIMESTAMP widening coercion in DWIO readers#2170yingsu00 wants to merge 1 commit into
Conversation
| VectorPtr* result) { | ||
| VELOX_CHECK_EQ(valueSize_, sizeof(int32_t)); | ||
| VELOX_CHECK(mayGetValues_); | ||
| mayGetValues_ = false; |
There was a problem hiding this comment.
why is mayGetValues_ set to false here?
There was a problem hiding this comment.
@xin-zhang2 I actually realized there was a bug related to this. convertDateToTimestampValues, like getFlatValues, can be called multiple times for different rows. In getFlatValues, mayGetValues_ was set as false if it's the last batch on this values_ buffer:
VELOX_CHECK(mayGetValues_);
if (isFinal) {
mayGetValues_ = false;
}
Strictly speaking, after isFinal=true the flag has no use for the current caller: they got their result and are done. The flip is a safety net for the next caller (or a buggy continuation of the same caller): if someone tries to extract again from this buffer without first calling read() to refill it, the VELOX_CHECK(mayGetValues_) at function entry fires loudly instead of silently emitting another vector backed by the same source bytes.
Without the mayGetValues_ = false;, the failure mode is silent rather than detected. That's why mayGetValues_ was set to false in getFlatValues. Here we want to follow the same contract and set it to false in convertDateToTimestampValues. However, setting it to false without testing isFinal would make the following batch fail directly on line 330.
I have fixed it to follow the same routine that only when isFinal == true, that we set it to false. More tests were added as well. In addition, I added a slight improvement that we would do the conversion for the whole values_ buffer if the requested rows is more than half of the size of the values_ buffer. Please review again.
| timestamps[i] = | ||
| Timestamp(kSecondsPerDay * static_cast<int64_t>(rawDays[i]), 0); | ||
| } | ||
|
|
| // verify that the selective reader can widen DATE to TIMESTAMP at read time, | ||
| // producing Timestamp(days * 86400, 0). | ||
|
|
||
| TEST_F(TestReader, readDateColumnAsTimestamp) { |
There was a problem hiding this comment.
Both these new tests are failing with below error. Please check.
.../velox/common/memory/MemoryPool.cpp:367, Function:addAggregateChild, Expression: Memory pool addAggregateChild operation is only allowed on aggregation memory pool: Memory Pool[leaf LEAF root[default_root_109] parent[default_root_109] MALLOC track-usage thread-safe]<unlimited max capacity unlimited capacity used 1.00MB available 0B reservation [used 1.00MB, reserved 1.00MB, min 0B] counters [allocs 1, frees 0, reserves 0, releases 0, collisions 0, external-allocs 0, external-frees 0, cumulative-external 0B])>, Source: RUNTIME, ErrorCode: INVALID_STATE
There was a problem hiding this comment.
Thanks for checking this @nmahadevuni . The test fixture creates a LEAF pool and then tries to addAggregateChild on it. This is a pre-existing test-fixture bug, not a regression introduced by the DATE→TIMESTAMP work. I have updated the test from
// Before
dwrf::Writer writer{writerOptions, std::move(sink), *pool()};
to
// After
dwrf::Writer writer{writerOptions, std::move(sink), *rootPool_};
The tests pass now.
There was a problem hiding this comment.
btw. There was also another cause of the failures. I sent another PR for it. Could you please review #2184?
| auto* timestamps = timestampValues->asMutable<Timestamp>(); | ||
| for (auto i = 0; i < rows.size(); ++i) { | ||
| timestamps[i] = | ||
| Timestamp(kSecondsPerDay * static_cast<int64_t>(rawDays[i]), 0); |
There was a problem hiding this comment.
We should read back from compactDays instead of rawDays buffer? This read path is used when filtering happens? We need to add a test for this path too.
There was a problem hiding this comment.
@nmahadevuni These two variables are byte-identical aliases, so there was no correctness issue. But I agree with you that it's more accurate to use compactDays here. I have updated the code to use compactDays instead of rawDays.
Reading a DATE column as TIMESTAMP previously failed at schema-compatibility validation because DateType inherits from IntegerType (kind() == INTEGER), and integer→timestamp is not a numeric widening. DWIO readers now accept the coercion and emit Timestamp (days × 86400 seconds, 0 nanos) at read time. The widening is wired in three places. TypeUtils.cpp short-circuits isCompatible() when the file type is DATE and the requested kind is TIMESTAMP. SelectiveColumnReader gains convertDateToTimestampValues(), dispatched from getIntValues() when the requested kind is TIMESTAMP and the file type is DATE — it follows the same sourceRows iteration pattern as upcastScalarValues but applies the days × 86400 conversion and stores Timestamp structs. ParquetReader.cpp's convertType() for ConvertedType::DATE now also accepts TIMESTAMP as a valid requested type, in addition to DATE itself. Test plan: - velox/dwio/parquet/tests/reader/ParquetReaderTest.cpp: - readDateColumnAsTimestamp - readDateColumnAsTimestampWithNulls - readRealColumnAsDouble, readRealColumnAsDoubleWithNulls — coverage for pre-existing FLOAT→DOUBLE widening, no production change in this commit. - velox/dwio/dwrf/test/ReaderTest.cpp: - readDateColumnAsTimestamp - readDateColumnAsTimestampWithNulls
da93d84 to
623a6b1
Compare
Reading a DATE column as TIMESTAMP previously failed at schema-compatibility validation because DateType inherits from IntegerType (kind() == INTEGER), and integer→timestamp is not a numeric widening. DWIO readers now accept the coercion and emit Timestamp (days × 86400 seconds, 0 nanos) at read time.
The widening is wired in three places. TypeUtils.cpp short-circuits isCompatible() when the file type is DATE and the requested kind is TIMESTAMP. SelectiveColumnReader gains convertDateToTimestampValues(), dispatched from getIntValues() when the requested kind is TIMESTAMP and the file type is DATE — it follows the same sourceRows iteration pattern as upcastScalarValues but applies the days × 86400 conversion and stores Timestamp structs. ParquetReader.cpp's convertType() for ConvertedType::DATE now also accepts TIMESTAMP as a valid requested type, in addition to DATE itself.
Test plan: