feat(dwio): Add DATE to TIMESTAMP widening coercion in DWIO readers #2170

Open
yingsu00 wants to merge 1 commit into IBM:bolt from yingsu00:DateToTimestamp

@yingsu00

Collaborator

Reading a DATE column as TIMESTAMP previously failed at schema-compatibility validation because DateType inherits from IntegerType (kind() == INTEGER), and integer→timestamp is not a numeric widening. DWIO readers now accept the coercion and emit Timestamp (days × 86400 seconds, 0 nanos) at read time.

The widening is wired in three places. TypeUtils.cpp short-circuits isCompatible() when the file type is DATE and the requested kind is TIMESTAMP. SelectiveColumnReader gains convertDateToTimestampValues(), dispatched from getIntValues() when the requested kind is TIMESTAMP and the file type is DATE — it follows the same sourceRows iteration pattern as upcastScalarValues but applies the days × 86400 conversion and stores Timestamp structs. ParquetReader.cpp's convertType() for ConvertedType::DATE now also accepts TIMESTAMP as a valid requested type, in addition to DATE itself.
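The conversion itself can be illustrated with a minimal, self-contained sketch. `SketchTimestamp`, `dateToTimestamp`, and `convertSelected` are hypothetical names chosen for illustration and are not the Velox types or the code in this PR; the sketch only assumes that a DATE value is a 32-bit count of days since 1970-01-01 and that the result carries whole seconds and zero nanoseconds.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a timestamp value; not Velox's Timestamp class.
struct SketchTimestamp {
  int64_t seconds;
  uint64_t nanos;
};

constexpr int64_t kSecondsPerDay = 86400;

// Converts one DATE value (days since 1970-01-01) to seconds since the epoch.
// The cast to int64_t happens before the multiply, so the product is computed
// in 64 bits and cannot overflow for any int32_t input.
inline SketchTimestamp dateToTimestamp(int32_t days) {
  return {kSecondsPerDay * static_cast<int64_t>(days), 0};
}

// Converts only the selected rows, writing results densely in selection order.
// `rows` holds indices into `days` and is assumed to be in range.
std::vector<SketchTimestamp> convertSelected(
    const std::vector<int32_t>& days,
    const std::vector<int32_t>& rows) {
  std::vector<SketchTimestamp> out;
  out.reserve(rows.size());
  for (auto row : rows) {
    out.push_back(dateToTimestamp(days[row]));
  }
  return out;
}
```

Widening before the multiplication is the important detail: multiplying in 32 bits would overflow for dates more than roughly 68 years from the epoch.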

Test plan:

  • velox/dwio/parquet/tests/reader/ParquetReaderTest.cpp:
    • readDateColumnAsTimestamp
    • readDateColumnAsTimestampWithNulls
    • readRealColumnAsDouble, readRealColumnAsDoubleWithNulls — coverage for pre-existing FLOAT→DOUBLE widening, no production change in this commit.
  • velox/dwio/dwrf/test/ReaderTest.cpp:
    • readDateColumnAsTimestamp
    • readDateColumnAsTimestampWithNulls

@yingsu00 yingsu00 requested a review from majetideepak as a code owner June 23, 2026 03:28
@yingsu00 yingsu00 added the bolt label Jun 23, 2026
@yingsu00 yingsu00 requested review from nmahadevuni and xin-zhang2 and removed request for majetideepak June 23, 2026 03:29
    VectorPtr* result) {
  VELOX_CHECK_EQ(valueSize_, sizeof(int32_t));
  VELOX_CHECK(mayGetValues_);
  mayGetValues_ = false;

Member

Why is mayGetValues_ set to false here?

Collaborator Author

@xin-zhang2 I realized there was actually a bug related to this. convertDateToTimestampValues, like getFlatValues, can be called multiple times for different rows. getFlatValues sets mayGetValues_ to false only when it handles the last batch on this values_ buffer:

  VELOX_CHECK(mayGetValues_);
  if (isFinal) {
    mayGetValues_ = false;
  }

Strictly speaking, after isFinal=true the flag has no use for the current caller: they got their result and are done. The flip is a safety net for the next caller (or a buggy continuation of the same caller): if someone tries to extract again from this buffer without first calling read() to refill it, the VELOX_CHECK(mayGetValues_) at function entry fires loudly instead of silently emitting another vector backed by the same source bytes.

Without mayGetValues_ = false, that misuse goes undetected instead of failing loudly. That is why getFlatValues sets the flag, and convertDateToTimestampValues should follow the same contract. However, setting it to false without testing isFinal would make the following batch fail directly on line 330.
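The contract described here can be sketched in isolation. `SketchReader` is a hypothetical class that models only the flag and a plain integer buffer; it is not SelectiveColumnReader, and the exception type stands in for whatever VELOX_CHECK raises.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical reader that models only the "may extract" flag.
class SketchReader {
 public:
  // Refills the buffer and re-arms extraction.
  void read(std::vector<int32_t> values) {
    values_ = std::move(values);
    mayGetValues_ = true;
  }

  // Extracts the requested rows. The flag is cleared only on the final batch,
  // so earlier batches on the same buffer keep working, while any extraction
  // after the final one throws until read() is called again.
  std::vector<int32_t> getValues(
      const std::vector<int32_t>& rows,
      bool isFinal) {
    if (!mayGetValues_) {
      throw std::logic_error("buffer already consumed; call read() first");
    }
    if (isFinal) {
      mayGetValues_ = false;
    }
    std::vector<int32_t> out;
    out.reserve(rows.size());
    for (auto row : rows) {
      out.push_back(values_[row]);
    }
    return out;
  }

 private:
  std::vector<int32_t> values_;
  bool mayGetValues_{false};
};
```

Clearing the flag unconditionally would turn the second of two legitimate batches into an error, which is the failure discussed in this thread.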

I have fixed it to follow the same routine: mayGetValues_ is set to false only when isFinal == true. More tests were added as well. I also added a small optimization: when the number of requested rows is more than half the size of the values_ buffer, the conversion runs over the whole values_ buffer. Please review again.
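One way to picture the "more than half" heuristic is the sketch below. It is an assumption about the general shape of such an optimization, not the code in this PR: `useWholeBufferPath` and `convertDays` are hypothetical names, and the gather step after the whole-buffer pass is a guess at how the requested rows are then picked out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int64_t kSecondsPerDay = 86400;

// Assumed threshold: take the whole-buffer path when more than half of the
// buffer is requested.
inline bool useWholeBufferPath(std::size_t numRows, std::size_t bufferSize) {
  return numRows * 2 > bufferSize;
}

// Returns seconds since the epoch for each requested row, in request order.
std::vector<int64_t> convertDays(
    const std::vector<int32_t>& days,
    const std::vector<int32_t>& rows) {
  std::vector<int64_t> out(rows.size());
  if (useWholeBufferPath(rows.size(), days.size())) {
    // Dense path: one sequential pass over every value, then a gather.
    std::vector<int64_t> all(days.size());
    for (std::size_t i = 0; i < days.size(); ++i) {
      all[i] = kSecondsPerDay * static_cast<int64_t>(days[i]);
    }
    for (std::size_t i = 0; i < rows.size(); ++i) {
      out[i] = all[rows[i]];
    }
  } else {
    // Sparse path: convert only the requested rows.
    for (std::size_t i = 0; i < rows.size(); ++i) {
      out[i] = kSecondsPerDay * static_cast<int64_t>(days[rows[i]]);
    }
  }
  return out;
}
```

Both paths must return identical results; the choice only trades some wasted conversions for a simpler sequential loop.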

    timestamps[i] =
        Timestamp(kSecondsPerDay * static_cast<int64_t>(rawDays[i]), 0);
  }

Member

nit: remove the blank line.

// verify that the selective reader can widen DATE to TIMESTAMP at read time,
// producing Timestamp(days * 86400, 0).

TEST_F(TestReader, readDateColumnAsTimestamp) {

Member

Both of these new tests are failing with the error below. Please check.

.../velox/common/memory/MemoryPool.cpp:367, Function:addAggregateChild, Expression:  Memory pool addAggregateChild operation is only allowed on aggregation memory pool: Memory Pool[leaf LEAF root[default_root_109] parent[default_root_109] MALLOC track-usage thread-safe]<unlimited max capacity unlimited capacity used 1.00MB available 0B reservation [used 1.00MB, reserved 1.00MB, min 0B] counters [allocs 1, frees 0, reserves 0, releases 0, collisions 0, external-allocs 0, external-frees 0, cumulative-external 0B])>, Source: RUNTIME, ErrorCode: INVALID_STATE

Collaborator Author

Thanks for checking this, @nmahadevuni. The test fixture creates a LEAF pool and then tries to call addAggregateChild on it. This is a pre-existing test-fixture bug, not a regression introduced by the DATE→TIMESTAMP work. I have updated the test from

  // Before
  dwrf::Writer writer{writerOptions, std::move(sink), *pool()};

to

  // After
  dwrf::Writer writer{writerOptions, std::move(sink), *rootPool_};

The tests pass now.
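For readers unfamiliar with the error, the leaf versus aggregate distinction can be shown with a toy model. `ToyPool` and `ToyWriter` are hypothetical and deliberately minimal; they are not Velox's MemoryPool or dwrf::Writer, and the sketch only assumes that a writer creates an aggregate child under whatever pool it is handed.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Toy model only; not Velox's memory pool classes.
enum class PoolKind { kLeaf, kAggregate };

class ToyPool {
 public:
  ToyPool(std::string name, PoolKind kind)
      : name_(std::move(name)), kind_(kind) {}

  // Only aggregate pools may have children; leaf pools only allocate.
  std::shared_ptr<ToyPool> addAggregateChild(const std::string& name) {
    requireAggregate("addAggregateChild");
    return std::make_shared<ToyPool>(name_ + "." + name, PoolKind::kAggregate);
  }

  std::shared_ptr<ToyPool> addLeafChild(const std::string& name) {
    requireAggregate("addLeafChild");
    return std::make_shared<ToyPool>(name_ + "." + name, PoolKind::kLeaf);
  }

 private:
  void requireAggregate(const char* op) const {
    if (kind_ != PoolKind::kAggregate) {
      throw std::logic_error(
          std::string(op) + " is only allowed on an aggregation pool");
    }
  }

  std::string name_;
  PoolKind kind_;
};

// Hypothetical writer that creates its own child pool under the given parent.
class ToyWriter {
 public:
  explicit ToyWriter(ToyPool& parent)
      : pool_(parent.addAggregateChild("writer")) {}

 private:
  std::shared_ptr<ToyPool> pool_;
};
```

In this model, constructing `ToyWriter` from a leaf pool throws, while constructing it from the root pool succeeds, which mirrors the before and after lines in the fix.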

Collaborator Author

By the way, the failures had a second cause, and I sent a separate PR for it. Could you please review #2184?

  auto* timestamps = timestampValues->asMutable<Timestamp>();
  for (auto i = 0; i < rows.size(); ++i) {
    timestamps[i] =
        Timestamp(kSecondsPerDay * static_cast<int64_t>(rawDays[i]), 0);

Member

Should we read back from the compactDays buffer instead of rawDays? Is this read path used when filtering happens? We need to add a test for this path too.

Collaborator Author

@nmahadevuni These two variables are byte-identical aliases, so there was no correctness issue. But I agree with you that it's more accurate to use compactDays here. I have updated the code to use compactDays instead of rawDays.
