refactor(tests): Make exec tests connector-agnostic#567
Conversation
4ddcd0e to
91de500
Compare
Two CI failure modes the existing pre-commit hook can't handle: 1. Missing proto-generated headers (e.g. dwrf_schema.pb.h) The 'Generate Compile Database' step's _compile_db target is supposed to build bolt_dwio_dwrf_proto and generate_parquet_thrift, but if that chain silently skips (preset-name mismatch, conan path divergence) clang-tidy later fails with a confusing "file not found." Add an explicit codegen + verification step between compile-DB generation and pre-commit. No-op rebuild if already built; otherwise the failure now names the missing file. 2. Orphan sources not in compile_commands.json Mechanical refactor sweeps (include-path / namespace renames) can touch files whose CMake targets are commented out (concrete case: bolt/examples/ScanAndSort.cpp is # add_executable() in bolt/examples/CMakeLists.txt). Without a compile entry, clang-tidy falls back to default flags (no -I paths) and emits spurious "file not found" errors for any project-internal include. Make run-clang-tidy.py drop files not in compile_commands.json from the lint set with a notice listing what was skipped.
Without this, the build produces error: /__w/bolt/bolt/bolt/benchmarks/QueryBenchmarkBase.cpp:322:19: error: invalid use of incomplete type ‘class bytedance::bolt::cache::SsdCache’ 322 | ssdCache->clear();
… to OperatorTestBase Move writeToFile (both overloads), makeVectors, and makeFilePaths from HiveConnectorTestBase to OperatorTestBase. These helpers don't carry any Hive-specific state -- writeToFile uses the DWRF writer (which any test needing a temp DWRF file will reach for), makeVectors only uses BatchMaker and pool_, and makeFilePaths just wraps TempFilePath::create(). Their home should be the generic exec fixture, not the connector-flavored one. Required so the upcoming connector-agnostic ConnectorTestBase in connectors/tests/utils does not have to pull these forward as part of the connector test scaffolding. Drops the now-unused BatchMaker/dwrf::Writer/DwrfReader includes from HiveConnectorTestBase.cpp; bolt_exec_test_lib already links the necessary dwio targets, so OperatorTestBase.cpp's new includes resolve without any CMake change.
…ase to OperatorTestBase Move writeToFile from AggregationTestBase to OperatorTestBase to avoid ambibuity.
Add process-local registration helpers for Hive and TPC-H connector factories so tests can bootstrap connectors by name without relying on static initialization side effects. Extend Hive object factory deserialization for row index columns, explicit data columns, remaining filters, and insert handles used by the generic connector test path.
…tBase
Add bolt/connectors/tests/utils/ConnectorTestBase.{h,cpp,CMakeLists.txt}:
- bytedance::bolt::connector::test::ConnectorTestBase is a GTest fixture
that inherits only ::testing::Test and bolt::test::VectorTestBase. It has
no dependency on exec::test::OperatorTestBase, so connector tests no
longer need to reach up into the exec test scaffolding to get a fixture.
- The constructor takes (connectorName, connectorId, factoryRegistrar).
SetUp() invokes the registrar if the named factory isn't already
registered, then registers an object factory + Connector under
connectorId. No hardcoded "if hive ... else if tpch ..." switch.
- Connector-agnostic helpers (makeConnectorSplit, makeConnectorSplits,
makeColumnHandle, makeTableHandle) go through ConnectorObjectFactory.
- The registration logic is factored into free functions
registerTestConnector / unregisterTestConnector so the legacy
exec::test::HiveConnectorTestBase can reuse it without inheriting the
new fixture (it still inherits OperatorTestBase, intentionally, so
existing exec tests keep compiling unchanged until commit 3 migrates
them).
The new library bolt_connector_test_lib links only bolt_connector +
bolt_vector_test_lib + a few base utilities; it does NOT link
bolt_exec_test_lib or any specific connector implementation (hive/tpch).
bolt_exec_test_lib now links bolt_connector_test_lib so the legacy
HiveConnectorTestBase can delegate its SetUp/TearDown to the shared
helpers.
This is the first step of the larger refactor: subsequent commits will
migrate exec/dwio/common tests off the legacy HiveConnectorTestBase to
use this new fixture directly, then delete the legacy file once all
callsites are migrated.
…edLoopJoinTest migration
Add the parameterization infrastructure that the exec test migration will use:
- connector::test::ConnectorTestParam (in connectors/tests/utils/ConnectorTestBase.h):
struct bundling { connectorName, connectorId, factoryRegistrar } so tests
can be parameterized over connectors via ::testing::WithParamInterface.
- Free-function variants of makeConnectorSplit / makeConnectorSplits /
makeColumnHandle / makeTableHandle taking connectorName as first arg.
Tests not inheriting connector::test::ConnectorTestBase (i.e. exec tests
that inherit OperatorTestBase + WithParamInterface) call these directly.
Two pilot migrations to establish the pattern for the rest of commit 3a:
- exec/tests/NestedLoopJoinTest.cpp: doesn't use any connector functionality
(no makeHiveConnectorSplit, no addSplit, no tableScan). Drop the
HiveConnectorTestBase parent and inherit OperatorTestBase directly. No
parameterization needed -- the test isn't actually connector-related.
- exec/tests/LimitTest.cpp: uses one makeHiveConnectorSplit call inside
limitOverLocalExchange. Migrate to OperatorTestBase + WithParamInterface
<ConnectorTestParam>, SetUp/TearDown delegate to registerTestConnector /
unregisterTestConnector, TEST_F -> TEST_P, one INSTANTIATE_TEST_SUITE_P
block at the bottom listing Hive only for now.
The remaining 11 read-side tests + 3 write-side tests follow in subsequent
commits.
…il, rename namespace
Move bolt/exec/tests/utils/Temp{File,Directory}Path.{h,cpp} to
bolt/common/testutil/, and rename the namespace from
bytedance::bolt::exec::test to bytedance::bolt::test (matching the existing
neutral test-utility namespace used by VectorMaker/VectorTestBase/BatchMaker).
These two classes are pure file-system test scaffolding -- they don't carry
any exec-engine semantics -- yet their location under bolt/exec/tests/utils/
and their bytedance::bolt::exec::test namespace forced every test that
needed a temp path (including connector tests, dwio tests, common/file
tests, and substrait tests) to reach up into exec test scaffolding to get
them. After this move, connectors/tests/utils can use TempFilePath without
re-introducing a layering inversion into exec.
The bolt_temp_path CMake target moves from bolt/exec/tests/utils/CMakeLists.txt
to bolt/common/testutil/CMakeLists.txt; its sources and one dependency
(bolt_exception) are unchanged, so every existing caller's link line still
resolves bolt_temp_path correctly.
Mechanical sweep across ~119 files:
- #include "bolt/exec/tests/utils/Temp{File,Directory}Path.h"
-> #include "bolt/common/testutil/Temp{File,Directory}Path.h"
- Qualified references: bytedance::bolt::exec::test::Temp{File,Directory}Path
-> bytedance::bolt::test::Temp{File,Directory}Path (and the shorter
exec::test::Temp... variants similarly).
- Files that referenced TempFilePath unqualified via
`using namespace bytedance::bolt::exec::test;` get an additional
`using namespace bytedance::bolt::test;` directive injected after the
exec::test one, so unqualified uses keep resolving.
…deToStringTest off Hive includes
Establish the connector-agnostic parameterization pattern for exec tests:
- bolt/connectors/tests/utils/ConnectorTestBase.{h,cpp}: add paramFor() and
paramsFor() free helpers in connector::test. paramFor(name) returns
ConnectorTestParam{name, "test-" + name, /*factoryRegistrar=*/nullptr}.
The factory is expected to already be registered with the runtime (via
the connector library's static-init self-registration when bolt_<name>_
connector is linked, or via an explicit register<Name>ConnectorFactories()
call somewhere ahead of INSTANTIATE_TEST_SUITE_P). With this, exec test
source files can stay connector-agnostic: they reference connector names
as strings (kHiveConnectorName etc.) and never include connector-specific
headers.
- bolt/connectors/tests/utils/ConnectorTestBase.cpp: also adds a
TempFilePath-vector overload of makeConnectorSplits so tests can build
splits preserving $file_size / $file_modified_time info columns without
inlining a loop. Link bolt_temp_path from bolt_connector_test_utils.
- exec/tests/LimitTest.cpp: replace the inline ConnectorTestParam{...,
&connector::hive::registerHiveConnectorFactories} construction with
::testing::ValuesIn(paramsFor({kHiveConnectorName})), and drop the
bolt/connectors/hive/HiveConnector.h include. (This corrects the wart
introduced in the LimitTest pilot in commit 9c63d25.)
- exec/tests/PrintPlanWithStatsTest.cpp: parameterize over ConnectorTestParam
(Hive only for now via paramsFor({kHiveConnectorName})), TEST_F -> TEST_P,
SetUp/TearDown delegate to register/unregisterTestConnector,
makeHiveConnectorSplits(filePaths) calls become
connector::test::makeConnectorSplits(GetParam().connectorName, ...).
No bolt/connectors/hive includes.
- exec/tests/PlanNodeToStringTest.cpp: the partitionedOutput case had a
HivePartitionFunctionSpec sub-assertion mixed in with generic ones.
Split: connector-agnostic plan-node-to-string coverage stays here; the
Hive-specific sub-assertion moves to a new file
bolt/connectors/hive/tests/HivePartitionFunctionPlanNodeToStringTest.cpp.
PlanNodeToStringTest.cpp loses its HiveConnector.h / HiveConnectorSplit.h
includes. Hive-side CMake target gains the new test file plus
bolt_exec_test_lib / bolt_hive_connector / bolt_vector_test_lib link deps.
…t/MorselDrivenTest to parameterized connector tests
Same mechanical pattern as 774e270:
- Drop bolt/exec/tests/utils/HiveConnectorTestBase.h; include
bolt/connectors/ConnectorNames.h, bolt/connectors/tests/utils/
ConnectorTestBase.h, and bolt/exec/tests/utils/OperatorTestBase.h.
- Change base from HiveConnectorTestBase to OperatorTestBase +
::testing::WithParamInterface<connector::test::ConnectorTestParam>.
Tests that already override SetUp/TearDown have the connector
register/unregister calls woven into their existing overrides.
- TEST_F/DEBUG_ONLY_TEST_F -> TEST_P/DEBUG_ONLY_TEST_P.
- makeHiveConnectorSplit(path) -> connector::test::makeConnectorSplit(
GetParam().connectorName, path).
- INSTANTIATE_TEST_SUITE_P at the bottom using connector::test::paramsFor(
{kHiveConnectorName}) -- Hive only for now, easy to extend to Iceberg etc.
TaskTest::executeSerial loses its `static` qualifier so it can reference
GetParam() for the split-construction loop. All its callers are inside
TEST_P bodies and call it as an instance method already, so the static
drop is invisible at call sites.
No bolt/connectors/hive/* includes anywhere in the migrated files.
…/AssertQueryBuilderTest Same pattern as 060ad73 (TEST_F -> TEST_P, OperatorTestBase + WithParamInterface, SetUp/TearDown delegate to register/unregisterTestConnector, makeHiveConnectorSplit -> connector::test::makeConnectorSplit, INSTANTIATE_ TEST_SUITE_P at the bottom). All three lose their bolt/connectors/hive includes. - GroupedExecutionTest: was inheriting `public virtual HiveConnectorTestBase`; the virtual is preserved on the new base (`public virtual OperatorTestBase`) to keep the existing call site shape. SetUpTestCase now chains to OperatorTestBase::SetUpTestCase directly. makeVectors() now delegates to OperatorTestBase::makeVectors (which is where it lives after 888e702). - LocalPartitionTest: straightforward parameterization. 5 makeHiveConnectorSplit call sites updated to connector::test::makeConnectorSplit(GetParam(). connectorName, ...). - AssertQueryBuilderTest: the hiveSplits TEST_F mixed generic and Hive-specific (HiveConnectorSplitBuilder + partition-keyed splits + partitionKey/regularColumn helpers) assertions in one body. Extracted the whole hiveSplits test into a new file under connectors/hive/tests/ (HiveAssertQueryBuilderTest.cpp), which inherits the legacy HiveConnectorTestBase -- transitional, since that base goes away in commit 7 of this series. Replaced the body in AssertQueryBuilderTest with a one-line comment pointing at the new file (mirroring the PlanNodeToStringTest split done in 774e270).
Commit bbcafbc removed AggregationTestBase::makeVectors(rowType, size, numVectors) and relied on the inherited OperatorTestBase::makeVectors(rowType, numVectors, rowsPerVector) instead — but the 2nd and 3rd parameters are swapped.
91cb707 to
e58e672
Compare
ZacBlanco
left a comment
There was a problem hiding this comment.
Thanks for this great refactor @yingsu00 ! Decoupling the hive connector from the test infrastructure is a big step forward for the project supporting more data connectors.
I have a few minor comments, but otherwise I am happy with this change.
Below are also a few items pointed out by an AI reviewer that seem like reasonable asks to me
Tests that migrate from HiveConnectorTestBase to OperatorTestBase + WithParamInterface<ConnectorTestParam> implement their own SetUp()/TearDown() but do not drain ioExecutor_ before calling unregisterTestConnector. Compare:
HiveConnectorTestBase::TearDown()(line 58-64): drainsioExecutor_.reset()before unregistering — correct.ConnectorTestBase::TearDown()(line 258-263): drainsioExecutor_.reset()before unregistering — correct.AssertQueryBuilderTest::TearDown()(in migrated test): does NOT drainioExecutor_beforeunregisterTestConnector.
This pattern is repeated in:
AssertQueryBuilderTest::TearDown()GroupedExecutionTest::TearDown()TaskTest::TearDown()MergeJoinTest::TearDown()(viaMergeJoinTestBase)LocalPartitionTest::TearDown()
8.3 HiveObjectFactory::makeTableHandle — dataColumns Deserialization
File: bolt/connectors/hive/HiveObjectFactory.cpp:216-230
The old code always built dataColumns from columnHandles. The new code first checks for dataColumns in options and deserializes it:
if (const auto* dc = options.get_ptr("dataColumns")) {
dataColumns = std::dynamic_pointer_cast<const RowType>(
ISerializable::deserialize<Type>(*dc, nullptr));
}Deserializing with nullptr as the memory pool parameter may create allocations in a default/global pool. If dataColumns is used in a context where a specific pool is required, this could be an issue. The remainingFilter deserialization above it uses memory::MemoryManager::getInstance()->tracePool() — inconsistency worth noting.
| - name: Ensure generated headers exist | ||
| run: | | ||
| # _compile_db (invoked by compile_db_all) is supposed to build these, | ||
| # but make it explicit so clang-tidy in the next step never races | ||
| # against a missing dwrf_schema.pb.h / parquet thrift header. If the | ||
| # underlying chain already produced them, this is a no-op rebuild. | ||
| cmake --build --preset conan-release \ | ||
| --target generate_parquet_thrift \ | ||
| --target bolt_dwio_dwrf_proto | ||
| test -f _build/Release/bolt/dwio/dwrf/proto/dwrf_schema.pb.h \ | ||
| || { echo "::error::dwrf_schema.pb.h missing after proto build" >&2; \ | ||
| ls -la _build/Release/bolt/dwio/dwrf/proto/ || true; exit 1; } |
There was a problem hiding this comment.
This shouldn't be necessary as it's already taken care of in the makefile
| for (uint32_t i = 0; i < splitCount; ++i) { | ||
| splits.emplace_back(factory->makeConnectorSplit( | ||
| effectivePath, | ||
| i * splitSize, | ||
| splitSize, | ||
| connector::makeOptions({{"fileFormat", static_cast<int>(format)}}))); |
There was a problem hiding this comment.
This seems to be a copy of logic from an earlier implementation. Can we combine the implements to prevent duplication?
|
|
||
| auto* pool = static_cast<memory::MemoryPool*>(context); | ||
| if (pool == nullptr) { | ||
| pool = memory::MemoryManager::getInstance()->tracePool(); |
There was a problem hiding this comment.
Can you explain why this change is needed?
| file=sys.stderr, | ||
| ) | ||
|
|
||
| if not files_to_process: |
There was a problem hiding this comment.
My understanding of the issue is that this stems from headers not being in the compile-commands.json file. I have run into similar issues before. Maybe we can just simply filter out files with header extensions instead? e.g. .h/.hpp/etc
What problem does this PR solve?
Decouple exec tests from the Hive-specific HiveConnectorTestBase so they can
run against any connector via parameterization, and fix the layering so
connector test scaffolding no longer depends on exec.
Issue Number: #107
Type of Change
Description
makeFilePaths had no Hive specifics; they belong on the generic exec fixture,
not the connector one.
register{Hive,Tpch}ConnectorFactories() so tests can bootstrap connectors by
name instead of relying on static-init side effects.
lib bolt_connector_test_lib) — a connector-agnostic fixture that inherits
only ::testing::Test + VectorTestBase. No dependency on bolt_exec_test_lib or
any specific connector. Legacy HiveConnectorTestBase becomes a thin wrapper
that reuses its registration helpers.
exec tests select connectors by name string, never including
connector-specific headers.
bytedance::bolt::test) — they're generic FS test utilities; keeping them
under exec/tests/utils/ forced a layering inversion. (~119-file mechanical
sweep.)
WithParamInterface (TEST_F→TEST_P,
makeHiveConnectorSplit→connector::test::makeConnectorSplit, Hive-only
sub-cases extracted to bolt/connectors/hive/tests/).
TODO
HashJoinTest/TableScanTest (heaviest), write-side tests, PlanBuilder,
dwio/common/tool migrations, and removal of the legacy HiveConnectorTestBase
— these are follow-up commits in the same series.
Notes
pre-existing unrelated LLVM/JIT compile break in bolt/jit/ThrustJITv2.cpp).
Performance Impact
Release Note
Release Note:
N/A
Checklist (For Author)
Breaking Changes
No
Yes (Description: ...)
Click to view Breaking Changes