Skip to content

refactor(tests): Make exec tests connector-agnostic#567

Open
yingsu00 wants to merge 15 commits into
bytedance:mainfrom
yingsu00:connector_refactor_5
Open

refactor(tests): Make exec tests connector-agnostic#567
yingsu00 wants to merge 15 commits into
bytedance:mainfrom
yingsu00:connector_refactor_5

Conversation

@yingsu00

@yingsu00 yingsu00 commented May 18, 2026

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Decouple exec tests from the Hive-specific HiveConnectorTestBase so they can
run against any connector via parameterization, and fix the layering so
connector test scaffolding no longer depends on exec.

Issue Number: #107

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

  • Move generic helpers to OperatorTestBase — writeToFile, makeVectors,
    makeFilePaths had no Hive specifics; they belong on the generic exec fixture,
    not the connector one.
    • Explicit connector factory registration — add
      register{Hive,Tpch}ConnectorFactories() so tests can bootstrap connectors by
      name instead of relying on static-init side effects.
    • Introduce connector::test::ConnectorTestBase (bolt/connectors/tests/utils/,
      lib bolt_connector_test_lib) — a connector-agnostic fixture that inherits
      only ::testing::Test + VectorTestBase. No dependency on bolt_exec_test_lib or
      any specific connector. Legacy HiveConnectorTestBase becomes a thin wrapper
      that reuses its registration helpers.
    • Add ConnectorTestParam + paramFor/paramsFor — parameterization plumbing so
      exec tests select connectors by name string, never including
      connector-specific headers.
    • Relocate TempFilePath/TempDirectoryPath to bolt/common/testutil/ (namespace
      bytedance::bolt::test) — they're generic FS test utilities; keeping them
      under exec/tests/utils/ forced a layering inversion. (~119-file mechanical
      sweep.)
    • Migrate 12 read-side exec tests to OperatorTestBase +
      WithParamInterface (TEST_F→TEST_P,
      makeHiveConnectorSplit→connector::test::makeConnectorSplit, Hive-only
      sub-cases extracted to bolt/connectors/hive/tests/).

TODO

HashJoinTest/TableScanTest (heaviest), write-side tests, PlanBuilder,
dwio/common/tool migrations, and removal of the legacy HiveConnectorTestBase
— these are follow-up commits in the same series.

Notes

  • No exec test source includes bolt/connectors/hive/* headers anymore.
  • Each commit builds independently (CI hooks bypassed locally only due to a
    pre-existing unrelated LLVM/JIT compile break in bolt/jit/ThrustJITv2.cpp).

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).
  • Positive Impact: I have run benchmarks.
  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Release Note:
N/A

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)

    Click to view Breaking Changes
    Breaking Changes:
    - Description of the breaking change.
    - Possible solutions or workarounds.
    - Any other relevant information.
    

@yingsu00 yingsu00 force-pushed the connector_refactor_5 branch 27 times, most recently from 4ddcd0e to 91de500 Compare May 25, 2026 04:32
yingsu00 added 3 commits May 26, 2026 20:01
Two CI failure modes the existing pre-commit hook can't handle:

1. Missing proto-generated headers (e.g. dwrf_schema.pb.h)
   The 'Generate Compile Database' step's _compile_db target is
   supposed to build bolt_dwio_dwrf_proto and generate_parquet_thrift,
   but if that chain silently skips (preset-name mismatch, conan path
   divergence) clang-tidy later fails with a confusing "file not
   found." Add an explicit codegen + verification step between
   compile-DB generation and pre-commit. No-op rebuild if already
   built; otherwise the failure now names the missing file.

2. Orphan sources not in compile_commands.json
   Mechanical refactor sweeps (include-path / namespace renames) can
   touch files whose CMake targets are commented out (concrete case:
   bolt/examples/ScanAndSort.cpp is # add_executable() in
   bolt/examples/CMakeLists.txt). Without a compile entry, clang-tidy
   falls back to default flags (no -I paths) and emits spurious "file
   not found" errors for any project-internal include. Make
   run-clang-tidy.py drop files not in compile_commands.json from the
   lint set with a notice listing what was skipped.
Without this, the build produces error:
/__w/bolt/bolt/bolt/benchmarks/QueryBenchmarkBase.cpp:322:19: error:
invalid use of incomplete type ‘class bytedance::bolt::cache::SsdCache’
322 |           ssdCache->clear();
… to OperatorTestBase

Move writeToFile (both overloads), makeVectors, and makeFilePaths from
HiveConnectorTestBase to OperatorTestBase. These helpers don't carry any
Hive-specific state -- writeToFile uses the DWRF writer (which any test
needing a temp DWRF file will reach for), makeVectors only uses BatchMaker
and pool_, and makeFilePaths just wraps TempFilePath::create(). Their home
should be the generic exec fixture, not the connector-flavored one.

Required so the upcoming connector-agnostic ConnectorTestBase in
connectors/tests/utils does not have to pull these forward as part of
the connector test scaffolding.

Drops the now-unused BatchMaker/dwrf::Writer/DwrfReader includes from
HiveConnectorTestBase.cpp; bolt_exec_test_lib already links the necessary
dwio targets, so OperatorTestBase.cpp's new includes resolve without any
CMake change.
yingsu00 and others added 12 commits May 26, 2026 20:01
…ase to OperatorTestBase

Move writeToFile from AggregationTestBase to OperatorTestBase to avoid
ambibuity.
Add process-local registration helpers for Hive and TPC-H connector
factories so tests can bootstrap connectors by name without relying on
static initialization side effects.

Extend Hive object factory deserialization for row index columns,
explicit data columns, remaining filters, and insert handles used by the
generic connector test path.
…tBase

Add bolt/connectors/tests/utils/ConnectorTestBase.{h,cpp,CMakeLists.txt}:

- bytedance::bolt::connector::test::ConnectorTestBase is a GTest fixture
  that inherits only ::testing::Test and bolt::test::VectorTestBase. It has
  no dependency on exec::test::OperatorTestBase, so connector tests no
  longer need to reach up into the exec test scaffolding to get a fixture.
- The constructor takes (connectorName, connectorId, factoryRegistrar).
  SetUp() invokes the registrar if the named factory isn't already
  registered, then registers an object factory + Connector under
  connectorId. No hardcoded "if hive ... else if tpch ..." switch.
- Connector-agnostic helpers (makeConnectorSplit, makeConnectorSplits,
  makeColumnHandle, makeTableHandle) go through ConnectorObjectFactory.
- The registration logic is factored into free functions
  registerTestConnector / unregisterTestConnector so the legacy
  exec::test::HiveConnectorTestBase can reuse it without inheriting the
  new fixture (it still inherits OperatorTestBase, intentionally, so
  existing exec tests keep compiling unchanged until commit 3 migrates
  them).

The new library bolt_connector_test_lib links only bolt_connector +
bolt_vector_test_lib + a few base utilities; it does NOT link
bolt_exec_test_lib or any specific connector implementation (hive/tpch).
bolt_exec_test_lib now links bolt_connector_test_lib so the legacy
HiveConnectorTestBase can delegate its SetUp/TearDown to the shared
helpers.

This is the first step of the larger refactor: subsequent commits will
migrate exec/dwio/common tests off the legacy HiveConnectorTestBase to
use this new fixture directly, then delete the legacy file once all
callsites are migrated.
…edLoopJoinTest migration

Add the parameterization infrastructure that the exec test migration will use:

- connector::test::ConnectorTestParam (in connectors/tests/utils/ConnectorTestBase.h):
  struct bundling { connectorName, connectorId, factoryRegistrar } so tests
  can be parameterized over connectors via ::testing::WithParamInterface.

- Free-function variants of makeConnectorSplit / makeConnectorSplits /
  makeColumnHandle / makeTableHandle taking connectorName as first arg.
  Tests not inheriting connector::test::ConnectorTestBase (i.e. exec tests
  that inherit OperatorTestBase + WithParamInterface) call these directly.

Two pilot migrations to establish the pattern for the rest of commit 3a:

- exec/tests/NestedLoopJoinTest.cpp: doesn't use any connector functionality
  (no makeHiveConnectorSplit, no addSplit, no tableScan). Drop the
  HiveConnectorTestBase parent and inherit OperatorTestBase directly. No
  parameterization needed -- the test isn't actually connector-related.

- exec/tests/LimitTest.cpp: uses one makeHiveConnectorSplit call inside
  limitOverLocalExchange. Migrate to OperatorTestBase + WithParamInterface
  <ConnectorTestParam>, SetUp/TearDown delegate to registerTestConnector /
  unregisterTestConnector, TEST_F -> TEST_P, one INSTANTIATE_TEST_SUITE_P
  block at the bottom listing Hive only for now.

The remaining 11 read-side tests + 3 write-side tests follow in subsequent
commits.
…il, rename namespace

Move bolt/exec/tests/utils/Temp{File,Directory}Path.{h,cpp} to
bolt/common/testutil/, and rename the namespace from
bytedance::bolt::exec::test to bytedance::bolt::test (matching the existing
neutral test-utility namespace used by VectorMaker/VectorTestBase/BatchMaker).

These two classes are pure file-system test scaffolding -- they don't carry
any exec-engine semantics -- yet their location under bolt/exec/tests/utils/
and their bytedance::bolt::exec::test namespace forced every test that
needed a temp path (including connector tests, dwio tests, common/file
tests, and substrait tests) to reach up into exec test scaffolding to get
them. After this move, connectors/tests/utils can use TempFilePath without
re-introducing a layering inversion into exec.

The bolt_temp_path CMake target moves from bolt/exec/tests/utils/CMakeLists.txt
to bolt/common/testutil/CMakeLists.txt; its sources and one dependency
(bolt_exception) are unchanged, so every existing caller's link line still
resolves bolt_temp_path correctly.

Mechanical sweep across ~119 files:
- #include "bolt/exec/tests/utils/Temp{File,Directory}Path.h"
  -> #include "bolt/common/testutil/Temp{File,Directory}Path.h"
- Qualified references: bytedance::bolt::exec::test::Temp{File,Directory}Path
  -> bytedance::bolt::test::Temp{File,Directory}Path (and the shorter
  exec::test::Temp... variants similarly).
- Files that referenced TempFilePath unqualified via
  `using namespace bytedance::bolt::exec::test;` get an additional
  `using namespace bytedance::bolt::test;` directive injected after the
  exec::test one, so unqualified uses keep resolving.
…deToStringTest off Hive includes

Establish the connector-agnostic parameterization pattern for exec tests:

- bolt/connectors/tests/utils/ConnectorTestBase.{h,cpp}: add paramFor() and
  paramsFor() free helpers in connector::test. paramFor(name) returns
  ConnectorTestParam{name, "test-" + name, /*factoryRegistrar=*/nullptr}.
  The factory is expected to already be registered with the runtime (via
  the connector library's static-init self-registration when bolt_<name>_
  connector is linked, or via an explicit register<Name>ConnectorFactories()
  call somewhere ahead of INSTANTIATE_TEST_SUITE_P). With this, exec test
  source files can stay connector-agnostic: they reference connector names
  as strings (kHiveConnectorName etc.) and never include connector-specific
  headers.

- bolt/connectors/tests/utils/ConnectorTestBase.cpp: also adds a
  TempFilePath-vector overload of makeConnectorSplits so tests can build
  splits preserving $file_size / $file_modified_time info columns without
  inlining a loop. Link bolt_temp_path from bolt_connector_test_utils.

- exec/tests/LimitTest.cpp: replace the inline ConnectorTestParam{...,
  &connector::hive::registerHiveConnectorFactories} construction with
  ::testing::ValuesIn(paramsFor({kHiveConnectorName})), and drop the
  bolt/connectors/hive/HiveConnector.h include. (This corrects the wart
  introduced in the LimitTest pilot in commit 9c63d25.)

- exec/tests/PrintPlanWithStatsTest.cpp: parameterize over ConnectorTestParam
  (Hive only for now via paramsFor({kHiveConnectorName})), TEST_F -> TEST_P,
  SetUp/TearDown delegate to register/unregisterTestConnector,
  makeHiveConnectorSplits(filePaths) calls become
  connector::test::makeConnectorSplits(GetParam().connectorName, ...).
  No bolt/connectors/hive includes.

- exec/tests/PlanNodeToStringTest.cpp: the partitionedOutput case had a
  HivePartitionFunctionSpec sub-assertion mixed in with generic ones.
  Split: connector-agnostic plan-node-to-string coverage stays here; the
  Hive-specific sub-assertion moves to a new file
  bolt/connectors/hive/tests/HivePartitionFunctionPlanNodeToStringTest.cpp.
  PlanNodeToStringTest.cpp loses its HiveConnector.h / HiveConnectorSplit.h
  includes. Hive-side CMake target gains the new test file plus
  bolt_exec_test_lib / bolt_hive_connector / bolt_vector_test_lib link deps.
…t/MorselDrivenTest to parameterized connector tests

Same mechanical pattern as 774e270:

- Drop bolt/exec/tests/utils/HiveConnectorTestBase.h; include
  bolt/connectors/ConnectorNames.h, bolt/connectors/tests/utils/
  ConnectorTestBase.h, and bolt/exec/tests/utils/OperatorTestBase.h.
- Change base from HiveConnectorTestBase to OperatorTestBase +
  ::testing::WithParamInterface<connector::test::ConnectorTestParam>.
  Tests that already override SetUp/TearDown have the connector
  register/unregister calls woven into their existing overrides.
- TEST_F/DEBUG_ONLY_TEST_F -> TEST_P/DEBUG_ONLY_TEST_P.
- makeHiveConnectorSplit(path) -> connector::test::makeConnectorSplit(
  GetParam().connectorName, path).
- INSTANTIATE_TEST_SUITE_P at the bottom using connector::test::paramsFor(
  {kHiveConnectorName}) -- Hive only for now, easy to extend to Iceberg etc.

TaskTest::executeSerial loses its `static` qualifier so it can reference
GetParam() for the split-construction loop. All its callers are inside
TEST_P bodies and call it as an instance method already, so the static
drop is invisible at call sites.

No bolt/connectors/hive/* includes anywhere in the migrated files.
…/AssertQueryBuilderTest

Same pattern as 060ad73 (TEST_F -> TEST_P, OperatorTestBase +
WithParamInterface, SetUp/TearDown delegate to register/unregisterTestConnector,
makeHiveConnectorSplit -> connector::test::makeConnectorSplit, INSTANTIATE_
TEST_SUITE_P at the bottom). All three lose their bolt/connectors/hive
includes.

- GroupedExecutionTest: was inheriting `public virtual HiveConnectorTestBase`;
  the virtual is preserved on the new base (`public virtual OperatorTestBase`)
  to keep the existing call site shape. SetUpTestCase now chains to
  OperatorTestBase::SetUpTestCase directly. makeVectors() now delegates to
  OperatorTestBase::makeVectors (which is where it lives after 888e702).

- LocalPartitionTest: straightforward parameterization. 5 makeHiveConnectorSplit
  call sites updated to connector::test::makeConnectorSplit(GetParam().
  connectorName, ...).

- AssertQueryBuilderTest: the hiveSplits TEST_F mixed generic and
  Hive-specific (HiveConnectorSplitBuilder + partition-keyed splits +
  partitionKey/regularColumn helpers) assertions in one body. Extracted the
  whole hiveSplits test into a new file under connectors/hive/tests/
  (HiveAssertQueryBuilderTest.cpp), which inherits the legacy
  HiveConnectorTestBase -- transitional, since that base goes away in
  commit 7 of this series. Replaced the body in AssertQueryBuilderTest with
  a one-line comment pointing at the new file (mirroring the
  PlanNodeToStringTest split done in 774e270).
Commit bbcafbc removed AggregationTestBase::makeVectors(rowType, size, numVectors) and
  relied on the inherited OperatorTestBase::makeVectors(rowType, numVectors, rowsPerVector)
  instead — but the 2nd and 3rd parameters are swapped.
@ZacBlanco ZacBlanco force-pushed the connector_refactor_5 branch from 91cb707 to e58e672 Compare May 26, 2026 21:01

@ZacBlanco ZacBlanco left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this great refactor @yingsu00 ! Decoupling the hive connector from the test infrastructure is a big step forward for the project supporting more data connectors.

I have a few minor comments, but otherwise I am happy with this change.

Below are also a few items pointed out by an AI reviewer that seem like reasonable asks to me


Tests that migrate from HiveConnectorTestBase to OperatorTestBase + WithParamInterface<ConnectorTestParam> implement their own SetUp()/TearDown() but do not drain ioExecutor_ before calling unregisterTestConnector. Compare:

  • HiveConnectorTestBase::TearDown() (line 58-64): drains ioExecutor_.reset() before unregistering — correct.
  • ConnectorTestBase::TearDown() (line 258-263): drains ioExecutor_.reset() before unregistering — correct.
  • AssertQueryBuilderTest::TearDown() (in migrated test): does NOT drain ioExecutor_ before unregisterTestConnector.

This pattern is repeated in:

  • AssertQueryBuilderTest::TearDown()
  • GroupedExecutionTest::TearDown()
  • TaskTest::TearDown()
  • MergeJoinTest::TearDown() (via MergeJoinTestBase)
  • LocalPartitionTest::TearDown()

8.3 HiveObjectFactory::makeTableHandledataColumns Deserialization

File: bolt/connectors/hive/HiveObjectFactory.cpp:216-230

The old code always built dataColumns from columnHandles. The new code first checks for dataColumns in options and deserializes it:

if (const auto* dc = options.get_ptr("dataColumns")) {
  dataColumns = std::dynamic_pointer_cast<const RowType>(
      ISerializable::deserialize<Type>(*dc, nullptr));
}

Deserializing with nullptr as the memory pool parameter may create allocations in a default/global pool. If dataColumns is used in a context where a specific pool is required, this could be an issue. The remainingFilter deserialization above it uses memory::MemoryManager::getInstance()->tracePool() — inconsistency worth noting.

Comment on lines +63 to +74
- name: Ensure generated headers exist
run: |
# _compile_db (invoked by compile_db_all) is supposed to build these,
# but make it explicit so clang-tidy in the next step never races
# against a missing dwrf_schema.pb.h / parquet thrift header. If the
# underlying chain already produced them, this is a no-op rebuild.
cmake --build --preset conan-release \
--target generate_parquet_thrift \
--target bolt_dwio_dwrf_proto
test -f _build/Release/bolt/dwio/dwrf/proto/dwrf_schema.pb.h \
|| { echo "::error::dwrf_schema.pb.h missing after proto build" >&2; \
ls -la _build/Release/bolt/dwio/dwrf/proto/ || true; exit 1; }

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be necessary as it's already taken care of in the makefile

Comment thread bolt/connectors/tests/utils/ConnectorTestBase.cpp
Comment on lines +293 to +298
for (uint32_t i = 0; i < splitCount; ++i) {
splits.emplace_back(factory->makeConnectorSplit(
effectivePath,
i * splitSize,
splitSize,
connector::makeOptions({{"fileFormat", static_cast<int>(format)}})));

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a copy of logic from an earlier implementation. Can we combine the implements to prevent duplication?

Comment thread bolt/core/Expressions.cpp

auto* pool = static_cast<memory::MemoryPool*>(context);
if (pool == nullptr) {
pool = memory::MemoryManager::getInstance()->tracePool();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why this change is needed?

Comment thread bolt/connectors/tests/utils/ConnectorTestBase.cpp
Comment thread scripts/run-clang-tidy.py
file=sys.stderr,
)

if not files_to_process:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of the issue is that this stems from headers not being in the compile-commands.json file. I have run into similar issues before. Maybe we can just simply filter out files with header extensions instead? e.g. .h/.hpp/etc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants