chore: modernize DVT to upgrade ibis-framework to 7.1.0 #1745

Open
renzokuken wants to merge 38 commits into develop from ibis-7.1-modernization

Conversation

renzokuken commented May 28, 2026

Collaborator

Description of changes

Write a description of the changes you have made in this PR. Extremely small changes such as fixing typos do not need a description.

Issues to be closed

Note: Before submitting a pull request, please open an issue for discussion if you are not associated with Google.

Closes #931

Checklist

  • I have followed the CONTRIBUTING Guide.
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated any relevant documentation to reflect my changes, if applicable
  • I have added unit and/or integration tests relevant to my change as needed
  • I have already checked locally that all unit tests and linting are passing (use the tests/local_check.sh script)
  • I have manually executed end-to-end testing (E2E) with the affected databases/engines

@renzokuken

Collaborator Author

DVT Migration Plan: Upgrading to Ibis 7.1.0 (Safer Intermediate Modernization)

Executive Summary

This document outlines the modernization plan for upgrading the Data Validation Tool (DVT) from Ibis 5.1.0 to Ibis 7.1.0.

A direct jump to Ibis 9.0.0 was scaffolded in other work, but it brings extensive breaking changes: the compilation engine was rewritten on top of SQLGlot, and the older Pandas interface was removed in favor of DuckDB.

Ibis 7.1.0 is a safer intermediate target for DVT's modernization, for three reasons:

  1. Retains Pandas Interface: It preserves the legacy Pandas client interface which DVT uses extensively for local filesystem and memory validations.
  2. Maintains Backend Compilers: It retains the classic Compiler and ExprTranslator class hierarchy in ibis.backends.base.sql, meaning DVT's custom backends (Teradata, DB2, DB2 z/OS, Sybase, Spanner) do not require a full SQLGlot compiler rewrite and can be updated with minimal modifications.
  3. Upgrades Supported Backends: We retire DVT's custom copies/wrappers of native backends (BigQuery, Postgres, Impala, MSSQL, MySQL, Oracle, Snowflake) and migrate to Ibis 7.1.0's fully mature, native connection interfaces directly.

Architecture Comparison: 5.1.0 vs 7.1.0 vs 9.0.0

| Feature / Architecture | Ibis 5.1.0 (Current) | Ibis 7.1.0 (Target) | Ibis 9.0.0 (Alternative) |
| --- | --- | --- | --- |
| Pandas Interface | Local DataFrames | Local DataFrames | Fully replaced by DuckDB |
| Compiler Engine | SQLAlchemy & Base SQL | SQLAlchemy & Base SQL | SQLGlot Compiler |
| Native SQL Backends | Partials in third_party | Fully Natively Supported | Fully Natively Supported |
| Custom Compilers | Standard ExprTranslator | Standard ExprTranslator | Rewrite visit_<Operation> |
| Upgrade Risk | None (Status Quo) | Low / Medium | Extremely High |

Step-by-Step Migration Strategy

1. Dependency Alignment (setup.py)

Update setup.py to specify ibis-framework==7.1.0. The Python environment requirements are updated to Python 3.10+. Python 3.11 is ideal and fully compatible with all dependencies, including numpy 1.x and pandas 2.x prebuilt wheels.
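
As a quick sanity check on the pin and the interpreter requirement described above, a minimal sketch using only the standard library could look like the following. The function name `check_environment` and its defaults are assumptions made for illustration; they are not part of DVT or of this PR.

```python
import sys
from importlib import metadata


def check_environment(package="ibis-framework", expected="7.1.0", min_python=(3, 10)):
    """Return a list of problems; an empty list means the environment matches the pin."""
    problems = []

    # Compare only (major, minor) so that patch releases such as 3.11.9 pass.
    if sys.version_info[:2] < min_python:
        found = f"{sys.version_info[0]}.{sys.version_info[1]}"
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required, found {found}")

    # importlib.metadata reads the installed distribution's version without importing it.
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        problems.append(f"{package} is not installed")
    else:
        if installed != expected:
            problems.append(f"{package}=={expected} expected, found {installed}")

    return problems
```

Calling `check_environment()` in a fresh virtual environment and printing each returned string gives a short report before any DVT module is imported.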

2. Retire Obsolete Custom Backend Copies

In DVT's legacy setup, custom directories for native backends were kept in third_party/ibis/ because of earlier limitations. Under Ibis 7.1.0, we leverage the native connection methods in the CLIENT_LOOKUP map inside data_validation/clients.py:

  • BigQuery: Uses ibis.bigquery.connect
  • Impala: Uses ibis.impala.connect
  • MySQL: Uses ibis.mysql.connect
  • Postgres: Uses ibis.postgres.connect
  • MSSQL: Uses lazy/dynamic ibis.mssql.connect
  • Oracle: Uses lazy/dynamic ibis.oracle.connect
  • Snowflake: Uses lazy/dynamic ibis.snowflake.connect

Note: Dynamic wrappers are added to prevent import-time PackageNotFoundError exceptions when specific database drivers are not present in the runtime environment.
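
The lazy-loading idea in the note above can be sketched without Ibis at all. In this illustration, `lazy_connect`, the table contents, and the module name `dvt_missing_driver_example` are hypothetical, and the standard-library `sqlite3` module stands in for a backend that happens to be installed. Catching `ImportError` also covers `ModuleNotFoundError` and `importlib.metadata.PackageNotFoundError`, since both are subclasses of it.

```python
import importlib


def lazy_connect(module_name, install_hint):
    """Build a connect function that imports its backend module only when called."""

    def connect(*args, **kwargs):
        try:
            module = importlib.import_module(module_name)
        except ImportError as exc:
            # Chain the original error so the real cause stays in the traceback.
            raise ImportError(
                f"{module_name} is unavailable; try: {install_hint}"
            ) from exc
        return module.connect(*args, **kwargs)

    return connect


# Hypothetical lookup table in the spirit of CLIENT_LOOKUP (stand-in entries only).
CLIENT_LOOKUP = {
    "SQLite": lazy_connect("sqlite3", "bundled with Python"),
    "Missing": lazy_connect("dvt_missing_driver_example", "pip install <driver>"),
}
```

Building the table never imports a driver, so a missing package only surfaces when that specific entry is called.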

3. Consolidate Custom Dialect Extensions in operations.py

DVT extends SQL dialects to add custom operations (ops.HashBytes, RawSQL, ToChar, PaddedCharLength, BinaryLength). Rather than referencing registry modules inside retired third_party/ibis/* directories, we define these helper functions directly inside third_party/ibis/ibis_addon/operations.py and register them on the native Ibis translator classes.

4. Update Registry Mappings for Internal Ibis Renames

Between Ibis 5.1.0 and 7.1.0, several internal APIs changed:

  • ops.IfNull was replaced by the more generic ops.Coalesce (which takes a variable-length tuple of arguments instead of two scalar fields).
  • ops.NotAny, ops.NotAll, ops.CumulativeAll, and ops.CumulativeAny are removed from operations as Ibis now translates them into standard windowed All/Any nodes automatically.
  • to_sqla_type was refactored to AlchemyType.from_ibis inside ibis.backends.base.sql.alchemy.datatypes.

We updated the registry files for the custom backends (Teradata, DB2, DB2 z/OS, and Sybase) to reflect these changes.
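
Because ops.Coalesce carries a variable-length tuple, a rule that previously rendered a two-argument IfNull has to cope with one, two, or more arguments. One way to do that, shown here as plain string rendering rather than Ibis or SQLAlchemy code, is to fold the arguments from the right. The helper name `coalesce_as_nested_ifnull` is hypothetical and is not taken from the PR.

```python
from functools import reduce


def coalesce_as_nested_ifnull(args, func_name="ISNULL"):
    """Render COALESCE(a, b, c) as ISNULL(a, ISNULL(b, c)) with a two-argument function."""
    if not args:
        raise ValueError("coalesce needs at least one argument")

    *head, last = args
    # Fold from the right so the first non-null argument wins, as in COALESCE.
    return reduce(
        lambda inner, arg: f"{func_name}({arg}, {inner})",
        reversed(head),
        last,
    )
```

With two arguments this reduces to the old IfNull shape, so existing two-argument call sites keep rendering the same SQL.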


Code Modifications (PR Summary)

Setup & Client Connection Updates

setup.py was updated to pin ibis-framework==7.1.0.

data_validation/clients.py imports spanner_connect and redshift_connect from third_party but leverages standard native backends for all other databases. Safe dynamic loaders are added:

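import ibis

def oracle_connect(*args, **kwargs):
    # ibis.oracle is resolved at call time, so a missing driver only fails
    # when an Oracle connection is actually requested.
    try:
        return ibis.oracle.connect(*args, **kwargs)
    except ImportError:
        raise Exception("pip install oracledb")

def snowflake_connect(*args, **kwargs):
    # Same pattern for Snowflake, which needs two optional packages.
    try:
        return ibis.snowflake.connect(*args, **kwargs)
    except ImportError:
        raise Exception("pip install snowflake-connector-python && pip install snowflake-sqlalchemy")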

Custom Operator Modernization

third_party/ibis/ibis_addon/operations.py was modernized to use class-level type annotations instead of obsolete rlz.one_of rules, aligning with Ibis 7.x's new pattern:

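import ibis.expr.datatypes as dt
import ibis.expr.operations as ops
import ibis.expr.rules as rlz

# Inputs are declared as annotated class attributes; dtype and shape
# describe the output of each operation.
class BinaryLength(ops.Value):
    arg: ops.Value[dt.Binary | dt.String]
    dtype = dt.int32
    shape = rlz.shape_like("arg")

class PaddedCharLength(ops.Value):
    arg: ops.Value[dt.String]
    dtype = dt.int32
    shape = rlz.shape_like("arg")

class ToChar(ops.Value):
    arg: ops.Value[dt.Decimal | dt.Float64 | dt.Date | dt.Time | dt.Timestamp]
    fmt: ops.Value[dt.String]
    dtype = dt.string
    shape = rlz.shape_like("arg")

# Comparison supplies the boolean output type for the two string operands.
class RawSQL(ops.Comparison):
    left: ops.Value[dt.String]
    right: ops.Value[dt.String]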

Database Dialects Registration

All translator overrides are registered directly onto the native Ibis classes:

BigQueryExprTranslator._registry[ops.HashBytes] = bigquery_format_hashbytes
ImpalaExprTranslator._registry[ops.Coalesce] = impala_sa_ifnull
PostgreSQLExprTranslator._registry[ops.Cast] = postgres_sa_cast
MsSqlExprTranslator._registry[ops.Coalesce] = sa_fixed_arity(sa.func.isnull, 2)
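
To make the registration pattern above concrete, here is a minimal stand-in that uses no Ibis imports. The classes `BaseTranslator`, `DialectATranslator`, `DialectBTranslator`, and the rule `dialect_a_hashbytes` are all hypothetical; the only things borrowed from the PR are the idea of a class-level `_registry` dict keyed by operation type and rules that take the translator and the operation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HashBytes:
    """Stand-in operation node (not the real ops.HashBytes)."""
    arg: str
    how: str = "sha256"


class BaseTranslator:
    _registry = {}

    def translate(self, op):
        # Dispatch on the operation's type, as a registry-based translator does.
        rule = self._registry.get(type(op))
        if rule is None:
            raise NotImplementedError(
                f"{type(self).__name__} has no rule for {type(op).__name__}"
            )
        return rule(self, op)


class DialectATranslator(BaseTranslator):
    # Each dialect copies the registry so its overrides stay local.
    _registry = BaseTranslator._registry.copy()


class DialectBTranslator(BaseTranslator):
    _registry = BaseTranslator._registry.copy()


def dialect_a_hashbytes(translator, op):
    """Illustrative rule: render a hex-encoded hash call as a SQL string."""
    return f"TO_HEX({op.how.upper()}({op.arg}))"


# Register the rule on one dialect only.
DialectATranslator._registry[HashBytes] = dialect_a_hashbytes
```

Because `_registry` is a class attribute, the assignment changes behavior for every instance of that translator class, which is why a registration done at import time in one module takes effect process-wide.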

Verification and Results

All verification tests were executed successfully using Python 3.11.9 with dependencies loaded cleanly from ./venv_new_libs:

  1. Syntax and Schema Compilation: Verified that all custom operators compile correctly into standard Ibis nodes.
  2. Operations Module Import: Verified that third_party.ibis.ibis_addon.operations imports cleanly under Ibis 7.1.0.
  3. DVT Entrypoint & Core Clients: Verified that data_validation.clients and DVT's CLI entrypoint (data_validation.__main__) import cleanly without any warnings or AttributeErrors.

Verification Commands

# Verify Ibis Addon Operations
python3.11 scratch/test_operations_import.py
# Output: Imported Ibis version: 7.1.0 | Successfully imported third_party.ibis.ibis_addon.operations!

# Verify Clients Registry
python3.11 scratch/test_clients_import.py
# Output: Imported Ibis version: 7.1.0 | Successfully imported data_validation.clients!

# Verify DVT Main entry point
python3.11 scratch/test_dvt_import.py
# Output: Imported Ibis version: 7.1.0 | Successfully imported data_validation.__main__ and its main entry point!
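
The contents of the scratch scripts are not shown in this PR. As an assumption about their general shape, an import smoke check along these lines would report success in a similar format; `check_import` is a hypothetical helper and the module names passed to it are up to the caller.

```python
import importlib


def check_import(module_name):
    """Try to import a module and return (ok, message) instead of raising."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # import-time failures are not limited to ImportError
        return False, f"Failed to import {module_name}: {type(exc).__name__}: {exc}"
    return True, f"Successfully imported {module_name}!"
```

Catching `Exception` rather than only `ImportError` matters here, because a renamed Ibis attribute would surface as an `AttributeError` raised while the module body runs.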

Next Steps

  1. Review Implementation Changes: The changes have been applied to the ibis-7.1-modernization branch.
  2. CI Pipeline Integration: Run full integration test suites for specialized targets (like Oracle, DB2, Snowflake) in test containers to validate runtime execution of translated queries.
  3. Merge to Mainline: Merge the ibis-7.1-modernization branch into develop.

@nj1973

nj1973 commented Jun 9, 2026

Collaborator

/gcbrun

nj1973 and others added 3 commits June 19, 2026 12:46
* fix: Use UTC for naive timestamp epoch conversion

* fix: Refactor import of clients to avoid circular imports

Moved the import of clients to within the function to avoid circular imports and reduce heavyweight loading in the utility module.

* test: Cover pre-1970 timezone-aware epoch

---------

Co-authored-by: Neil Johnson <neiljohnson@google.com>
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
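
The two timestamp commits above concern epoch conversion for naive and pre-1970 values. The actual DVT change is not shown here; the following is only a sketch of the general technique, with `to_epoch_seconds` as a hypothetical name. It treats naive datetimes as UTC and uses subtraction from a fixed epoch, which avoids the local-time interpretation that `datetime.timestamp()` applies to naive values.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_seconds(value):
    """Convert a datetime to whole seconds since the Unix epoch."""
    if value.tzinfo is None:
        # Assumption for this sketch: naive timestamps are interpreted as UTC.
        value = value.replace(tzinfo=timezone.utc)
    # Aware datetimes subtract correctly across offsets, including before 1970.
    # Floor division rounds fractional seconds toward negative infinity.
    return (value - EPOCH) // timedelta(seconds=1)
```

For example, 19:00 on 1969-12-31 at UTC-5 is the same instant as the epoch itself, so it converts to 0, while one second before midnight UTC on that date converts to -1.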
@renzokuken

Collaborator Author

/gcbrun

@nj1973

nj1973 commented Jun 24, 2026

Collaborator

/gcbrun


Development

Successfully merging this pull request may close these issues.

Ibis upgrade to 7.0

4 participants