feat: Add Advanced ClickHouse AT Mode Support#8007

Open
Sumit6307 wants to merge 19 commits into apache:2.x from Sumit6307:feature/gsoc-clickhouse-at-mode

Conversation


@Sumit6307 Sumit6307 commented Mar 4, 2026

Ⅰ. Describe what this PR did

This PR introduces native AT Mode support for ClickHouse. Because ClickHouse handles data mutations asynchronously and uses different syntax compared to standard relational databases, new executors were built specifically for its SQL dialect.

Key Changes:

  1. ClickhouseTableMetaCache: Designed to properly extract table metadata using ClickHouse's system tables (system.columns equivalent via JDBC).
  2. ClickhouseUndoInsertExecutor: Implements the ALTER TABLE ... DELETE WHERE ... syntax to rollback inserted records.
  3. ClickhouseUndoUpdateExecutor: Implements the ALTER TABLE ... UPDATE ... WHERE ... syntax for undoing updates.
  4. ClickhouseUndoDeleteExecutor: Restores deleted records via standard INSERT INTO.
  5. ClickhouseUndoLogManager: Customized the batch deletion of stale undo logs to prevent syntax errors related to standard SQL DELETE FROM.
  6. Registered the ClickHouse Executor Holders and Meta Caches via Java SPI.
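To make the dialect difference concrete: rolling back an INSERT in ClickHouse has to be expressed as an ALTER TABLE mutation rather than a standard DELETE FROM. A minimal sketch of that SQL generation, with hypothetical class and method names (not the PR's actual code):

```java
import java.util.List;
import java.util.stream.Collectors;

public class ClickhouseUndoSqlSketch {
    // Roll back an INSERT by deleting the inserted primary keys via an
    // ALTER TABLE mutation (ClickHouse has no standard DELETE FROM ... WHERE).
    static String buildUndoInsertSql(String table, String pkColumn, List<String> pkValues) {
        String inList = pkValues.stream()
                .map(v -> "'" + v + "'")
                .collect(Collectors.joining(", "));
        return "ALTER TABLE " + table + " DELETE WHERE " + pkColumn + " IN (" + inList + ")";
    }

    public static void main(String[] args) {
        System.out.println(buildUndoInsertSql("user", "id", List.of("1", "2")));
    }
}
```

The real executors in the PR additionally handle escaping and multi-column keys; this only shows the shape of the generated statement.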

Ⅱ. Does this pull request fix one issue?

This operates as a Proof-of-Concept for a formal Google Summer of Code (GSoC) 2026 contribution.

Ⅲ. Why don't you add test cases (unit test/integration test)?

N/A. Unit tests have been added.

Added the following JUnit mock tests to verify exact SQL generation strings:

  • ClickhouseUndoInsertExecutorTest
  • ClickhouseUndoUpdateExecutorTest
  • ClickhouseUndoDeleteExecutorTest

Ⅳ. Describe how to verify it

You can verify the SQL generation through the test suite:

./mvnw clean test -pl rm-datasource -Dtest=ClickhouseUndo*


Ⅴ. Special notes for reviews

ClickHouse is fundamentally an OLAP database, but its usage in high-throughput environments necessitates distributed transaction control for bulk inserts/mutations. The JDBC driver's index fetching can occasionally return ambiguous metadata for certain MergeTree variants, which may need specialized tuning in ClickhouseTableMetaCache in follow-up PRs.

@Sumit6307 Sumit6307 changed the title feat: Add Advanced ClickHouse AT Mode Support (GSoC Proposal) feat: Add Advanced ClickHouse AT Mode Support Mar 4, 2026

@Sumit6307 Sumit6307 force-pushed the feature/gsoc-clickhouse-at-mode branch from b7eb8a6 to 6928ec2 Compare March 4, 2026 13:35

@funky-eyes funky-eyes left a comment


Hi Sumit,

Thanks for proposing this feature idea. However, as I understand it, ClickHouse currently appears to lack table-level or row-level locking, so it would be impossible to construct correct undo/redo information when using FOR UPDATE to produce undo-log records. Because the undo log is not stored together with the business table, I seriously doubt that committing such a local transaction could achieve the ACID guarantees expected of a traditional relational database.

Moreover, Seata’s AT mode requires obtaining the primary keys of the tables affected by DML in order to build global locks. ClickHouse writes are asynchronous and usually incur some delay while data is being merged. If a transaction needs to be rolled back during that period, can the primary-key information for the branch within the global transaction be reliably queried, and will the query results match the redo (after-image) content stored in the undo log?

Can you provide evidence or documentation showing that ClickHouse’s ACID capabilities can be made to work correctly with Seata’s AT mode?

@Sumit6307

> Hi Sumit,
>
> Thanks for proposing this feature idea. However, as I understand it, ClickHouse currently appears to lack table-level or row-level locking, so it would be impossible to construct correct undo/redo information when using FOR UPDATE to produce undo-log records. Because the undo log is not stored together with the business table, I seriously doubt that committing such a local transaction could achieve the ACID guarantees expected of a traditional relational database.
>
> Moreover, Seata’s AT mode requires obtaining the primary keys of the tables affected by DML in order to build global locks. ClickHouse writes are asynchronous and usually incur some delay while data is being merged. If a transaction needs to be rolled back during that period, can the primary-key information for the branch within the global transaction be reliably queried, and will the query results match the redo (after-image) content stored in the undo log?
>
> Can you provide evidence or documentation showing that ClickHouse’s ACID capabilities can be made to work correctly with Seata’s AT mode?

Thank you for the detailed technical feedback. You are correct that ClickHouse’s OLAP architecture differs significantly from traditional RDBMS systems. However, Seata’s AT mode can be adapted to work reliably with ClickHouse through the following mechanisms:

  1. Seata Global Lock (Logical Locking)

Although ClickHouse does not support native SELECT FOR UPDATE row-level locking, Seata AT mode provides isolation through its Global Lock mechanism managed by the Seata Server (Transaction Coordinator).

Before a local transaction completes Phase 1, Seata acquires a lock on the corresponding Primary Key in the global lock table. This prevents other Seata-managed transactions from performing conflicting updates on the same records. While this is not physical row-level locking at the database layer, it provides the required logical isolation to prevent dirty writes within Seata-managed distributed transactions.

  2. Synchronous Mutations

To address the asynchronous write behavior in ClickHouse, this implementation recommends configuring the session parameter:

SET mutations_sync = 1

(or 2 in distributed cluster environments)

This forces ClickHouse to wait until the ALTER TABLE ... UPDATE/DELETE mutation is persisted before returning control to the application. As a result, the before-image and after-image used by Seata for undo/redo operations remain consistent and reliable.
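Where the session-level SET is inconvenient (for example with pooled connections), the same effect can be sketched as a per-query SETTINGS clause; whether ALTER accepts query-level settings depends on the ClickHouse version, so treat this as an assumption:

```java
public class SyncMutationSketch {
    // Append a per-query setting so the ALTER mutation blocks until applied.
    // Assumes a ClickHouse version that accepts SETTINGS on ALTER queries.
    static String withSyncMutation(String mutationSql) {
        return mutationSql + " SETTINGS mutations_sync = 1";
    }

    public static void main(String[] args) {
        System.out.println(withSyncMutation(
            "ALTER TABLE user UPDATE name = 'John' WHERE id = 1"));
    }
}
```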

  3. Atomic Log Consistency

By combining mutations_sync with Seata’s Phase 1 execution logic, the business data mutation and the corresponding undo_log entry are synchronized from the application perspective.

While ClickHouse does not provide traditional cross-table transactional guarantees like OLTP databases, the synchronous mutation behavior ensures that once the call returns successfully, the data state is persisted and can be deterministically rolled back if required during Phase 2.

  4. Primary Key Discovery

In ClickhouseTableMetaCache, I have implemented logic to correctly identify the Primary Key or ORDER BY columns used in the MergeTree family engines.

Since ClickHouse relies on sorting keys rather than conventional primary keys, this ensures that Seata can accurately determine which rows need to be locked and rolled back during distributed transaction processing.

Together, these mechanisms allow Seata AT mode to operate in a logically consistent and controlled manner on top of ClickHouse despite its architectural differences from traditional RDBMS systems.
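As an illustration of the key-discovery step, the lookup could be sketched as a query against system.tables, which exposes primary_key and sorting_key for MergeTree-family engines (the class name here is hypothetical, not the PR's actual code):

```java
public class SortingKeyLookupSketch {
    // Build the metadata query for a given table; consumers would fall back
    // to sorting_key when primary_key is empty, since MergeTree tables may
    // define only an ORDER BY clause.
    static String buildKeyLookupSql(String database, String table) {
        return "SELECT primary_key, sorting_key FROM system.tables"
             + " WHERE database = '" + database + "' AND name = '" + table + "'";
    }

    public static void main(String[] args) {
        System.out.println(buildKeyLookupSql("default", "user"));
    }
}
```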

@funky-eyes

Regarding your first point:
if the database-level local lock is not exclusive, the generated undo log will be incorrect. The global lock and the local lock (i.e., the database’s exclusive lock) must cooperate to ensure undo logs are correct in AT mode; they form a reciprocal relationship. The local lock prevents other transactions from interfering with the current branch transaction so that a correct undo log can be created. After producing the undo log, the branch registers with the server and then acquires the corresponding global lock. Only when the global lock is obtained—which indicates no other distributed transactions hold the resources associated with that branch lock—can the local transaction be committed. This prevents dirty writes and allows the local lock to be released safely. Only by holding both the local lock and the global lock can correctness be guaranteed in a distributed scenario and the corresponding local lock be safely released. If you obtain only the global lock without the local lock, that does not mean the undo log was created correctly;
this will lead to incorrect data during two-phase rollback.

@Sumit6307

> Regarding your first point: if the database-level local lock is not exclusive, the generated undo log will be incorrect. The global lock and the local lock (i.e., the database’s exclusive lock) must cooperate to ensure undo logs are correct in AT mode; they form a reciprocal relationship. The local lock prevents other transactions from interfering with the current branch transaction so that a correct undo log can be created. After producing the undo log, the branch registers with the server and then acquires the corresponding global lock. Only when the global lock is obtained (which indicates no other distributed transactions hold the resources associated with that branch lock) can the local transaction be committed. This prevents dirty writes and allows the local lock to be released safely. If you obtain only the global lock without the local lock, that does not mean the undo log was created correctly; this will lead to incorrect data during two-phase rollback.

@funky-eyes
Thank you for the detailed technical feedback. You are absolutely correct regarding the reciprocal relationship between Local and Global locks. In a traditional OLTP database, the Local Lock is essential to ensure the before image remains consistent until the Global Lock is secured.

Since ClickHouse and the MergeTree family handle concurrency using MVCC and do not support pessimistic row-level locking, I acknowledge that a pure AT implementation is theoretically limited when concurrent non-Seata writers are present. However, a robust best-effort AT mode can still be achieved for ClickHouse by leveraging its recent capabilities.

Snapshot isolation and conflict detection can be used through experimental transactions by enabling allow_experimental_transactions = 1. This provides snapshot isolation where, if another transaction modifies the data after the before-image is read but before the local commit occurs, ClickHouse's MVCC will detect the conflict and the transaction will fail. This effectively shifts the approach from the pessimistic locking used in systems like MySQL to optimistic concurrency control, which aligns better with OLAP database behavior.

Synchronous execution is also prioritized by using mutations_sync = 1, so that even without a lock the state is persisted and verifiable before proceeding to Phase 2.

This pull request mainly focuses on foundational SPI work by introducing SQL generation logic and TableMeta handling for ClickHouse. Even if a stricter implementation is introduced in the future, this infrastructure is required as the first step to properly support ClickHouse within Seata.

I am open to adding a warning or experimental label to the ClickHouse resource manager and documenting that it relies on optimistic concurrency through snapshot isolation. I would appreciate your thoughts on whether this pragmatic approach aligns with the direction of supporting OLAP style databases within Seata.


funky-eyes commented Mar 6, 2026

> Snapshot isolation and conflict detection can be used through experimental transactions by enabling allow_experimental_transactions = 1. This provides snapshot isolation where, if another transaction modifies the data after the before-image is read but before the local commit occurs, ClickHouse's MVCC will detect the conflict and the transaction will fail. This effectively shifts the approach from the pessimistic locking used in systems like MySQL to optimistic concurrency control, which aligns better with OLAP database behavior.

I want to know: if two transactions, TX1 and TX2, both modify the same row — suppose the original username is "John".

TX1:

  T1: begin
  T2: select for update (before image)
  T3: update user set name = 'jackson' where id = 1
  T6: after image
  T7: commit

TX2:

  T1: begin
  T4: select for update (before image)
  T8: update user set name = 'Johnny' where id = 1
  T9: after image
  T10: commit

Can both transactions succeed? At T4, when TX2 reads the image, is the username "John" or "jackson"? If TX2 fails, that’s fine because its local transaction will simply roll back. However, if both transactions can commit locally, the correct username should be "jackson" — if it remains "John" that would be a serious problem. In a traditional relational database, the SELECT ... FOR UPDATE at T4 would be blocked until after T7; if it is not blocked and instead reads the data directly, then if TX2’s global transaction later decides to roll back, it could effectively erase the result that TX1 already committed.

@Sumit6307

> Snapshot isolation and conflict detection can be used through experimental transactions by enabling allow_experimental_transactions = 1. This provides snapshot isolation where, if another transaction modifies the data after the before-image is read but before the local commit occurs, ClickHouse's MVCC will detect the conflict and the transaction will fail. This effectively shifts the approach from the pessimistic locking used in systems like MySQL to optimistic concurrency control, which aligns better with OLAP database behavior.
>
> I want to know: if two transactions, TX1 and TX2, both modify the same row, suppose the original username is "John".
>
> TX1:
>   T1: begin
>   T2: select for update (before image)
>   T3: update user set name = 'jackson' where id = 1
>   T6: after image
>   T7: commit
>
> TX2:
>   T1: begin
>   T4: select for update (before image)
>   T8: update user set name = 'Johnny' where id = 1
>   T9: after image
>   T10: commit
>
> Can both transactions succeed? At T4, when TX2 reads the image, is the username "John" or "jackson"? If TX2 fails, that’s fine because its local transaction will simply roll back. However, if both transactions can commit locally, the correct username should be "jackson"; if it remains "John", that would be a serious problem. In a traditional relational database, the SELECT ... FOR UPDATE at T4 would be blocked until after T7; if it is not blocked and instead reads the data directly, then if TX2’s global transaction later decides to roll back, it could effectively erase the result that TX1 already committed.

@funky-eyes

That is a very sharp observation regarding the T4 race condition. You are correct that without pessimistic locking, TX2 could capture a stale before-image ('John') before TX1 commits.

However, I've researched the concurrency behavior of ClickHouse's experimental transactions, and here is how this implementation maintains correctness:

Write-Conflict Detection: Since ClickHouse 22.x+ uses Snapshot Isolation for transactions, it implements First-Committer-Wins. If TX1 and TX2 both read 'John' and TX1 commits first, ClickHouse will detect that the snapshot for TX2 is now stale. When TX2 attempts to commit its update at T10, ClickHouse will throw an Update Conflict error and force TX2 to roll back locally.

Rollback Protection: Because TX2 fails its local transaction at T10, it never reports a 'Success' to the Seata TC. Therefore, Seata will never trigger a global rollback for TX2 using the stale before-image. The data 'jackson' committed by TX1 remains safe.

The Role of Global Lock: Even if TX2 were to wait for the Global Lock, the ClickHouse MVCC still ensures that any write based on an old snapshot is rejected.

Pragmatic Recommendation: For production users, we specifically recommend combining this AT mode with:

  • allow_experimental_transactions = 1 for the conflict detection mentioned above.
  • mutations_sync = 1 to close the asynchronous window as much as possible.
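The two recommendations above can be sketched as the session setup a connection initializer might issue; the setting names are taken from this discussion and should be verified against the target ClickHouse version:

```java
import java.util.List;

public class ClickhouseSessionSetupSketch {
    // Session statements this thread recommends issuing before any
    // Seata-managed mutation runs on the connection.
    static List<String> recommendedSessionSetup() {
        return List.of(
            "SET allow_experimental_transactions = 1", // MVCC transactions + write-conflict detection
            "SET mutations_sync = 1"                   // block until mutations are persisted
        );
    }

    public static void main(String[] args) {
        recommendedSessionSetup().forEach(System.out::println);
    }
}
```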

Does the behavior of ClickHouse's 'First-Committer-Wins' Snapshot Isolation address your concern about the T4 image capture? I am happy to add these technical details to the README or a Clickhouse-AT-Mode.md guide to help users understand the isolation level.

@Sumit6307

@funky-eyes Please review this PR.

@funky-eyes

The two transactions committed successfully because they were committed at different times, so the "First-Committer-Wins" mechanism doesn't provide any help in this scenario.

As for the snapshot you mentioned, due to the absence of row locks or table locks, the "select for update" operation, which is supposed to acquire the row count before executing DML statements to ensure the accuracy of the before image, fails to work. Without an accurate before image, it's impossible to guarantee the accuracy of the rollback data when the two-phase decision results in a rollback.

The prerequisite for creating an accurate before image in Seata AT is a current read, not a snapshot read.

@Sumit6307

> The two transactions committed successfully because they were committed at different times, so the "First-Committer-Wins" mechanism doesn't provide any help in this scenario.
>
> As for the snapshot you mentioned, due to the absence of row locks or table locks, the "select for update" operation, which is supposed to acquire the row count before executing DML statements to ensure the accuracy of the before image, fails to work. Without an accurate before image, it's impossible to guarantee the accuracy of the rollback data when the two-phase decision results in a rollback.
>
> The prerequisite for creating an accurate before image in Seata AT is a current read, not a snapshot read.

@funky-eyes Thank you for the technical guidance. I completely agree with your assessment: ClickHouse fundamentally performs Snapshot Reads, and without a native row-level "Current Read" (SELECT FOR UPDATE), a race condition exists where Seata could capture a stale before-image.

Since ClickHouse's architecture (MergeTree + MVCC) makes a traditional OLTP-style "Current Read" impossible, I propose we solve this at the Seata Protocol level for this specific resource manager:

  1. Serialization via Global Lock: In this implementation, the Global Lock becomes our primary source of truth. By ensuring the RM captures the Before-Image only after successfully securing the Global Lock, we create a logical barrier. If TX1 holds the Global Lock, TX2 will be blocked at the branchRegister stage and won't be able to fetch its (potentially stale) before-image until TX1 releases the lock.
  2. Addressing External Writers: I acknowledge that external writers (not using Seata) could still bypass this. However, for users who manage their ClickHouse mutations exclusively through Seata, this "Global Lock + Post-Registration Fetch" pattern provides the necessary distributed consistency.
  3. Experimental Status: Given ClickHouse’s OLAP nature, I propose we support this as an Experimental Feature. We can document that:
    • It relies on Global Lock Serialization rather than DB-level local locks.
    • It requires mutations_sync=1 to ensure synchronous persistence of the undo-log context.

This PR provides the essential SQL Dialect and SPI infrastructure for ClickHouse. Would you be open to merging this with an "Experimental" tag and clear documentation on these isolation trade-offs? This allows the community to start using Seata with ClickHouse while we continue to refine the consistency model.
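A minimal sketch of the "Global Lock + Post-Registration Fetch" ordering proposed above, with hypothetical interfaces rather than Seata's actual API:

```java
import java.util.ArrayList;
import java.util.List;

public class GlobalLockFirstSketch {
    // Hypothetical stand-ins for the transaction coordinator client
    // and the resource manager's row reader.
    interface Coordinator { boolean branchRegister(String lockKey); }
    interface Reader { String readBeforeImage(String lockKey); }

    // Phase 1: acquire the global lock first, and only then snapshot the row.
    static List<String> phaseOne(Coordinator tc, Reader rm, String lockKey) {
        List<String> steps = new ArrayList<>();
        if (!tc.branchRegister(lockKey)) {   // another branch holds the global lock
            steps.add("blocked");            // caller should wait/retry, not read yet
            return steps;
        }
        steps.add("lock-acquired");
        steps.add("before-image=" + rm.readBeforeImage(lockKey)); // safe to read now
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(phaseOne(k -> true, k -> "John", "user:1"));
    }
}
```

The point of the ordering is that a branch blocked at branchRegister never captures a before-image at all, so it cannot capture a stale one.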
