feat: Add Advanced ClickHouse AT Mode Support #8007
…meout - doPostWithHttp2(..., callback, 30) used 30ms timeout; watch holds until event (~2s) - Change to 30000ms to match blocking watch() test so callback runs and latch counts down - Fixes 'expected: <true> but was: <false>' at latch.await(35, TimeUnit.SECONDS)
b7eb8a6 to 6928ec2
funky-eyes
left a comment
Hi Sumit,
Thanks for proposing this feature idea. However, as I understand it, ClickHouse currently appears to lack table-level or row-level locking, so it would be impossible to construct correct undo/redo information when using FOR UPDATE to produce undo-log records. Because the undo log is not stored together with the business table, I seriously doubt that committing such a local transaction could achieve the ACID guarantees expected of a traditional relational database.
Moreover, Seata’s AT mode requires obtaining the primary keys of the tables affected by DML in order to build global locks. ClickHouse writes are asynchronous and usually incur some delay while data is being merged. If a transaction needs to be rolled back during that period, can the primary-key information for the branch within the global transaction be reliably queried, and will the query results match the redo (after-image) content stored in the undo log?
Can you provide evidence or documentation showing that ClickHouse’s ACID capabilities can be made to work correctly with Seata’s AT mode?
Thank you for the detailed technical feedback. You are correct that ClickHouse’s OLAP architecture differs significantly from traditional RDBMS systems. However, Seata’s AT mode can be adapted to work reliably with ClickHouse through the following mechanisms:
Although ClickHouse does not support native SELECT FOR UPDATE row-level locking, Seata AT mode provides isolation through its Global Lock mechanism managed by the Seata Server (Transaction Coordinator). Before a local transaction completes Phase 1, Seata acquires a lock on the corresponding Primary Key in the global lock table. This prevents other Seata-managed transactions from performing conflicting updates on the same records. While this is not physical row-level locking at the database layer, it provides the required logical isolation to prevent dirty writes within Seata-managed distributed transactions.
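The logical isolation described above can be pictured with a toy in-memory lock table. This is only a sketch: the class GlobalLockSketch and its methods are hypothetical names invented for illustration and are not Seata's actual lock manager. It shows that the lock is keyed by table and primary key and only constrains writers that go through it.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of a global lock table keyed by "table:pk" (hypothetical, not Seata's code). */
public class GlobalLockSketch {
    // lock key -> XID of the global transaction that currently holds it
    private final Map<String, String> locks = new ConcurrentHashMap<>();

    /** Tries to lock every primary key for the given XID; all-or-nothing. */
    public synchronized boolean acquire(String xid, String table, List<String> pks) {
        for (String pk : pks) {
            String owner = locks.get(table + ":" + pk);
            if (owner != null && !owner.equals(xid)) {
                return false; // another global transaction already holds this row
            }
        }
        for (String pk : pks) {
            locks.put(table + ":" + pk, xid);
        }
        return true;
    }

    /** Releases every lock held by the XID, e.g. after global commit or rollback. */
    public synchronized void release(String xid) {
        locks.values().removeIf(xid::equals);
    }
}
```

A writer that bypasses this table (a non-Seata client) is not blocked by it, which is the limitation acknowledged in the paragraph above.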
To address the asynchronous write behavior in ClickHouse, this implementation recommends configuring the session parameter SET mutations_sync = 1 (or 2 in distributed cluster environments). This forces ClickHouse to wait until the ALTER TABLE ... UPDATE/DELETE mutation is persisted before returning control to the application. As a result, the before-image and after-image used by Seata for undo/redo operations remain consistent and reliable.
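A minimal JDBC sketch of that recommendation follows. The class SyncMutationSketch and its methods are hypothetical. It also assumes the JDBC driver keeps session settings for the lifetime of the connection, which is not guaranteed for every driver or transport and should be checked.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical helper: enable synchronous mutations, then run one mutation. */
public class SyncMutationSketch {

    /** Builds the session statement: 0 = asynchronous, 1 = wait on this server, 2 = wait on all replicas. */
    public static String mutationsSyncSql(int level) {
        if (level < 0 || level > 2) {
            throw new IllegalArgumentException("mutations_sync must be 0, 1 or 2");
        }
        return "SET mutations_sync = " + level;
    }

    /** Runs the mutation on a caller-supplied connection after applying the setting. */
    public static void runSynchronously(Connection conn, String mutationSql, int level)
            throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(mutationsSyncSql(level)); // assumption: the setting sticks to this session
            st.execute(mutationSql);             // e.g. an ALTER TABLE ... UPDATE ... WHERE ... statement
        }
    }
}
```

Waiting for the mutation to finish only removes the delay between issuing the statement and seeing its effect; it does not by itself add any locking.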
By combining mutations_sync with Seata’s Phase 1 execution logic, the business data mutation and the corresponding undo_log entry are synchronized from the application perspective. While ClickHouse does not provide traditional cross-table transactional guarantees like OLTP databases, the synchronous mutation behavior ensures that once the call returns successfully, the data state is persisted and can be deterministically rolled back if required during Phase 2.
In ClickhouseTableMetaCache, I have implemented logic to correctly identify the Primary Key or ORDER BY columns used in the MergeTree family engines. Since ClickHouse relies on sorting keys rather than conventional primary keys, this ensures that Seata can accurately determine which rows need to be locked and rolled back during distributed transaction processing. Together, these mechanisms allow Seata AT mode to operate in a logically consistent and controlled manner on top of ClickHouse despite its architectural differences from traditional RDBMS systems.
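One way such key detection could look is sketched below. This is not the code from ClickhouseTableMetaCache: the class KeyColumnsSketch is hypothetical. It assumes the key expressions are read from the primary_key and sorting_key columns of system.tables, and it only handles plain comma-separated column lists, not function expressions such as toYYYYMM(date).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: derive lockable key columns from MergeTree key expressions. */
public class KeyColumnsSketch {

    /** Query whose two result columns would feed keyColumns(...). */
    public static final String KEY_QUERY =
            "SELECT primary_key, sorting_key FROM system.tables WHERE database = ? AND name = ?";

    /** Prefers the primary key expression; falls back to the sorting (ORDER BY) key. */
    public static List<String> keyColumns(String primaryKey, String sortingKey) {
        String expr = (primaryKey != null && !primaryKey.trim().isEmpty()) ? primaryKey : sortingKey;
        List<String> cols = new ArrayList<>();
        if (expr == null) {
            return cols;
        }
        // Assumption: plain column names only; a comma inside a function call would break this split.
        for (String part : expr.split(",")) {
            String col = part.trim();
            if (!col.isEmpty()) {
                cols.add(col);
            }
        }
        return cols;
    }
}
```

Note that a ClickHouse primary key does not enforce uniqueness, so a key value found this way may match more than one row.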
Regarding your first point:
@funky-eyes Since ClickHouse and the MergeTree family handle concurrency using MVCC and do not support pessimistic row-level locking, I acknowledge that a pure AT implementation is theoretically limited when concurrent non-Seata writers are present. However, a robust best-effort AT mode can still be achieved for ClickHouse by leveraging its recent capabilities. Snapshot isolation and conflict detection are available through experimental transactions, enabled with allow_experimental_transactions = 1. If another transaction modifies the data after the before-image is read but before the local commit occurs, ClickHouse MVCC will detect the conflict and the transaction will fail. This effectively shifts the approach from the pessimistic locking used in systems like MySQL to optimistic concurrency control, which aligns better with OLAP database behavior. Synchronous execution is also prioritized by setting mutations_sync = 1, so that even without a lock the state is persisted and verifiable before proceeding to Phase 2. This pull request mainly focuses on foundational SPI work by introducing SQL generation logic and TableMeta handling for ClickHouse. Even if a stricter implementation is introduced in the future, this infrastructure is required as the first step to properly support ClickHouse within Seata. I am open to adding a warning or experimental label to the ClickHouse resource manager and documenting that it relies on optimistic concurrency through snapshot isolation. I would appreciate your thoughts on whether this pragmatic approach aligns with the direction of supporting OLAP-style databases within Seata.
I want to know: if two transactions, TX1 and TX2, both modify the same row — suppose the original username is "John". TX1: TX2: Can both transactions succeed? At T4, when TX2 reads the image, is the username "John" or "jackson"? If TX2 fails, that’s fine because its local transaction will simply roll back. However, if both transactions can commit locally, the correct username should be "jackson" — if it remains "John" that would be a serious problem. In a traditional relational database, the SELECT ... FOR UPDATE at T4 would be blocked until after T7; if it is not blocked and instead reads the data directly, then if TX2’s global transaction later decides to roll back, it could effectively erase the result that TX1 already committed.
That is a very sharp observation regarding the T4 race condition. You are correct that without pessimistic locking, TX2 could capture a stale before-image ('John') before TX1 commits. However, I've researched the concurrency behavior of ClickHouse's experimental transactions, and here is how this implementation maintains correctness:
Does the behavior of ClickHouse's 'First-Committer-Wins' Snapshot Isolation address your concern about the T4 image capture? I am happy to add these technical details to the README or a Clickhouse-AT-Mode.md guide to help users understand the isolation level.
@funky-eyes Sir, please review this PR.
The two transactions committed successfully because they were committed at different times, so the "First-Committer-Wins" mechanism doesn't provide any help in this scenario. As for the snapshot you mentioned, due to the absence of row locks or table locks, the "select for update" operation, which is supposed to acquire the row lock before executing DML statements to ensure the accuracy of the before image, fails to work. Without an accurate before image, it's impossible to guarantee the accuracy of the rollback data when the two-phase decision results in a rollback. The prerequisite for creating an accurate before image in Seata AT is a current read, not a snapshot read.
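The failure mode described here can be replayed with a small single-threaded simulation. The interleaving below is a hypothetical one chosen for illustration, not the exact timeline from the earlier comment; the class name and the value "tx2-value" are likewise invented. A plain map stands in for the table.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical replay: a stale before-image makes a rollback erase another transaction's commit. */
public class StaleBeforeImageSketch {

    /** Returns the final username of row 1 after the assumed interleaving. */
    public static String run() {
        Map<Integer, String> table = new HashMap<>();
        table.put(1, "John");

        // TX2 captures its before-image with a non-blocking snapshot read.
        String tx2BeforeImage = table.get(1);

        // TX1 updates the row and commits, locally and globally.
        table.put(1, "jackson");

        // TX2 applies its own update and commits locally at a later time.
        table.put(1, "tx2-value");

        // TX2's global transaction is rolled back: the undo log restores the stale image.
        table.put(1, tx2BeforeImage);

        return table.get(1);
    }
}
```

With a blocking current read, TX2 would have captured "jackson" as its before-image, so the rollback would have restored TX1's committed value instead of "John".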
@funky-eyes Thank you for the technical guidance. I completely agree with your assessment: ClickHouse fundamentally performs Snapshot Reads, and without a native row-level "Current Read" (SELECT FOR UPDATE), a race condition exists where Seata could capture a stale before-image. Since ClickHouse's architecture (MergeTree + MVCC) makes a traditional OLTP-style "Current Read" impossible, I propose we solve this at the Seata Protocol level for this specific resource manager:
This PR provides the essential SQL Dialect and SPI infrastructure for ClickHouse. Would you be open to merging this with an "Experimental" tag and clear documentation on these isolation trade-offs? This allows the community to start using Seata with ClickHouse while we continue to refine the consistency model.
Ⅰ. Describe what this PR did
This PR introduces native AT Mode support for ClickHouse. Because ClickHouse handles data mutations asynchronously and uses different syntax compared to standard relational databases, new executors were built specifically for its SQL dialect.
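To illustrate the dialect difference, here is a rough sketch of how undo statements might be built as ALTER TABLE mutations. The class ClickHouseUndoSqlSketch and its methods are hypothetical and are not the executors added by this PR. A single key column is assumed, and whether the JDBC driver accepts ? placeholders inside ALTER statements is also an assumption.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical builder for ClickHouse-style undo statements (not the PR's executors). */
public class ClickHouseUndoSqlSketch {

    /** Undo of an INSERT: delete the inserted row, identified by its key column. */
    public static String undoInsert(String table, String keyColumn) {
        return "ALTER TABLE " + table + " DELETE WHERE " + keyColumn + " = ?";
    }

    /** Undo of an UPDATE: write the before-image values back to the non-key columns. */
    public static String undoUpdate(String table, List<String> columns, String keyColumn) {
        // Key columns cannot be changed by a ClickHouse mutation, so only non-key columns belong here.
        String sets = columns.stream()
                .map(c -> c + " = ?")
                .collect(Collectors.joining(", "));
        return "ALTER TABLE " + table + " UPDATE " + sets + " WHERE " + keyColumn + " = ?";
    }
}
```

For example, undoUpdate("users", List.of("username", "email"), "id") yields ALTER TABLE users UPDATE username = ?, email = ? WHERE id = ?.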
Key Changes:
- … system.columns equivalent via JDBC).
- … ALTER TABLE ... DELETE WHERE ... syntax to roll back inserted records.
- … ALTER TABLE ... UPDATE ... WHERE ... syntax for undoing updates.
- … INSERT INTO.
- … DELETE FROM.
Ⅱ. Does this pull request fix one issue?
This operates as a Proof-of-Concept for a formal Google Summer of Code (GSoC) 2026 contribution.
Ⅲ. Why don't you add test cases (unit test/integration test)?
N/A - Unit tests have been added.
Added the following JUnit mock tests to verify exact SQL generation strings:
Ⅳ. Describe how to verify it
You can verify the SQL generation through the test suite: