StreamWAL #11
Draft
jtanza wants to merge 14 commits into
egladysh pushed a commit that referenced this pull request on Jun 18, 2026

…tion

Summary:

## Problem

A READ COMMITTED retry triggered by a deadlock / abort could SIGSEGV at `AfterTriggerEndSubXact` during the subsequent ROLLBACK.

Stack trace:

```
Signal: SIGSEGV
#0 GetMemoryChunkContext(pointer=0x0) memutils.h:141:12 (inlined)
#1 pfree(pointer=0x0) mcxt.c:1500:26 (inlined)
#2 AfterTriggerEndSubXact trigger.c:5657:4
#3 AbortSubTransaction xact.c:5726:3
#4 CommitTransactionCommand xact.c:3650:4
#5 CommitTransactionCommand xact.c:0
#6 yb_exec_simple_query_impl postgres.c:3023:3 (inlined finish_xact_command)
#7 yb_exec_simple_query_impl postgres.c:1494:4
#8 yb_exec_simple_query_impl postgres.c:5804:2
#9 yb_exec_query_wrapper_one_attempt postgres.c:5764:3
#10 PostgresMain postgres.c:5796:3
#11 PostgresMain postgres.c:5821:2 (inlined yb_exec_simple_query)
#12 PostgresMain postgres.c:6623:8
```

### Reading the stack

- **#2 `AfterTriggerEndSubXact` (trigger.c:5657)** -- the call site is `pfree(afterTriggers.state)` in the abort branch, gated on `trans_stack[my_level].state != NULL`.
- **#3-#5 `CommitTransactionCommand` -> `AbortSubTransaction`** -- called because of ROLLBACK.

### Root cause

The retry path before this change ran, in order:

```
yb_restart_transaction
-> YBCRestartWriteTransaction
-> AfterTriggerEndXact(false) // wipes trans_stack to NULL, maxtransdepth to 0
                              // and afterTriggers.state to NULL
-> AfterTriggerBeginXact()
-> RollbackAndReleaseCurrentSubTransaction
-> YbBeginInternalSubTransactionForReadCommittedStatement
-> AfterTriggerBeginSubXact
-> MemoryContextAlloc(8 * sizeof(AfterTriggersTransData)) // NOT AllocZero: leaves N-1 slots uninitialized
-> initializes trans_stack[my_level] only
```

Any live subxact with level < `my_level` -- a user `SAVEPOINT` and the per-statement RC internal subxact above it -- now points at an uninitialized slot. ROLLBACK reads garbage as `state`, the non-NULL gate passes, and `pfree(afterTriggers.state)` faults because `EndXact` already cleared that field.

## Fix

`YBCRestartWriteTransaction` now rolls back every savepoint / subtransaction before recreating the top-level write state, so the PG-side `trans_stack` is empty by the time the surgical reset wipes the after-trigger state. Removed the now-redundant `RollbackAndReleaseCurrentSubTransaction()` from the else branch of `yb_restart_transaction`.

Test Plan: Jenkins
Reviewers: pjain, smishra
Reviewed By: pjain
Subscribers: ybase, yql
Differential Revision: https://phorge.dev.yugabyte.com/D54387
Overall goal
This PR adds `StreamWAL`, a new per-tablet, leader-only tserver RPC that delivers fully-decoded, committed change events as a stream of `CDCSDKProtoRecordPB`, the same wire format the existing CDCSDK gRPC connector already consumes.

The client owns all stream state (a per-tablet `(term, index)` cursor); the server registers nothing. There is no `cdc_state` table, no stream IDs, no master-driven control plane, and no per-stream aggregation.

The data plane reuses the existing CDCSDK decoder family unchanged: transactional `WRITE_OP`s are skipped on the wire and the corresponding rows are emitted at `APPLYING` time by reading intents from IntentsDB, sandwiched between `BEGIN`/`COMMIT` envelopes stamped with `commit_hybrid_time`.

By default StreamWAL delivers records in WAL (apply) order. It also adds an optional, per-request consistent-commit-order mode that instead delivers committed records in commit-time order, watermark-gated, via a composite (term, index, commit_ht) cursor.
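
As a rough illustration of the composite cursor, the sketch below applies the resume rule this description gives for consistent-commit-order mode: records re-read from the WAL floor are skipped when `commit_time <= commit_ht`. The `Record` and `Batch` types and the `BuildCommitOrderBatch` function are hypothetical names, not code from this PR, and the gating rule (hold back anything committed after the resolution watermark) is an assumption about what "watermark-gated" means.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a decoded, committed record.
struct Record {
  int64_t wal_index;     // position in the WAL (apply order)
  uint64_t commit_time;  // commit hybrid time, simplified to an integer
};

// Hypothetical response shape: the emitted records plus the new frontier.
struct Batch {
  std::vector<Record> records;
  uint64_t next_commit_ht;
};

// Re-reads from the WAL floor, drops what the frontier says was already
// delivered, holds back what the watermark has not yet resolved (assumed
// semantics), and emits the rest in commit-time order.
Batch BuildCommitOrderBatch(const std::vector<Record>& reread_from_floor,
                            uint64_t frontier_commit_ht,
                            uint64_t resolution_safe_time) {
  std::vector<Record> out;
  for (const Record& r : reread_from_floor) {
    if (r.commit_time <= frontier_commit_ht) continue;   // dedup on resume
    if (r.commit_time > resolution_safe_time) continue;  // not yet resolved
    out.push_back(r);
  }
  std::stable_sort(out.begin(), out.end(),
                   [](const Record& a, const Record& b) {
                     return a.commit_time < b.commit_time;
                   });
  const uint64_t next = out.empty() ? frontier_commit_ht
                                    : out.back().commit_time;
  assert(next >= frontier_commit_ht);  // the frontier never moves backward
  return Batch{std::move(out), next};
}
```

Calling the function a second time with the returned `next_commit_ht` over the same re-read window yields an empty batch. That is how the cursor handles deduplication without any state kept on the client.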

`StreamWAL` lives alongside `GetChanges`; no existing CDC code path is modified. The one cross-cutting server change is a new wall-clock intent-retention mechanism (gated behind a flag, default-off) that lets a checkpoint-less consumer keep just-applied intents readable.

These changes are broken into a number of sections:
Proto changes
- `StreamWAL` RPC to the existing `CDCService`, three new top-level messages (`StreamWalRequestPB`, `StreamWalResponsePB`, `StreamWalCursorPB`)
- `CDCSDKProtoRecordPB` (`aborted_subtxn_set`, `split_tablet_request`)
- `CDCErrorPB::Code` value (`INTENTS_GC_ERROR = 15`).
- `StreamWalRequestPB.consistent_commit_order` (bool, default false), `StreamWalCursorPB.commit_ht` (the commit-time frontier), and `StreamWalResponsePB.resolution_safe_time` (the per-tablet resolution watermark the batch gated on).

Files changed:
`src/yb/cdc/cdc_service.proto`

StreamWAL handler and control flow
- `CDCServiceImpl::StreamWAL` handler.
- `ReplicateMsg`s and dispatches each through the decoder, leader/safe-time resolution, `INTENTS_GC_ERROR` detection, and the partial-APPLYING batch/spill logic.
- `StreamWalDecodeContext`, `StreamWalIntentResumeState`, `StreamWalDispatchResult`) live in `cdc_producer.h`.
- `consistent_commit_order=true`, the handler forks early into `CDCServiceImpl::HandleStreamWALConsistentCommitOrder` and returns; the default (WAL-order) path is left entirely untouched.

Files changed:
`src/yb/cdc/cdc_service.cc`, `src/yb/cdc/cdc_service.h`, `src/yb/cdc/cdc_producer.h`

Decoder helpers reusing the CDCSDK pipeline
- `DispatchWalOpForStreamWAL`, `DispatchApplyingForStreamWALImpl`) and envelope builders (`PopulateSyntheticBootstrapDDLs`, `PopulateStreamWalApplyingRecord`, `PopulateStreamWalSplitRecord`, `StampOpIdOnLeadingBeginForStreamWAL`).
- `Populate*` decoder family (`PopulateCDCSDKWriteRecord`, `PopulateCDCSDKIntentRecord`, DDL/truncate fillers), which are reused verbatim.

Files changed:
`src/yb/cdc/cdcsdk_producer.cc`

Wall-clock intent retention
- `--intents_min_seconds_to_retain` (default `0`), a time-based parallel to `--log_min_seconds_to_retain` for IntentsDB, so a checkpoint-less consumer can read committed-but-just-applied intents without a per-stream lease barrier.
- `TransactionParticipant::Cleanup` and the IntentsDB compaction filter now consult the commit hybrid time when deciding whether to GC a transaction's intents, and `TransactionIdApplyOpIdMap` in `transaction.h` changes from mapping to a bare `OpId` to a `TransactionApplyOpIdInfo` struct that also carries `commit_ht`.
- `0`, every path falls through to existing behavior, so this is a no-op for non-CDC and lease-based CDC clusters.
- `RunningTransaction::SetApplyHybridTimes` (stamps commit/log HT for single-batch applies), `Tablet/docdb::CountTxnReverseIndexEntriesForCDC` (distinguishes a real intent-GC from a zero-intent txn to avoid spurious `INTENTS_GC_ERROR`).

Files changed:
`src/yb/tablet/transaction_participant.cc`, `src/yb/docdb/docdb_compaction_filter_intents.cc`, `src/yb/common/transaction.h`, `src/yb/tablet/running_transaction.cc`, `src/yb/tablet/running_transaction.h`, `src/yb/tablet/tablet.cc`, `src/yb/tablet/tablet.h`, `src/yb/docdb/docdb.cc`, `src/yb/docdb/docdb.h`

Stream-less metadata bootstrap
- `StreamMetadata` can be constructed for stream-id-less callers, letting `StreamWAL` reuse the existing decoder context without registering a stream.

Files changed:
`src/yb/cdc/xrepl_stream_metadata.cc`, `src/yb/cdc/xrepl_stream_metadata.h`

Consistent commit-order mode (optional, per-request)
- order, gated behind the per-tablet resolution watermark, using a composite cursor: (term, index) is the WAL re-read floor (and retention point) and commit_ht is the commit-time frontier (the server skips records with commit_time <= commit_ht on resume). Floor + frontier together discharge ordering and dedup with zero client state.
- StreamWalConsistentOutput structs (in cdc_producer.h). It mirrors the consistent branch of GetChangesForCDCSDK but emits via the existing StreamWAL decoder dispatch and packages the composite cursor.
- `FLAGS_cdc_enable_consistent_records` gflag that governs the legacy GetChanges path.

Files changed:
`src/yb/cdc/cdcsdk_producer.cc`, `src/yb/cdc/cdc_producer.h`, `src/yb/cdc/cdc_service.cc`

Java client bindings & leader-hint failover
- `streamWAL(...)` client methods and the request/response wrappers.
- `TabletClient.decode` gains a CDC-scoped branch that, on `LEADER_NOT_READY`, applies the `tablet_consensus_info` hint from the response to refresh the meta-cache leader pointer before retrying (`RemoteTablet.applyLeaderHint`), avoiding a master round-trip on leader failover mid-stream.
- `StreamWalRequest`/`StreamWalResponse` pairs, so existing RPC dispatch is unaffected.

Files changed:
`java/yb-client/src/main/java/org/yb/client/AsyncYBClient.java`, `YBClient.java`, `StreamWalRequest.java` (new), `StreamWalResponse.java` (new), `TabletClient.java`

Server-side observability metrics
- stream-id-less design, these attach directly to the tablet's existing metric entity (not a separate per-stream entity like CDCSDKTabletMetrics/XClusterTabletMetrics), so they aggregate at the table level
- handler_latency_yb_cdc_CDCService_StreamWAL family (plus service_request_bytes / service_response_bytes), so no custom RPC metrics are added.
- streamwal_intents_gc_errors increments on the two INTENTS_GC_ERROR return paths, and the success-path gauge/counter updates before RespondSuccess(). No existing metric or handler is modified.
Metrics added:
- `streamwal_records_sent`: counter (units), aggregated sum. Total decoded change records sent over StreamWAL for this tablet.
- `streamwal_traffic_sent`: counter (bytes), aggregated sum. Total decoded record payload bytes sent over StreamWAL.
- `streamwal_sent_lag_micros`: gauge (µs), aggregated max. Lag between the leader's safe time and the commit time of the last record sent.
- `streamwal_wal_lag_index`: gauge (ops), aggregated max. WAL ops between the leader tip and the read cursor (leader_tip.index − next_op_id.index).
- `streamwal_intent_retention_window_secs`: gauge (s), aggregated max. Current `--intents_min_seconds_to_retain`; lets dashboards derive intent-retention headroom as window − sent_lag. A true kMin "headroom" gauge is not expressible (only kSum/kMax aggregation functions exist), so the window is surfaced and headroom is computed against the kMax lag downstream.
- `streamwal_intents_gc_errors`: counter (units), aggregated sum. Count of INTENTS_GC_ERROR responses (intents GC'd before StreamWAL could read them). Must always be zero; non-zero indicates data loss.
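
The two derived quantities mentioned in the list (headroom as window − sent_lag, and the WAL lag as leader_tip.index − next_op_id.index) can be sketched as below. This is an illustrative dashboard-side calculation, not code from this PR; the `StreamWalGauges` struct and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of two gauges from the list above.
struct StreamWalGauges {
  int64_t sent_lag_micros;               // streamwal_sent_lag_micros (kMax)
  int64_t intent_retention_window_secs;  // streamwal_intent_retention_window_secs
};

// Headroom = window - sent_lag, expressed in microseconds. A value near or
// below zero means the consumer is lagging close to, or past, the retention
// window. When the window is 0 the retention flag is off, so the result is
// not meaningful.
int64_t IntentRetentionHeadroomMicros(const StreamWalGauges& g) {
  assert(g.intent_retention_window_secs >= 0);
  return g.intent_retention_window_secs * 1000000LL - g.sent_lag_micros;
}

// streamwal_wal_lag_index = leader_tip.index - next_op_id.index
int64_t WalLagIndex(int64_t leader_tip_index, int64_t next_op_id_index) {
  return leader_tip_index - next_op_id_index;
}
```

Because the lag gauge aggregates with kMax, headroom computed this way is a worst-case figure across the tablets being aggregated.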
Files changed:
`src/yb/cdc/xrepl_metrics.h`, `src/yb/cdc/xrepl_metrics.cc`, `src/yb/cdc/cdc_service.cc`, `src/yb/cdc/stream_wal-test.cc`
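
For the wall-clock intent retention section earlier in this description, a minimal sketch of the kind of check a cleanup path might make is shown below. The function name and the exact comparison are assumptions: the description only says that the commit hybrid time is consulted and that a flag value of `0` falls through to existing behavior. Times are simplified to physical microseconds, whereas real hybrid times also carry a logical component.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns true when a transaction's intents should be
// kept because it committed within the retention window. Returning false
// means "no opinion", so the pre-existing GC rules decide.
bool WithinWallClockRetention(uint64_t now_micros,
                              uint64_t commit_micros,
                              uint64_t min_seconds_to_retain) {
  if (min_seconds_to_retain == 0) {
    return false;  // flag off (the default): fall through to existing behavior
  }
  if (commit_micros >= now_micros) {
    return true;  // commit time not behind "now": treat as inside the window
  }
  return now_micros - commit_micros < min_seconds_to_retain * 1000000ULL;
}
```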