[CDC] DocDB: Per-phase latency and batch-size metrics for CDCSDK GetChanges by bdbene · Pull Request #10 · Shopify/yugabyte-db

bdbene · 2026-05-25T20:38:07Z

Summary

Adds per-tablet histograms and counters on CDCSDKTabletMetrics so operators can attribute slow CDCSDK GetChanges RPCs to specific phases. Today the only visibility is the server-wide handler_latency_yb_cdc_CDCService_GetChanges and a few aggregate counters, which is not enough to tell whether the time is going to WAL reads, intent reads, schema lookups, the cdc_state checkpoint write, or somewhere else.

Scope: CDCSDK only. The xCluster path is unchanged.

Per-phase latency histograms (microseconds, per stream/tablet)

Total RPC duration is already covered by the server-wide handler_latency_yb_cdc_CDCService_GetChanges; the per-phase metrics below let operators attribute time within that total.

Metric	What it measures
`cdcsdk_get_changes_preflight_latency`	RPC entry up to `GetChangesForCDCSDK` (semaphore, validation, stream/tablet/leader lookup, schema/enum/composite caches)
`cdcsdk_get_last_checkpoint_latency`	`GetLastCheckpoint` (`cdc_state` read) when the client sends no `from_op_id`
`cdcsdk_update_checkpoint_latency`	`UpdateCheckpointAndActiveTime` (`cdc_state` write) on EXPLICIT ack
`cdcsdk_wal_read_latency`	`GetConsistentWALRecords` / `GetWALRecords`
`cdcsdk_process_intents_latency`	`ProcessIntentsWithInvalidSchemaRetry` per multi-shard txn
`cdcsdk_get_intents_latency`	`tablet->GetIntentsForCDC` alone (separates IntentsDB read from record population)
`cdcsdk_populate_write_record_latency`	`PopulateCDCSDKWriteRecord` for single-shard writes
`cdcsdk_schema_lookup_latency`	`client->GetTableSchemaFromSysCatalog` (master round-trips)
`cdcsdk_safe_time_wait_latency`	`GetConsistentStreamSafeTime` (waiting for safe time / txn load)

Per-call batch-size histograms (one observation per successful response)

cdcsdk_wal_records_read, cdcsdk_wal_bytes_read, cdcsdk_intents_per_txn, cdcsdk_response_records, cdcsdk_response_bytes.

Per-optype counters

Incremented in the GetChangesForCDCSDK main switch: cdcsdk_write_ops_seen, cdcsdk_update_txn_ops_seen, cdcsdk_change_metadata_ops_seen, cdcsdk_truncate_ops_seen, cdcsdk_split_ops_seen.

Implementation notes

Reuses the existing CDCThroughputMetrics struct that already threads counters from the producer to the service layer. Extended with a non-owning xrepl::CDCSDKTabletMetrics* pointer plus per-call WAL byte/record accumulators.
CDCServiceImpl::GetChanges acquires the per-tablet metric via the existing GetCDCSDKTabletMetrics() lookup (skipped for xCluster and the sys catalog tablet), populates the pointer before calling GetChangesForCDCSDK, and emits batch-size histograms at the end of the call.
All observation sites use ScopedLatencyMetric<EventStats>, which is null-safe — so the same code paths work when the per-tablet pointer is unset (xCluster, sys catalog, metric acquisition failure).
Cardinality: roughly triples the per-tablet metric series. Same (streams × tablets) scaling as the existing cdcsdk_sent_lag_micros etc., so no new cardinality regime, but worth measuring tserver heap impact on a busy cluster.

Test plan

./yb_build.sh release daemons initdb --sj --skip-pg-parquet --no-odyssey --no-ybc builds clean
Existing CDC tests pass: ./yb_build.sh release --cxx-test cdc_service-test and cdcsdk_producer-test
Manual smoke: spin up a single-node tserver, create a CDCSDK stream with the Debezium gRPC connector, run a workload, and verify the new cdcsdk_*_latency / cdcsdk_*_read / cdcsdk_*_per_txn series appear at http://<tserver>:9000/prometheus-metrics with non-zero counts and sensible percentiles
Confirm the metrics show up labeled by table_name / tablet_id / stream_id (same labels as cdcsdk_sent_lag_micros)
Confirm sys catalog tablet emits no new metrics (existing guard reused)

…hanges Adds per-tablet histograms and counters on CDCSDKTabletMetrics so operators can attribute slow GetChanges RPCs to specific phases. Today the only visibility is the server-wide handler_latency and a few aggregate counters, which is not enough to tell whether time is going to WAL reads, intent reads, schema lookups, the cdc_state checkpoint write, or somewhere else. Per-phase latency histograms (microseconds, per stream/tablet): - cdcsdk_get_changes_total_latency total RPC duration (per-tablet view) - cdcsdk_get_changes_preflight_latency RPC entry up to GetChangesForCDCSDK - cdcsdk_get_last_checkpoint_latency cdc_state read on missing from_op_id - cdcsdk_update_checkpoint_latency cdc_state write on EXPLICIT ack - cdcsdk_wal_read_latency GetConsistentWALRecords / GetWALRecords - cdcsdk_process_intents_latency ProcessIntentsWithInvalidSchemaRetry - cdcsdk_get_intents_latency tablet->GetIntentsForCDC alone - cdcsdk_populate_write_record_latency single-shard PopulateCDCSDKWriteRecord - cdcsdk_schema_lookup_latency client->GetTableSchemaFromSysCatalog - cdcsdk_safe_time_wait_latency GetConsistentStreamSafeTime Per-call batch-size histograms (one observation per successful response): - cdcsdk_wal_records_read / cdcsdk_wal_bytes_read - cdcsdk_intents_per_txn - cdcsdk_response_records / cdcsdk_response_bytes Per-optype counters incremented in the GetChangesForCDCSDK main switch: - cdcsdk_write_ops_seen, cdcsdk_update_txn_ops_seen, cdcsdk_change_metadata_ops_seen, cdcsdk_truncate_ops_seen, cdcsdk_split_ops_seen Plumbing: the existing CDCThroughputMetrics struct that already threads counters from the producer to the service layer is extended with a non-owning xrepl::CDCSDKTabletMetrics* pointer plus per-call WAL accumulators. CDCServiceImpl::GetChanges acquires the metric instance via the existing GetCDCSDKTabletMetrics() lookup (skipped for xCluster and the sys catalog tablet), populates the pointer before calling GetChangesForCDCSDK, and emits batch-size histograms at the end of the call. ScopedLatencyMetric (null-safe) is used at every observation site so the same code paths work on the xCluster path where the per-tablet pointer is null. Scope: CDCSDK only. xCluster path is unchanged.

The server-wide handler_latency_yb_cdc_CDCService_GetChanges already covers total RPC duration, and the per-phase histograms in this PR let operators attribute time per phase. The extra per-tablet total adds no signal and costs one EventStats per (stream, tablet).

bdbene · 2026-05-25T20:42:34Z

/gemini review

…tion Summary: ## Problem A READ COMMITTED retry triggered by a deadlock / abort could SIGSEGV at `AfterTriggerEndSubXact` during the subsequent ROLLBACK. Stack trace: ``` Signal: SIGSEGV #0 GetMemoryChunkContext(pointer=0x0) memutils.h:141:12 (inlined) #1 pfree(pointer=0x0) mcxt.c:1500:26 (inlined) #2 AfterTriggerEndSubXact trigger.c:5657:4 #3 AbortSubTransaction xact.c:5726:3 #4 CommitTransactionCommand xact.c:3650:4 #5 CommitTransactionCommand xact.c:0 #6 yb_exec_simple_query_impl postgres.c:3023:3 (inlined finish_xact_command) #7 yb_exec_simple_query_impl postgres.c:1494:4 #8 yb_exec_simple_query_impl postgres.c:5804:2 #9 yb_exec_query_wrapper_one_attempt postgres.c:5764:3 #10 PostgresMain postgres.c:5796:3 #11 PostgresMain postgres.c:5821:2 (inlined yb_exec_simple_query) yugabyte#12 PostgresMain postgres.c:6623:8 ``` ### Reading the stack - **#2 `AfterTriggerEndSubXact` (trigger.c:5657)** -- the call site is `pfree(afterTriggers.state)` in the abort branch, gated on `trans_stack[my_level].state != NULL`. - **#3-#5 `CommitTransactionCommand` -> `AbortSubTransaction`** -- called because of ROLLBACK. ### Root cause The retry path before this change ran, in order: ``` yb_restart_transaction -> YBCRestartWriteTransaction -> AfterTriggerEndXact(false) // wipes trans_stack to NULL, maxtransdepth to 0 // and afterTriggers.state to NULL -> AfterTriggerBeginXact() -> RollbackAndReleaseCurrentSubTransaction -> YbBeginInternalSubTransactionForReadCommittedStatement -> AfterTriggerBeginSubXact -> MemoryContextAlloc(8 * sizeof(AfterTriggersTransData)) // NOT AllocZero: leaves N-1 slots uninitialized -> initializes trans_stack[my_level] only ``` Any live subxact with level < `my_level` -- a user `SAVEPOINT` and the per-statement RC internal subxact above it -- now points at an uninitialized slot. ROLLBACK reads garbage as `state`, the non-NULL gate passes, and `pfree(afterTriggers.state)` faults because `EndXact` already cleared that field. ## Fix `YBCRestartWriteTransaction` now rolls back every savepoint / subtransaction before recreating the top-level write state, so the PG-side `trans_stack` is empty by the time the surgical reset wipes the after-trigger state. Removed the now-redundant `RollbackAndReleaseCurrentSubTransaction()` from the else branch of `yb_restart_transaction`. Test Plan: Jenkins Reviewers: pjain, smishra Reviewed By: pjain Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D54387

bdbene added 2 commits May 25, 2026 16:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[CDC] DocDB: Per-phase latency and batch-size metrics for CDCSDK GetChanges#10

[CDC] DocDB: Per-phase latency and batch-size metrics for CDCSDK GetChanges#10
bdbene wants to merge 2 commits into
Shopify:masterfrom
bdbene:worktree-cdc-getchanges-metrics

bdbene commented May 25, 2026 •

edited

Loading

Uh oh!

bdbene commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

bdbene commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Per-phase latency histograms (microseconds, per stream/tablet)

Per-call batch-size histograms (one observation per successful response)

Per-optype counters

Implementation notes

Test plan

Uh oh!

bdbene commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bdbene commented May 25, 2026 •

edited

Loading