Skip to content

[CDC] DocDB: Per-phase latency and batch-size metrics for CDCSDK GetChanges#10

Draft
bdbene wants to merge 2 commits into
Shopify:masterfrom
bdbene:worktree-cdc-getchanges-metrics
Draft

[CDC] DocDB: Per-phase latency and batch-size metrics for CDCSDK GetChanges#10
bdbene wants to merge 2 commits into
Shopify:masterfrom
bdbene:worktree-cdc-getchanges-metrics

Conversation

@bdbene

@bdbene bdbene commented May 25, 2026

Copy link
Copy Markdown

Summary

Adds per-tablet histograms and counters on CDCSDKTabletMetrics so operators can attribute slow CDCSDK GetChanges RPCs to specific phases. Today the only visibility is the server-wide handler_latency_yb_cdc_CDCService_GetChanges and a few aggregate counters, which is not enough to tell whether the time is going to WAL reads, intent reads, schema lookups, the cdc_state checkpoint write, or somewhere else.

Scope: CDCSDK only. The xCluster path is unchanged.

Per-phase latency histograms (microseconds, per stream/tablet)

Total RPC duration is already covered by the server-wide handler_latency_yb_cdc_CDCService_GetChanges; the per-phase metrics below let operators attribute time within that total.

Metric What it measures
cdcsdk_get_changes_preflight_latency RPC entry up to GetChangesForCDCSDK (semaphore, validation, stream/tablet/leader lookup, schema/enum/composite caches)
cdcsdk_get_last_checkpoint_latency GetLastCheckpoint (cdc_state read) when the client sends no from_op_id
cdcsdk_update_checkpoint_latency UpdateCheckpointAndActiveTime (cdc_state write) on EXPLICIT ack
cdcsdk_wal_read_latency GetConsistentWALRecords / GetWALRecords
cdcsdk_process_intents_latency ProcessIntentsWithInvalidSchemaRetry per multi-shard txn
cdcsdk_get_intents_latency tablet->GetIntentsForCDC alone (separates IntentsDB read from record population)
cdcsdk_populate_write_record_latency PopulateCDCSDKWriteRecord for single-shard writes
cdcsdk_schema_lookup_latency client->GetTableSchemaFromSysCatalog (master round-trips)
cdcsdk_safe_time_wait_latency GetConsistentStreamSafeTime (waiting for safe time / txn load)

Per-call batch-size histograms (one observation per successful response)

cdcsdk_wal_records_read, cdcsdk_wal_bytes_read, cdcsdk_intents_per_txn, cdcsdk_response_records, cdcsdk_response_bytes.

Per-optype counters

Incremented in the GetChangesForCDCSDK main switch: cdcsdk_write_ops_seen, cdcsdk_update_txn_ops_seen, cdcsdk_change_metadata_ops_seen, cdcsdk_truncate_ops_seen, cdcsdk_split_ops_seen.

Implementation notes

  • Reuses the existing CDCThroughputMetrics struct that already threads counters from the producer to the service layer. Extended with a non-owning xrepl::CDCSDKTabletMetrics* pointer plus per-call WAL byte/record accumulators.
  • CDCServiceImpl::GetChanges acquires the per-tablet metric via the existing GetCDCSDKTabletMetrics() lookup (skipped for xCluster and the sys catalog tablet), populates the pointer before calling GetChangesForCDCSDK, and emits batch-size histograms at the end of the call.
  • All observation sites use ScopedLatencyMetric<EventStats>, which is null-safe — so the same code paths work when the per-tablet pointer is unset (xCluster, sys catalog, metric acquisition failure).
  • Cardinality: roughly triples the per-tablet metric series. Same (streams × tablets) scaling as the existing cdcsdk_sent_lag_micros etc., so no new cardinality regime, but worth measuring tserver heap impact on a busy cluster.

Test plan

  • ./yb_build.sh release daemons initdb --sj --skip-pg-parquet --no-odyssey --no-ybc builds clean
  • Existing CDC tests pass: ./yb_build.sh release --cxx-test cdc_service-test and cdcsdk_producer-test
  • Manual smoke: spin up a single-node tserver, create a CDCSDK stream with the Debezium gRPC connector, run a workload, and verify the new cdcsdk_*_latency / cdcsdk_*_read / cdcsdk_*_per_txn series appear at http://<tserver>:9000/prometheus-metrics with non-zero counts and sensible percentiles
  • Confirm the metrics show up labeled by table_name / tablet_id / stream_id (same labels as cdcsdk_sent_lag_micros)
  • Confirm sys catalog tablet emits no new metrics (existing guard reused)

bdbene added 2 commits May 25, 2026 16:35
…hanges

Adds per-tablet histograms and counters on CDCSDKTabletMetrics so operators
can attribute slow GetChanges RPCs to specific phases. Today the only
visibility is the server-wide handler_latency and a few aggregate counters,
which is not enough to tell whether time is going to WAL reads, intent reads,
schema lookups, the cdc_state checkpoint write, or somewhere else.

Per-phase latency histograms (microseconds, per stream/tablet):
- cdcsdk_get_changes_total_latency      total RPC duration (per-tablet view)
- cdcsdk_get_changes_preflight_latency  RPC entry up to GetChangesForCDCSDK
- cdcsdk_get_last_checkpoint_latency    cdc_state read on missing from_op_id
- cdcsdk_update_checkpoint_latency      cdc_state write on EXPLICIT ack
- cdcsdk_wal_read_latency               GetConsistentWALRecords / GetWALRecords
- cdcsdk_process_intents_latency        ProcessIntentsWithInvalidSchemaRetry
- cdcsdk_get_intents_latency            tablet->GetIntentsForCDC alone
- cdcsdk_populate_write_record_latency  single-shard PopulateCDCSDKWriteRecord
- cdcsdk_schema_lookup_latency          client->GetTableSchemaFromSysCatalog
- cdcsdk_safe_time_wait_latency         GetConsistentStreamSafeTime

Per-call batch-size histograms (one observation per successful response):
- cdcsdk_wal_records_read / cdcsdk_wal_bytes_read
- cdcsdk_intents_per_txn
- cdcsdk_response_records / cdcsdk_response_bytes

Per-optype counters incremented in the GetChangesForCDCSDK main switch:
- cdcsdk_write_ops_seen, cdcsdk_update_txn_ops_seen,
  cdcsdk_change_metadata_ops_seen, cdcsdk_truncate_ops_seen,
  cdcsdk_split_ops_seen

Plumbing: the existing CDCThroughputMetrics struct that already threads
counters from the producer to the service layer is extended with a
non-owning xrepl::CDCSDKTabletMetrics* pointer plus per-call WAL accumulators.
CDCServiceImpl::GetChanges acquires the metric instance via the existing
GetCDCSDKTabletMetrics() lookup (skipped for xCluster and the sys catalog
tablet), populates the pointer before calling GetChangesForCDCSDK, and emits
batch-size histograms at the end of the call. ScopedLatencyMetric (null-safe)
is used at every observation site so the same code paths work on the xCluster
path where the per-tablet pointer is null.

Scope: CDCSDK only. xCluster path is unchanged.
The server-wide handler_latency_yb_cdc_CDCService_GetChanges already covers
total RPC duration, and the per-phase histograms in this PR let operators
attribute time per phase. The extra per-tablet total adds no signal and
costs one EventStats per (stream, tablet).
@bdbene

bdbene commented May 25, 2026

Copy link
Copy Markdown
Author

/gemini review

egladysh pushed a commit that referenced this pull request Jun 18, 2026
…tion

Summary:
## Problem

A READ COMMITTED retry triggered by a deadlock / abort could SIGSEGV at
`AfterTriggerEndSubXact` during the subsequent ROLLBACK.

Stack trace:

```
Signal: SIGSEGV
  #0  GetMemoryChunkContext(pointer=0x0)            memutils.h:141:12  (inlined)
  #1  pfree(pointer=0x0)                            mcxt.c:1500:26     (inlined)
  #2  AfterTriggerEndSubXact                        trigger.c:5657:4
  #3  AbortSubTransaction                           xact.c:5726:3
  #4  CommitTransactionCommand                      xact.c:3650:4
  #5  CommitTransactionCommand                      xact.c:0
  #6  yb_exec_simple_query_impl                     postgres.c:3023:3  (inlined finish_xact_command)
  #7  yb_exec_simple_query_impl                     postgres.c:1494:4
  #8  yb_exec_simple_query_impl                     postgres.c:5804:2
  #9  yb_exec_query_wrapper_one_attempt             postgres.c:5764:3
  #10 PostgresMain                                  postgres.c:5796:3
  #11 PostgresMain                                  postgres.c:5821:2  (inlined yb_exec_simple_query)
  yugabyte#12 PostgresMain                                  postgres.c:6623:8
```

### Reading the stack

- **#2 `AfterTriggerEndSubXact` (trigger.c:5657)** -- the call site is
  `pfree(afterTriggers.state)` in the abort branch, gated on
  `trans_stack[my_level].state != NULL`.

- **#3-#5 `CommitTransactionCommand` -> `AbortSubTransaction`** -- called
  because of ROLLBACK.

### Root cause

The retry path before this change ran, in order:

```
yb_restart_transaction
  -> YBCRestartWriteTransaction
       -> AfterTriggerEndXact(false)   // wipes trans_stack to NULL, maxtransdepth to 0
                                       //   and afterTriggers.state to NULL
       -> AfterTriggerBeginXact()
  -> RollbackAndReleaseCurrentSubTransaction
  -> YbBeginInternalSubTransactionForReadCommittedStatement
       -> AfterTriggerBeginSubXact
            -> MemoryContextAlloc(8 * sizeof(AfterTriggersTransData))
              // NOT AllocZero: leaves N-1 slots uninitialized
            -> initializes trans_stack[my_level] only
```

Any live subxact with level < `my_level` -- a user `SAVEPOINT` and the
per-statement RC internal subxact above it -- now points at an
uninitialized slot. ROLLBACK reads garbage as `state`, the non-NULL
gate passes, and `pfree(afterTriggers.state)` faults because `EndXact`
already cleared that field.

## Fix

`YBCRestartWriteTransaction` now rolls back every savepoint /
subtransaction before recreating the top-level write state, so the
PG-side `trans_stack` is empty by the time the surgical reset wipes
the after-trigger state.

Removed the now-redundant `RollbackAndReleaseCurrentSubTransaction()` from
the else branch of `yb_restart_transaction`.

Test Plan: Jenkins

Reviewers: pjain, smishra

Reviewed By: pjain

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D54387
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant