kafka_consumer: bound and refine estimated_consumer_lag#24167
kafka_consumer: bound and refine estimated_consumer_lag#24167piochelepiotr wants to merge 12 commits into
Conversation
Cap left-extrapolation of the broker timestamp cache so a consumer offset older than the oldest cached sample cannot extrapolate more than 10 minutes past it, keeping estimated_consumer_lag bounded. Use max(consumer_offset, low_watermark) as the offset basis for lag-in-time when cluster monitoring is enabled: messages below the low watermark are out of retention and unreachable, so they should not inflate the time lag. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
🎉 All green!🧪 All tests passed 🎯 Code Coverage (details) 🔗 Commit SHA: 9f76e78 | Docs | Datadog PR Page | Give us feedback! |
Replace single-oldest eviction with batch compaction (Visvalingam-Whyatt) triggered when the cache reaches capacity: keep the oldest and newest samples and drop the points that least distort the offset/timestamp curve, so the cache spans a longer history at a coarsening resolution and high lag is interpolated rather than extrapolated. At the same trigger, prune samples below the earliest consumer offset (keeping one anchor) since no consumer will ever interpolate there. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Use the partition low watermark as the prune floor when cluster monitoring is enabled (the physically meaningful "lowest readable offset"), falling back to the earliest committed consumer offset otherwise. The low watermark is now fetched before the cache update and reused for both pruning and the lag-in-time floor, so there is no extra broker call. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Previously the log-start (low watermark) offsets were fetched twice per run when cluster monitoring and data streams were both enabled: once by the metadata collector for partition.size/topic.size/throughput, and again by the lag path for the lag-in-time and cache-pruning floor. Fetch them once in check(), gated on cluster monitoring, over all non-internal topic partitions, and share the result with both the data-streams lag path and the metadata collector. Removes the duplicate list_offsets(earliest) call and the divergent internal-topic handling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…etch Drop the PR-added Client.get_low_watermark_offsets and the _get_low_watermark_offsets wrapper, which duplicated the existing ClusterMetadataCollector._fetch_earliest_offsets. The check now calls _fetch_earliest_offsets once under cluster monitoring and shares the result with both the data-streams lag/pruning path and the topic-metadata collection, so the earliest offsets are still fetched only once per run. This reverts client.py to master and keeps the cluster_metadata.py change to a small signature tweak. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the redundant earliest_offsets alias and reference the passed-in low_watermark_offsets directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Clarify that the left-extrapolation cap bounds lag-in-time regardless of cluster monitoring or the low-watermark floor, and document why there is no symmetric right-side clamp (the newest cached sample is the just-collected highwater, which the consumer offset can never exceed). - Promote ClusterMetadataCollector.fetch_earliest_offsets to a public method since KafkaCheck now calls it across the class boundary. - Log a debug line when the cache-prune floor falls back from the low watermark to the earliest consumer offset. - Extract the Visvalingam-Whyatt significance closure into a module-level _interpolation_error helper. - Parameterize the _visvalingam_whyatt tests; add direct tests for _earliest_consumer_offsets, _prune_below_anchor, and the left-extrapolation cap through report_consumer_offsets_and_lag without a low watermark. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ection Pass the topic-partition map computed in check() through collect_all_metadata into _collect_topic_metadata instead of fetching it again, so the cluster monitoring path makes the same number of get_topic_partitions calls as before. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Validation ReportAll 21 validations passed. Show details
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9f76e7821a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| for o in offsets: | ||
| if prev[o] is not None and nxt[o] is not None: | ||
| current[o] = _interpolation_error(o, prev, nxt, timestamps) | ||
| heap.append((current[o], o)) |
There was a problem hiding this comment.
Allow stale reset endpoints to age out
When a topic is recreated or reset and the new highwater is still above an old cached offset, the stale purge above only removes cached offsets greater than the new highwater, so lower offsets from the previous topic generation can remain. This compactor never enqueues the first/last offset, so the smallest cached offset is preserved forever instead of eventually being evicted by age as before; that stale timestamp can then keep poisoning estimated_consumer_lag interpolation for the new topic generation. Consider allowing endpoint samples that predate a reset to be dropped, or falling back to timestamp-based eviction for those cases.
Useful? React with 👍 / 👎.
| del timestamps[o] | ||
|
|
||
|
|
||
| def _visvalingam_whyatt(timestamps, target_count): |
There was a problem hiding this comment.
This could be a good bit slower than the previous eviction behavior. Do you think it will take too long on clusters with a lot of partitions?
There was a problem hiding this comment.
it's O(nlog(m)), so it will still be very fast. And it only runs 1/500 of the time.
The previous one was O(nlog(n)) and was running every time.
I don't expect that to be an issue.
What does this PR do?
Two refinements to the
kafka.estimated_consumer_lag(lag-in-time) calculation in the kafka_consumer check:Bound left-extrapolation. The broker timestamp cache stores
(highwater_offset → time)samples clustered near the latest offset. When a consumer offset is older than the oldest cached sample, the lag is estimated by extrapolating the affine offset/time fit into the past — an unreliable region that could produce an arbitrarily large lag. We now clamp the extrapolated timestamp to at mostLAG_EXTRAPOLATION_LIMIT_SECONDS(10 minutes) before the oldest cached sample. This is a continuous cap: reported lag rises smoothly up to roughlycache_window + 10minrather than jumping. Interpolation between known offsets is unaffected.Use the low watermark as a floor for the lag offset. A consumer that has fallen behind the partition low watermark can never read the messages between its committed offset and the low watermark — they are out of retention. So for lag-in-time we now interpolate from
max(consumer_offset, low_watermark)instead of the raw consumer offset. The offset-basedconsumer_lagmetric is unchanged. Low watermark offsets are fetched (viaAdminClient.list_offsets(earliest)) only whenenable_cluster_monitoringis enabled; with cluster monitoring off, behavior is unchanged.Motivation
Consumers replaying old data or lagging beyond the cache/retention window produced wildly overestimated
estimated_consumer_lagvalues. Both changes keep the metric meaningful and bounded in those cases.Review checklist (to be filled by reviewers)
qa/requiredif this PR needs QA validation, orqa/skip-qaif it does not. Exactly one of the two is required.backport/<branch-name>label to the PR and it will automatically open a backport PR once this one is merged🤖 Generated with Claude Code