Skip to content

Fix thv_reg_srv_servers_total/skills_total staleness and labels#784

Merged
rdimitrov merged 2 commits into
mainfrom
faceted-cork
May 12, 2026
Merged

Fix thv_reg_srv_servers_total/skills_total staleness and labels#784
rdimitrov merged 2 commits into
mainfrom
faceted-cork

Conversation

@rdimitrov
Copy link
Copy Markdown
Member

@rdimitrov rdimitrov commented May 12, 2026

Summary

  • Switch thv_reg_srv_servers_total and thv_reg_srv_skills_total from synchronous Int64Gauge to Int64ObservableGauge, with a callback that reads counts from the database at every scrape. The series no longer go stale once syncs quiesce.
  • Rename the misleading registry attribute to source — the value was always a source name. Bundled Grafana dashboard and docs/observability.md updated to match.
  • Count registry_entry (not entry_version), grouped over source LEFT JOIN registry_entry. This both matches the "latest only" semantics that consumer UIs show and includes kubernetes/managed source types that previously never appeared in the metrics because they bypass the sync coordinator.

Test plan

  • task gen && task lint-fix && task test
  • New unit tests cover happy-path observation and reader-error propagation through the OTel callback
  • New testcontainers integration test (TestDatabaseFactory_CreateRegistryMetricsReader) asserts: 1 entry with 2 versions ⇒ count 1, empty source ⇒ 0/0
  • Reviewer: any external Grafana queries / Prometheus alerts using the old registry label will need to switch to source

Fixes #783

rdimitrov added 2 commits May 12, 2026 14:02
Switch the two registry-count gauges from synchronous `Int64Gauge` to
`Int64ObservableGauge` and source their values from the database at
collection time. The previous implementation only recorded values inside
the success branch of `coordinator.performRegistrySync`, so once
`ShouldSync` returned up-to-date-no-policy the OTel Prometheus exporter
stopped seeing fresh observations and dropped the series.

The callback queries `source LEFT JOIN registry_entry` grouped by source
name, which also closes the gap where `kubernetes` and `managed` source
types never appeared in the metrics at all — they live in the `source`
table even though they bypass the sync coordinator. Counting
`registry_entry` rather than `entry_version` makes the gauge report
distinct servers/skills per source, matching what the consumer UIs show.

Rename the misleading `registry` attribute to `source` (the value was
always a source name, never a registry name) and update the bundled
Grafana dashboard and observability docs to match.

Fixes #783
`revive`'s `context-as-argument` linter requires `context.Context` to be
the first parameter of every function.
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 53.57143% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.55%. Comparing base (845e3ee) to head (aad6afe).

Files with missing lines Patch % Lines
internal/app/builder.go 16.66% 9 Missing and 1 partial ⚠️
internal/app/storage/registry_metrics.go 55.55% 4 Missing and 4 partials ⚠️
internal/telemetry/metrics.go 75.00% 5 Missing and 1 partial ⚠️
internal/sync/coordinator/coordinator.go 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #784      +/-   ##
==========================================
+ Coverage   61.38%   61.55%   +0.16%     
==========================================
  Files         109      110       +1     
  Lines       10637    10667      +30     
==========================================
+ Hits         6530     6566      +36     
+ Misses       3527     3515      -12     
- Partials      580      586       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@danbarr danbarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics return as expected now in the demo-sandbox dashboard, LGTM!

@rdimitrov rdimitrov merged commit f049e34 into main May 12, 2026
15 checks passed
@rdimitrov rdimitrov deleted the faceted-cork branch May 12, 2026 16:15
danbarr added a commit to StacklokLabs/toolhive-demo-sandbox that referenced this pull request May 12, 2026
- Update to align with stacklok/toolhive-registry-server#784
- Add panels for skills by source and sync operations

Signed-off-by: Dan Barr <6922515+danbarr@users.noreply.github.com>
danbarr added a commit to StacklokLabs/toolhive-demo-sandbox that referenced this pull request May 12, 2026
- Update to align with stacklok/toolhive-registry-server#784
- Add panels for skills by source and sync operations

Signed-off-by: Dan Barr <6922515+danbarr@users.noreply.github.com>
Co-authored-by: Dan Barr <6922515+danbarr@users.noreply.github.com>
github-actions Bot added a commit to stacklok/docs-website that referenced this pull request May 12, 2026
The 'registry' attribute on thv_reg_srv_servers_total and
thv_reg_srv_skills_total was renamed to 'source' in
toolhive-registry-server v1.4.3 (stacklok/toolhive-registry-server#784).
Update the metrics reference table and add a note about the rename.
rdimitrov pushed a commit to stacklok/docs-website that referenced this pull request May 12, 2026
* Update stacklok/toolhive-registry-server to v1.4.3

Signed-off-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

* Update registry metrics labels for v1.4.3

The 'registry' attribute on thv_reg_srv_servers_total and
thv_reg_srv_skills_total was renamed to 'source' in
toolhive-registry-server v1.4.3 (stacklok/toolhive-registry-server#784).
Update the metrics reference table and add a note about the rename.

* Editorial polish for registry metrics note

Use parallel "per source" phrasing in the table, switch the v1.4.3
admonition to :::warning since it documents a breaking change to user
queries, and align terminology on "label" to match the table header.

---------

Signed-off-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

thv_reg_srv_skills_total and thv_reg_srv_servers_total are wrong: stale after sync quiesces, mislabeled as registry

3 participants