
fix(broker): report served tool/prompt count on upstream status error path #1165

Draft
namansh70747 wants to merge 1 commit into Kuadrant:main from namansh70747:fix/stale-tool-count-on-status-error

@namansh70747 namansh70747 commented Jun 19, 2026

Contributor

Fixes #1164

When an upstream connection or ping fails, manage() removes the server's tools and prompts and then calls setStatus to mark it not ready. setStatus returned early on the error path without updating TotalTools/TotalPrompts, so the status kept the count from the last healthy cycle. The controller surfaces that value as status.discoveredTools (the Tools printer column), so kubectl get mcpserverregistration showed a server as Ready=False while still listing its old tool count.

This reports the count actually being served on the error path:

  • 0 when the tools were removed (connect/ping failure, rejected server)
  • the cached count when a transient tools/list error leaves the previously served set in place (default FilterOut policy), so a momentary list blip doesn't wrongly report 0 while tools are still served

The read is guarded by the existing toolsLock, which no setStatus caller holds, so there's no deadlock. It lines up with the success path, where toolCount == len(man.tools).

Test: TestMCPManager_setStatus_ErrorReportsServedCount covers both cases (removed → 0, still served → keeps count). It fails on the old behaviour and passes with the fix.

Noted on the issue that the MCPServerRegistration/Status endpoint coupling is changing soon — happy for this to be superseded by that work; sending it as a small interim fix.

Summary by CodeRabbit

  • Bug Fixes
    • Improved accuracy of tool and prompt counts during error states, ensuring proper status reporting when the system encounters operational issues.

Copilot AI review requested due to automatic review settings June 19, 2026 07:06
@coderabbitai

coderabbitai Bot commented Jun 19, 2026


Walkthrough

In MCPManager.setStatus, the error path now reads len(man.tools) and len(man.prompts) under the toolsLock read lock and assigns them to status.TotalTools and status.TotalPrompts before returning. A new test covers both the cleared-count and preserved-count scenarios.

Changes

setStatus error-path count fix

| Layer / File(s) | Summary |
| --- | --- |
| Error-path count update and test: internal/broker/upstream/manager.go, internal/broker/upstream/manager_test.go | setStatus now assigns TotalTools/TotalPrompts from the live cached lengths under toolsLock on the error path instead of leaving stale values. New test asserts counts clear to 0 when tools are removed and remain accurate on transient errors. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested labels

review-effort/medium

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | Title clearly describes the main change: updating setStatus to report accurate tool/prompt counts on error paths, matching the PR's core objective. |
| Linked Issues check | ✅ Passed | Code changes fully address #1164 by updating setStatus to report actual served counts (0 when removed, cached count when still served) under toolsLock protection. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the identified bug: setStatus error path logic and corresponding test coverage. |


coderabbitai Bot added the review-effort/medium label (Medium review effort (3): few files, moderate logic) on Jun 19, 2026

Copilot AI left a comment


Pull request overview

This PR fixes broker status reporting so that when an upstream goes unhealthy, the MCP manager reports the tool/prompt counts that are actually being served (instead of leaving stale counts from the last healthy cycle). This aligns the Ready=False state with the TotalTools/TotalPrompts values the controller surfaces on MCPServerRegistration.

Changes:

  • Update MCPManager.setStatus() to populate TotalTools/TotalPrompts on the error path by reading the currently served cached sets under toolsLock.
  • Add a unit test covering both error scenarios: tools removed (counts drop to 0) vs transient list failure (counts remain accurate).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| internal/broker/upstream/manager.go | On setStatus error path, report served tool/prompt counts under toolsLock to avoid stale status values. |
| internal/broker/upstream/manager_test.go | Adds coverage to ensure error-path status reports 0 after removal and preserves counts when cached tools remain served. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/broker/upstream/manager_test.go`:
- Around line 418-429: The test in the "keeps count when tools are still served"
function validates that TotalTools count is preserved during a transient error,
but does not validate the same behavior for TotalPrompts. Add initialization of
manager.prompts with a slice containing a specific number of elements (similar
to how manager.tools is initialized) and add an assertion to verify that
manager.status.TotalPrompts matches the expected count after calling setStatus,
ensuring the prompt count is also preserved when errors occur.

📥 Commits

Reviewing files that changed from the base of the PR and between f7dfb26 and 78565c0.

📒 Files selected for processing (2)
  • internal/broker/upstream/manager.go
  • internal/broker/upstream/manager_test.go

Comment on lines +418 to +429
t.Run("keeps count when tools are still served", func(t *testing.T) {
	mock := newMockMCP("test-server", "test_")
	manager, err := NewUpstreamMCPManager(mock, newMockToolsAdderDeleter(), nil, logger, 0, mcpv1alpha1.InvalidToolPolicyFilterOut)
	require.NoError(t, err)
	// a transient tools/list error leaves the previously served set in place
	manager.tools = make([]mcp.Tool, 3)

	manager.setStatus(fmt.Errorf("list failed"), 3, 0, nil, nil)

	assert.False(t, manager.status.Ready)
	assert.Equal(t, 3, manager.status.TotalTools, "still-served tools should keep their count")
})


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Cover prompt-count behavior in transient-error case

Line 418-429 validates only TotalTools, but the fixed branch also updates TotalPrompts. Seed manager.prompts and assert manager.status.TotalPrompts to lock this contract down.
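One way to picture the suggested extension is the self-contained check below. It does not use the repository's helpers (newMockMCP, NewUpstreamMCPManager): the manager type, its fields, the one-argument setStatus and checkTransientErrorKeepsCounts are hypothetical stand-ins, so treat it as a sketch of the assertion shape rather than the actual sub-test.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in for the upstream MCP manager.
type manager struct {
	toolsLock    sync.RWMutex // guards tools and prompts
	tools        []string
	prompts      []string
	ready        bool
	totalTools   int
	totalPrompts int
}

// setStatus models only the behaviour under review: on error, report the
// counts currently served, read under the read lock.
func (m *manager) setStatus(err error) {
	m.ready = err == nil
	m.toolsLock.RLock()
	defer m.toolsLock.RUnlock()
	m.totalTools = len(m.tools)
	m.totalPrompts = len(m.prompts)
}

// checkTransientErrorKeepsCounts seeds both tools and prompts, simulates a
// transient list failure, and verifies that neither count is reset.
func checkTransientErrorKeepsCounts() error {
	m := &manager{
		tools:   make([]string, 3),
		prompts: make([]string, 2), // seeded so the prompt count is covered too
	}

	m.setStatus(errors.New("list failed"))

	if m.ready {
		return errors.New("manager should not be ready after an error")
	}
	if m.totalTools != 3 {
		return fmt.Errorf("TotalTools = %d, want 3", m.totalTools)
	}
	if m.totalPrompts != 2 {
		return fmt.Errorf("TotalPrompts = %d, want 2", m.totalPrompts)
	}
	return nil
}
```

Seeding prompts with a different length than tools (2 versus 3 here) makes it harder for an implementation that copies the tool count into both fields to pass by accident.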


@namansh70747

Contributor Author

Thanks again for the steer on #1164, @maleck13. I've put the small fix up here; take a look whenever you get a chance. No rush.

@jasonmadigan

jasonmadigan commented Jun 23, 2026

Member

Thanks for filing the issue and the fix. This is part of a broader change we're planning internally (#629) which will remove the status syncing that causes this staleness. Since that work is already assigned to a maintainer, we'll address it there. Generally, if an issue is assigned, it's best to check before sending a PR; it saves duplicated effort. Appreciate you flagging it though.

oops, this was closed prematurely - I think this is fine as a follow on to #1187 when that lands. keeping open

… path

When an upstream connection or ping fails, manage() removes the server's
tools and prompts and then calls setStatus to mark it not ready. setStatus
returned early on the error path without updating TotalTools/TotalPrompts,
so the status kept the count from the last healthy cycle. The controller
surfaces that value as status.discoveredTools (the Tools printer column),
so kubectl get mcpserverregistration showed a server as not ready while
still listing its old tool count.

Report the count actually being served on the error path: 0 when the tools
were removed (connect/ping failure, rejected server) and the cached count
when a transient tools/list error leaves the previously served set in place.

Signed-off-by: Naman Sharma <namsh70747@gmail.com>
namansh70747 force-pushed the fix/stale-tool-count-on-status-error branch from 78565c0 to 9300947 on June 23, 2026 16:40
@namansh70747

Contributor Author

referencing #1164 as the linked bug — that one's still triage/needs-triage; if accepting it unblocks the triage/needs-issue label here, happy to wait. otherwise let me know if a separate issue would be cleaner.

also addressed coderabbit's comment: seeded manager.prompts and added the TotalPrompts assertion in the transient-error sub-test to lock that contract down too.


Labels

review-effort/medium: Medium review effort (3): few files, moderate logic
triage/needs-issue: PR needs a linked issue


Development

Successfully merging this pull request may close these issues.

broker: MCPServerRegistration keeps reporting a stale tool count after an upstream goes down

4 participants