remove controller->broker status polling for MCPServerRegistration #1187

Open
jasonmadigan wants to merge 5 commits into Kuadrant:main from jasonmadigan:629-remove-status-polling

jasonmadigan (Member) commented Jun 23, 2026

Closes #629

Summary

Rips out the controller-to-broker status polling. The controller was setting MCPServerRegistration Ready only after HTTP-polling the broker's /status endpoint, which was unreliable: tool counts went stale, protocol validation lagged behind reality. Now the controller sets Ready once the config secret is written. Runtime status lives in the broker's /status endpoint where it belongs.

Removing the polling exposed two pre-existing bugs that were hiding behind the polling delay:

ext_proc Value vs RawValue: Since Envoy 1.27, send_header_raw_value is enabled by default. Envoy reads RawValue (bytes) and ignores Value (string). The router's WithImmediateResponse and WithImmediateJSONRPCResponse were using Value, producing empty Content-Type and mcp-session-id headers on all error responses. The HeadersBuilder used by non-error paths already used RawValue correctly. This was latent because the old polling delay meant tools were always registered by the time a call arrived, so the error paths never actually ran in practice.
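A minimal sketch of why the error-path headers came out empty. It uses a local stand-in struct with the same three fields as Envoy's `HeaderValue` message rather than the real go-control-plane type, and the `rawHeader` and `effectiveValue` helpers are hypothetical names for illustration only, not the router's actual functions.

```go
package main

import "fmt"

// HeaderValue is a local stand-in modeled on Envoy's core.v3 HeaderValue
// message, reduced to the fields discussed above. It is not the real type.
type HeaderValue struct {
	Key      string
	Value    string // ignored by Envoy when send_header_raw_value is enabled
	RawValue []byte // the field Envoy reads in that mode
}

// rawHeader (hypothetical helper) puts the payload in RawValue,
// which is what the fixed error paths are described as doing.
func rawHeader(key, value string) HeaderValue {
	return HeaderValue{Key: key, RawValue: []byte(value)}
}

// effectiveValue models what a proxy that reads only RawValue would see.
func effectiveValue(h HeaderValue) string {
	return string(h.RawValue)
}

func main() {
	before := HeaderValue{Key: "content-type", Value: "text/plain"}
	after := rawHeader("content-type", "text/plain")
	fmt.Printf("before: %q\n", effectiveValue(before)) // before: ""
	fmt.Printf("after:  %q\n", effectiveValue(after))  // after:  "text/plain"
}
```

The point of the sketch is only that a header populated through `Value` is indistinguishable from an empty header to a reader that looks at `RawValue` alone.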

Redundant config writes: UpsertMCPServer wrote the Secret on every reconcile even when content was unchanged, bumping resourceVersion each time. Each bump triggered kubelet volume sync and broker config reload, which could restart MCPManagers and temporarily empty the tool registry. Added a ConfigChanged guard to skip no-op writes, and fixed the guard to include URL comparison -- without it, changes to spec.path or backend port would silently produce stale config.
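The skip-write guard can be sketched roughly as follows. The `server` and `store` types, their fields, and the function names are assumptions made for illustration; the real `MCPServer` struct compares more fields and the real writer talks to a Kubernetes Secret rather than a map.

```go
package main

import "fmt"

// server is a hypothetical, trimmed-down server entry; the real struct has more fields.
type server struct {
	Name   string
	URL    string
	Prefix string
}

// configChanged reports whether any config-relevant field differs, URL included.
func configChanged(existing, desired server) bool {
	return existing.Name != desired.Name ||
		existing.URL != desired.URL ||
		existing.Prefix != desired.Prefix
}

// store stands in for the Secret; writes counts how often it was actually updated.
type store struct {
	servers map[string]server
	writes  int
}

// upsert writes only when the entry is new or changed, and reports whether it wrote.
func (s *store) upsert(desired server) bool {
	if existing, ok := s.servers[desired.Name]; ok && !configChanged(existing, desired) {
		return false // no-op reconcile: leave the stored object untouched
	}
	s.servers[desired.Name] = desired
	s.writes++
	return true
}

func main() {
	s := &store{servers: map[string]server{}}
	a := server{Name: "weather", URL: "http://weather:8080/mcp", Prefix: "weather_"}
	fmt.Println(s.upsert(a)) // true: first write
	fmt.Println(s.upsert(a)) // false: unchanged, skipped
	a.URL = "http://weather:9090/mcp"
	fmt.Println(s.upsert(a)) // true: the port change is detected via URL
	fmt.Println(s.writes)    // 2
}
```

Dropping the `URL` comparison from `configChanged` would make the third call return false, which is the stale-config case described above.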

What changed

Controller (-280 lines):

  • Deleted ServerValidator (HTTP polling to broker)
  • Simplified updateStatus: removed toolCount param, Ready on config write
  • Removed DiscoveredTools from CRD status, regenerated CRDs

Router (bug fix):

  • WithImmediateResponse/WithImmediateJSONRPCResponse: Value to RawValue
  • Error-path mcp-session-id headers: Value to RawValue
  • Added Content-Type: text/plain to WithImmediateResponse

Config writer:

  • UpsertMCPServer skips write when server config unchanged
  • ConfigChanged includes URL to prevent stale config on path/port changes

E2e tests:

  • Protocol validation, conflict, TLS negative, unavailable server tests rewritten for new semantics (controller no longer reports broker-side errors)
  • Tool count verifiers removed

Conformance workflow:

  • Replaced sleep 10 with broker /status polling to wait for tool discovery after kubelet volume sync
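For illustration, the bounded-polling pattern looks roughly like this in Go. The workflow itself is a shell step, so this is a sketch of the idea and not the code in the PR. The `tools` field in the response body is an assumed payload shape, and `waitForTools`, `ready`, and `runDemo` are hypothetical names; the demo runs against a local fake server, not a real broker.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// statusBody is a hypothetical payload shape; the real /status response may differ.
type statusBody struct {
	Tools int `json:"tools"`
}

// waitForTools polls url until the (assumed) tool count is positive or ctx expires.
// It returns the number of requests made.
func waitForTools(ctx context.Context, url string, interval time.Duration) (int, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for polls := 1; ; polls++ {
		if ready(ctx, url) {
			return polls, nil
		}
		select {
		case <-ctx.Done():
			return polls, fmt.Errorf("broker not ready after %d polls: %w", polls, ctx.Err())
		case <-ticker.C:
		}
	}
}

// ready performs one probe; any transport, status, or decode error counts as "not yet".
func ready(ctx context.Context, url string) bool {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	var body statusBody
	if resp.StatusCode != http.StatusOK || json.NewDecoder(resp.Body).Decode(&body) != nil {
		return false
	}
	return body.Tools > 0
}

// runDemo serves an empty registry twice, then three tools, and polls it.
func runDemo() (int, error) {
	var calls atomic.Int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		tools := 0
		if calls.Add(1) >= 3 {
			tools = 3
		}
		json.NewEncoder(w).Encode(statusBody{Tools: tools})
	}))
	defer srv.Close()
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return waitForTools(ctx, srv.URL+"/status", 10*time.Millisecond)
}

func main() {
	polls, err := runDemo()
	fmt.Println(polls, err) // 3 <nil>
}
```

Unlike a fixed sleep, the loop returns as soon as the signal appears and fails with a clear error when the timeout on the context is reached.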

Docs: removed discoveredTools from API ref, clarified status ownership in design doc

Test evidence

  • make test-unit: pass
  • make test-controller-integration: 41/41
  • CI e2e: 51/51
  • CI conformance: 10/10

Summary by CodeRabbit

  • New Features

    • Registration status now highlights category instead of tool count.
    • Tool discovery is now checked asynchronously after the gateway becomes ready.
  • Bug Fixes

    • Improved readiness handling so registrations report ready as soon as configuration is written.
    • Reduced unnecessary updates when server configuration hasn’t changed.
    • Immediate error responses now return clearer content-type headers.
  • Documentation

    • Updated guides and reference docs to match the new readiness and status behavior.
    • Clarified where to check tool availability and what status information means.

coderabbitai Bot commented Jun 23, 2026

📝 Walkthrough

Removes broker-side status polling from MCPServerRegistration reconciliation, drops discoveredTools from the API/CRD, updates readiness docs and CI checks, adjusts config change detection, and updates router immediate-response headers and E2E assertions.

Changes

MCPServerRegistration status, docs, and conformance checks

| Layer / File(s) | Summary |
|---|---|
| **Status contract and CI timing**<br>`api/v1alpha1/types.go`, `bundle/manifests/mcp.kuadrant.io_mcpserverregistrations.yaml`, `docs/design/backend-mcp-management.md`, `docs/guides/external-mcp-server.md`, `docs/guides/register-mcp-servers.md`, `docs/reference/mcpserverregistration.md`, `.github/workflows/conformance.yaml` | Removes the discoveredTools status field and Tools printer column, adds a Category printer column in the CRD, updates readiness and status docs, and changes the conformance workflow to wait for broker tool discovery and collect broker/router/Envoy logs on failure. |
| **Config change detection and write skipping**<br>`internal/config/config_writer.go`, `internal/config/types.go`, `internal/config/mcpservers_test.go` | ConfigChanged now treats URL differences as changes, and UpsertMCPServer skips Secret updates when the config is unchanged. |
| **Controller writes Ready immediately**<br>`internal/controller/mcpserverregistration_controller.go`, `internal/controller/server_validator.go`, `internal/controller/server_validator_test.go` | Removes broker polling and the ServerValidator, updates status immediately after config writes, removes tool-count handling from updateStatus, and preserves LastTransitionTime when condition status is unchanged. |
| **Immediate response headers use RawValue**<br>`internal/mcp-router/request_handlers.go`, `internal/mcp-router/response_builder.go`, `internal/mcp-router/response_builder_test.go`, `internal/mcp-router/server_test.go` | Switches immediate-response header construction to RawValue for mcp-session-id and content-type, and updates tests to expect the text/plain header mutation. |
| **E2E readiness and runtime assertions**<br>`tests/e2e/verifiers.go`, `tests/e2e/happy_path_test.go`, `tests/e2e/custom_tls_test.go` | Removes the tools-count readiness helper and updates end-to-end tests to use plain readiness checks, broker /status verification, and revised runtime assertions for TLS, conflicts, session reuse, notifications, and server unavailability. |
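As an illustration of the "preserves LastTransitionTime when condition status is unchanged" behavior mentioned for the controller, here is a minimal sketch. The `condition` struct is a simplified stand-in for a Kubernetes status condition, and `setCondition`, `demo`, and the `Invalid` reason are hypothetical names, not the controller's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// condition is a simplified stand-in for a Kubernetes status condition.
type condition struct {
	Type               string
	Status             string // "True", "False" or "Unknown"
	Reason             string
	LastTransitionTime time.Time
}

// setCondition updates or appends next; the timestamp moves only when Status flips.
func setCondition(conds []condition, next condition, now time.Time) []condition {
	for i := range conds {
		if conds[i].Type != next.Type {
			continue
		}
		if conds[i].Status == next.Status {
			next.LastTransitionTime = conds[i].LastTransitionTime
		} else {
			next.LastTransitionTime = now
		}
		conds[i] = next
		return conds
	}
	next.LastTransitionTime = now
	return append(conds, next)
}

// demo reports whether the timestamp was kept on a repeat and moved on a flip.
func demo() (kept, moved bool) {
	t0 := time.Date(2026, 6, 24, 9, 0, 0, 0, time.UTC)
	t1 := t0.Add(time.Minute)
	ready := condition{Type: "Ready", Status: "True", Reason: "Ready"}
	conds := setCondition(nil, ready, t0)
	conds = setCondition(conds, ready, t1) // same status: timestamp stays at t0
	kept = conds[0].LastTransitionTime.Equal(t0)
	notReady := condition{Type: "Ready", Status: "False", Reason: "Invalid"}
	conds = setCondition(conds, notReady, t1) // status flips: timestamp moves to t1
	moved = conds[0].LastTransitionTime.Equal(t1)
	return kept, moved
}

func main() {
	kept, moved := demo()
	fmt.Println(kept, moved) // true true
}
```

Keeping the timestamp stable on repeat reconciles avoids status updates that differ only in time, which matters once Ready is set on every successful config write.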

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main change: removing controller-to-broker status polling for MCPServerRegistration. |
| Linked Issues check | ✅ Passed | The controller now marks MCPServerRegistration ready after config write and removes broker status polling, matching #629's desired behavior. |
| Out of Scope Changes check | ✅ Passed | The remaining changes are tied to the status-polling removal, related test/doc updates, or the accompanying config and header fixes. |

malladinagarjuna2 (Contributor) commented Jun 24, 2026

hey @jasonmadigan sir, rather than using port forwarding in the conformance tests, why can't we use kubectl get --raw?

@jasonmadigan jasonmadigan force-pushed the 629-remove-status-polling branch 3 times, most recently from 4bf4be4 to b41c205 Compare June 24, 2026 02:40
jasonmadigan (Member, Author) commented

@malladinagarjuna2 port forwarding isn't used. this is a draft PR, there was a port forward while I was debugging something

@jasonmadigan jasonmadigan marked this pull request as ready for review June 24, 2026 08:03
@coderabbitai coderabbitai Bot added the high-risk (Touches concurrency, auth, sessions, CRDs, ext_proc, or routing) and review-effort/large (High review effort (4-5): many files, complex, cross-cutting) labels Jun 24, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
.github/workflows/conformance.yaml (1)

113-120: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Replace the fixed 60s delay with a real readiness poll.

This sleep is still guesswork: it can make CI slower when the broker is already ready, and it can still race on slower runners. Polling a runtime signal such as broker /status or tool availability would make this step much less flaky.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/conformance.yaml around lines 113 - 120, Replace the fixed
sleep in the broker discovery step with an active readiness check. In the
workflow block that currently logs “Waiting 60s...” and calls sleep, poll a real
broker signal instead, such as the broker /status endpoint or a
tool-availability check, and exit only when the broker reports ready. Keep the
existing context around kubelet sync and tool discovery, but make the wait loop
bounded with a timeout so the conformance job remains reliable and doesn’t rely
on guesswork.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/backend-mcp-management.md`:
- Around line 115-121: Keep the storage-object terminology consistent across the
controller flow description and the readiness explanation in this design doc.
Update the relevant paragraphs and any referenced sequence/phase wording so they
all use the same object name as the controller implementation in the
MCPServerRegistration flow, especially around the readiness point in the
broker/controller lifecycle.

In `@internal/controller/mcpserverregistration_controller.go`:
- Around line 240-255: The readiness update in
mcpserverregistration_controller.go is marking the registration Ready even when
no namespace received a config write. Update the reconcile flow around the
upsert loop and updateStatus call so that Ready=True is only set when at least
one namespace was actually written to; if validNamespaces is empty (or no Secret
upsert occurred), leave the registration not ready and report an appropriate
non-success status/reason instead of "config written successfully". Use the
existing symbols updateStatus, validNamespaces, ServerStateDisabled, and
conditionReasonReady to locate and adjust the logic.
- Around line 257-259: The HTTPRoute status update failure is only being logged
in the reconcile path and then discarded, so the controller can report success
with stale status. Update the reconciliation flow around updateHTTPRouteStatus
in mcpserverregistration_controller.go to return the error (or otherwise fail
the reconcile) after logging it, so transient API/conflict failures trigger a
retry and keep the Programmed condition consistent.

In `@tests/e2e/happy_path_test.go`:
- Around line 236-252: The session ID check in the Eventually block is using
mcpsessionid from outside the retry closure, so stale values can carry across
attempts; update the test around mcpClient.CallTool and the Mcp-Session-Id
assertion to use a per-attempt local variable inside the closure, then assign it
back only after the substring check succeeds.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2716b232-5fb0-43e9-b566-4c9005b6aa42

📥 Commits

Reviewing files that changed from the base of the PR and between 3b36b81 and b41c205.

⛔ Files ignored due to path filters (2)
  • charts/mcp-gateway/crds/mcp.kuadrant.io_mcpserverregistrations.yaml is excluded by !charts/mcp-gateway/crds/**
  • config/crd/mcp.kuadrant.io_mcpserverregistrations.yaml is excluded by !config/crd/mcp.kuadrant.io_*.yaml
📒 Files selected for processing (16)
  • .github/workflows/conformance.yaml
  • api/v1alpha1/types.go
  • bundle/manifests/mcp.kuadrant.io_mcpserverregistrations.yaml
  • docs/design/backend-mcp-management.md
  • docs/reference/mcpserverregistration.md
  • internal/config/config_writer.go
  • internal/controller/mcpserverregistration_controller.go
  • internal/controller/server_validator.go
  • internal/controller/server_validator_test.go
  • internal/mcp-router/request_handlers.go
  • internal/mcp-router/response_builder.go
  • internal/mcp-router/response_builder_test.go
  • internal/mcp-router/server_test.go
  • tests/e2e/custom_tls_test.go
  • tests/e2e/happy_path_test.go
  • tests/e2e/verifiers.go
💤 Files with no reviewable changes (6)
  • internal/controller/server_validator.go
  • api/v1alpha1/types.go
  • internal/controller/server_validator_test.go
  • docs/reference/mcpserverregistration.md
  • tests/e2e/verifiers.go
  • bundle/manifests/mcp.kuadrant.io_mcpserverregistrations.yaml

Comment thread docs/design/backend-mcp-management.md
Comment thread internal/controller/mcpserverregistration_controller.go
Comment thread internal/controller/mcpserverregistration_controller.go
Comment thread tests/e2e/happy_path_test.go
@jasonmadigan jasonmadigan force-pushed the 629-remove-status-polling branch 4 times, most recently from f0b2902 to 0a327c6 Compare June 24, 2026 09:49
Signed-off-by: Jason Madigan <jason@jasonmadigan.com>
…robe

Signed-off-by: Jason Madigan <jason@jasonmadigan.com>
@jasonmadigan jasonmadigan force-pushed the 629-remove-status-polling branch from 0a327c6 to 742d612 Compare June 24, 2026 09:55
Signed-off-by: Jason Madigan <jason@jasonmadigan.com>
david-martin (Member) commented

👀

david-martin (Member) left a comment

The ConfigChanged thing looks like a good catch, and possible source of timing based bugs.

Not a blocker, but could the /status endpoint be helpful in reducing test time instead of Eventuallys, or not much time gained overall?

A few docs flagged as needing updates:
docs/guides/register-mcp-servers.md and docs/guides/external-mcp-server.md still show a TOOLS column in
kubectl get mcpsr output, and register-mcp-servers.md says "the broker needs a moment to connect and
discover tools and prompts" when describing what Ready means. Both need updating for the new semantics.

jasonmadigan (Member, Author) commented Jun 26, 2026

> The ConfigChanged thing looks like a good catch, and possible source of timing based bugs.

very likely yes

> Not a blocker, but could the /status endpoint be helpful in reducing test time instead of Eventuallys, or not much time gained overall?

I went back and forth on this a few times. Checking via the ep is probably better, but I was wary of e2e tests interacting with APIs under the covers. on reflection though, that ep is public, and with this change we're recommending folks use it to debug, so having extra test coverage for it via this is no bad thing? I've amended. stuck with the existing checks, as I didn't really want to add another port-forward...

pushed updates for the docs too, good catches

Signed-off-by: Jason Madigan <jason@jasonmadigan.com>
@jasonmadigan jasonmadigan force-pushed the 629-remove-status-polling branch from 554977a to 37ef576 Compare June 26, 2026 16:26

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
internal/config/mcpservers_test.go (1)

224-239: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

add a UserSpecificList regression case alongside this one

This update covers the URL path, but ConfigChanged now also treats UserSpecificList changes as config-impacting. Adding that table row here would keep the skip-write contract covered from both new branches.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/config/mcpservers_test.go` around lines 224 - 239, Add a regression
table case in mcpservers_test.go alongside the existing URL change case to cover
ConfigChanged behavior for UserSpecificList. Extend the MCPServer test matrix
with matching current/existing entries that differ only in UserSpecificList, and
assert expectChanged is true so the skip-write contract is covered for this new
config-impacting branch.
internal/config/types.go (1)

124-125: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

keep this comment non-exhaustive or make it complete

Line 124 is already stale again: the predicate also treats CACert, UserSpecificList, and TokenURLElicitation changes as config changes. Either drop the field list or make it match the full comparison set.

As per coding guidelines, **/*.go: Minimal, DRY, terse comments (lowercase, only when necessary).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/config/types.go` around lines 124 - 125, The comment above
MCPServer.ConfigChanged is stale and too specific because the comparison now
includes additional fields beyond the listed ones. Update the comment to either
be intentionally non-exhaustive or fully enumerate the current config-change
checks, including CACert, UserSpecificList, and TokenURLElicitation, and keep it
terse and lowercase to match the style guidelines.

Source: Coding guidelines

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f0627ec7-75c8-4536-a61b-2ab96757bdfc

📥 Commits

Reviewing files that changed from the base of the PR and between b41c205 and 37ef576.

⛔ Files ignored due to path filters (2)
  • charts/mcp-gateway/crds/mcp.kuadrant.io_mcpserverregistrations.yaml is excluded by !charts/mcp-gateway/crds/**
  • config/crd/mcp.kuadrant.io_mcpserverregistrations.yaml is excluded by !config/crd/mcp.kuadrant.io_*.yaml
📒 Files selected for processing (20)
  • .github/workflows/conformance.yaml
  • api/v1alpha1/types.go
  • bundle/manifests/mcp.kuadrant.io_mcpserverregistrations.yaml
  • docs/design/backend-mcp-management.md
  • docs/guides/external-mcp-server.md
  • docs/guides/register-mcp-servers.md
  • docs/reference/mcpserverregistration.md
  • internal/config/config_writer.go
  • internal/config/mcpservers_test.go
  • internal/config/types.go
  • internal/controller/mcpserverregistration_controller.go
  • internal/controller/server_validator.go
  • internal/controller/server_validator_test.go
  • internal/mcp-router/request_handlers.go
  • internal/mcp-router/response_builder.go
  • internal/mcp-router/response_builder_test.go
  • internal/mcp-router/server_test.go
  • tests/e2e/custom_tls_test.go
  • tests/e2e/happy_path_test.go
  • tests/e2e/verifiers.go
💤 Files with no reviewable changes (6)
  • internal/controller/server_validator_test.go
  • docs/reference/mcpserverregistration.md
  • tests/e2e/verifiers.go
  • bundle/manifests/mcp.kuadrant.io_mcpserverregistrations.yaml
  • api/v1alpha1/types.go
  • internal/controller/server_validator.go
✅ Files skipped from review due to trivial changes (3)
  • docs/guides/external-mcp-server.md
  • docs/guides/register-mcp-servers.md
  • docs/design/backend-mcp-management.md
🚧 Files skipped from review as they are similar to previous changes (9)
  • internal/mcp-router/response_builder_test.go
  • internal/mcp-router/server_test.go
  • internal/mcp-router/response_builder.go
  • internal/config/config_writer.go
  • internal/mcp-router/request_handlers.go
  • .github/workflows/conformance.yaml
  • tests/e2e/custom_tls_test.go
  • tests/e2e/happy_path_test.go
  • internal/controller/mcpserverregistration_controller.go

Labels

  • high-risk: Touches concurrency, auth, sessions, CRDs, ext_proc, or routing
  • review-effort/large: High review effort (4-5): many files, complex, cross-cutting

Development

Successfully merging this pull request may close these issues.

Tech Debt - Improve how checking the status of the MCPServers with the gateway is handled

3 participants