remove controller->broker status polling for MCPServerRegistration#1187
remove controller->broker status polling for MCPServerRegistration#1187jasonmadigan wants to merge 5 commits into
Conversation
📝 WalkthroughWalkthroughRemoves broker-side status polling from MCPServerRegistration reconciliation, drops discoveredTools from the API/CRD, updates readiness docs and CI checks, adjusts config change detection, and updates router immediate-response headers and E2E assertions. ChangesMCPServerRegistration status, docs, and conformance checks
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
f090f8d to
4140ecb
Compare
|
hey @jasonmadigan sir rather than using portforwarding in confrmance tests why can't we use kubectl get --raw ? |
4bf4be4 to
b41c205
Compare
|
@malladinagarjuna2 port forwarding isn't used. this is a draft PR, there was a port forward while I was debugging something |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
.github/workflows/conformance.yaml (1)
113-120: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winReplace the fixed 60s delay with a real readiness poll.
This sleep is still guesswork: it can make CI slower when the broker is already ready, and it can still race on slower runners. Polling a runtime signal such as broker
/statusor tool availability would make this step much less flaky.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/conformance.yaml around lines 113 - 120, Replace the fixed sleep in the broker discovery step with an active readiness check. In the workflow block that currently logs “Waiting 60s...” and calls sleep, poll a real broker signal instead, such as the broker /status endpoint or a tool-availability check, and exit only when the broker reports ready. Keep the existing context around kubelet sync and tool discovery, but make the wait loop bounded with a timeout so the conformance job remains reliable and doesn’t rely on guesswork.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design/backend-mcp-management.md`:
- Around line 115-121: Keep the storage-object terminology consistent across the
controller flow description and the readiness explanation in this design doc.
Update the relevant paragraphs and any referenced sequence/phase wording so they
all use the same object name as the controller implementation in the
MCPServerRegistration flow, especially around the readiness point in the
broker/controller lifecycle.
In `@internal/controller/mcpserverregistration_controller.go`:
- Around line 240-255: The readiness update in
mcpserverregistration_controller.go is marking the registration Ready even when
no namespace received a config write. Update the reconcile flow around the
upsert loop and updateStatus call so that Ready=True is only set when at least
one namespace was actually written to; if validNamespaces is empty (or no Secret
upsert occurred), leave the registration not ready and report an appropriate
non-success status/reason instead of "config written successfully". Use the
existing symbols updateStatus, validNamespaces, ServerStateDisabled, and
conditionReasonReady to locate and adjust the logic.
- Around line 257-259: The HTTPRoute status update failure is only being logged
in the reconcile path and then discarded, so the controller can report success
with stale status. Update the reconciliation flow around updateHTTPRouteStatus
in mcpserverregistration_controller.go to return the error (or otherwise fail
the reconcile) after logging it, so transient API/conflict failures trigger a
retry and keep the Programmed condition consistent.
In `@tests/e2e/happy_path_test.go`:
- Around line 236-252: The session ID check in the Eventually block is using
mcpsessionid from outside the retry closure, so stale values can carry across
attempts; update the test around mcpClient.CallTool and the Mcp-Session-Id
assertion to use a per-attempt local variable inside the closure, then assign it
back only after the substring check succeeds.
---
Nitpick comments:
In @.github/workflows/conformance.yaml:
- Around line 113-120: Replace the fixed sleep in the broker discovery step with
an active readiness check. In the workflow block that currently logs “Waiting
60s...” and calls sleep, poll a real broker signal instead, such as the broker
/status endpoint or a tool-availability check, and exit only when the broker
reports ready. Keep the existing context around kubelet sync and tool discovery,
but make the wait loop bounded with a timeout so the conformance job remains
reliable and doesn’t rely on guesswork.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2716b232-5fb0-43e9-b566-4c9005b6aa42
⛔ Files ignored due to path filters (2)
charts/mcp-gateway/crds/mcp.kuadrant.io_mcpserverregistrations.yamlis excluded by!charts/mcp-gateway/crds/**config/crd/mcp.kuadrant.io_mcpserverregistrations.yamlis excluded by!config/crd/mcp.kuadrant.io_*.yaml
📒 Files selected for processing (16)
.github/workflows/conformance.yamlapi/v1alpha1/types.gobundle/manifests/mcp.kuadrant.io_mcpserverregistrations.yamldocs/design/backend-mcp-management.mddocs/reference/mcpserverregistration.mdinternal/config/config_writer.gointernal/controller/mcpserverregistration_controller.gointernal/controller/server_validator.gointernal/controller/server_validator_test.gointernal/mcp-router/request_handlers.gointernal/mcp-router/response_builder.gointernal/mcp-router/response_builder_test.gointernal/mcp-router/server_test.gotests/e2e/custom_tls_test.gotests/e2e/happy_path_test.gotests/e2e/verifiers.go
💤 Files with no reviewable changes (6)
- internal/controller/server_validator.go
- api/v1alpha1/types.go
- internal/controller/server_validator_test.go
- docs/reference/mcpserverregistration.md
- tests/e2e/verifiers.go
- bundle/manifests/mcp.kuadrant.io_mcpserverregistrations.yaml
f0b2902 to
0a327c6
Compare
…uadrant#629) Signed-off-by: Jason Madigan <jason@jasonmadigan.com>
Signed-off-by: Jason Madigan <jason@jasonmadigan.com>
…robe Signed-off-by: Jason Madigan <jason@jasonmadigan.com>
0a327c6 to
742d612
Compare
Signed-off-by: Jason Madigan <jason@jasonmadigan.com>
|
👀 |
david-martin
left a comment
There was a problem hiding this comment.
The ConfigChanged thing looks like a good catch, and possible source of timing based bugs.
Not a blocker, but could the /status endpoint be helpful in reducing test time instead of Eventuallys, or not much time gained overall?
A few docs flagged as needing updates:
docs/guides/register-mcp-servers.md and docs/guides/external-mcp-server.md still show a TOOLS column in
kubectl get mcpsr output, and register-mcp-servers.md says "the broker needs a moment to connect and
discover tools and prompts" when describing what Ready means. Both need updating for the new semantics.
very likely yes
I went back and forth on this a few times. pushed updates for the docs too, good catches |
Signed-off-by: Jason Madigan <jason@jasonmadigan.com>
554977a to
37ef576
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
internal/config/mcpservers_test.go (1)
224-239: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winadd a
UserSpecificListregression case alongside this oneThis update covers the URL path, but
ConfigChangednow also treatsUserSpecificListchanges as config-impacting. Adding that table row here would keep the skip-write contract covered from both new branches.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/config/mcpservers_test.go` around lines 224 - 239, Add a regression table case in mcpservers_test.go alongside the existing URL change case to cover ConfigChanged behavior for UserSpecificList. Extend the MCPServer test matrix with matching current/existing entries that differ only in UserSpecificList, and assert expectChanged is true so the skip-write contract is covered for this new config-impacting branch.internal/config/types.go (1)
124-125: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winkeep this comment non-exhaustive or make it complete
Line 124 is already stale again: the predicate also treats
CACert,UserSpecificList, andTokenURLElicitationchanges as config changes. Either drop the field list or make it match the full comparison set.As per coding guidelines,
**/*.go: Minimal, DRY, terse comments (lowercase, only when necessary).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/config/types.go` around lines 124 - 125, The comment above MCPServer.ConfigChanged is stale and too specific because the comparison now includes additional fields beyond the listed ones. Update the comment to either be intentionally non-exhaustive or fully enumerate the current config-change checks, including CACert, UserSpecificList, and TokenURLElicitation, and keep it terse and lowercase to match the style guidelines.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@internal/config/mcpservers_test.go`:
- Around line 224-239: Add a regression table case in mcpservers_test.go
alongside the existing URL change case to cover ConfigChanged behavior for
UserSpecificList. Extend the MCPServer test matrix with matching
current/existing entries that differ only in UserSpecificList, and assert
expectChanged is true so the skip-write contract is covered for this new
config-impacting branch.
In `@internal/config/types.go`:
- Around line 124-125: The comment above MCPServer.ConfigChanged is stale and
too specific because the comparison now includes additional fields beyond the
listed ones. Update the comment to either be intentionally non-exhaustive or
fully enumerate the current config-change checks, including CACert,
UserSpecificList, and TokenURLElicitation, and keep it terse and lowercase to
match the style guidelines.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f0627ec7-75c8-4536-a61b-2ab96757bdfc
⛔ Files ignored due to path filters (2)
charts/mcp-gateway/crds/mcp.kuadrant.io_mcpserverregistrations.yamlis excluded by!charts/mcp-gateway/crds/**config/crd/mcp.kuadrant.io_mcpserverregistrations.yamlis excluded by!config/crd/mcp.kuadrant.io_*.yaml
📒 Files selected for processing (20)
.github/workflows/conformance.yamlapi/v1alpha1/types.gobundle/manifests/mcp.kuadrant.io_mcpserverregistrations.yamldocs/design/backend-mcp-management.mddocs/guides/external-mcp-server.mddocs/guides/register-mcp-servers.mddocs/reference/mcpserverregistration.mdinternal/config/config_writer.gointernal/config/mcpservers_test.gointernal/config/types.gointernal/controller/mcpserverregistration_controller.gointernal/controller/server_validator.gointernal/controller/server_validator_test.gointernal/mcp-router/request_handlers.gointernal/mcp-router/response_builder.gointernal/mcp-router/response_builder_test.gointernal/mcp-router/server_test.gotests/e2e/custom_tls_test.gotests/e2e/happy_path_test.gotests/e2e/verifiers.go
💤 Files with no reviewable changes (6)
- internal/controller/server_validator_test.go
- docs/reference/mcpserverregistration.md
- tests/e2e/verifiers.go
- bundle/manifests/mcp.kuadrant.io_mcpserverregistrations.yaml
- api/v1alpha1/types.go
- internal/controller/server_validator.go
✅ Files skipped from review due to trivial changes (3)
- docs/guides/external-mcp-server.md
- docs/guides/register-mcp-servers.md
- docs/design/backend-mcp-management.md
🚧 Files skipped from review as they are similar to previous changes (9)
- internal/mcp-router/response_builder_test.go
- internal/mcp-router/server_test.go
- internal/mcp-router/response_builder.go
- internal/config/config_writer.go
- internal/mcp-router/request_handlers.go
- .github/workflows/conformance.yaml
- tests/e2e/custom_tls_test.go
- tests/e2e/happy_path_test.go
- internal/controller/mcpserverregistration_controller.go
Closes #629
Summary
Rips out the controller-to-broker status polling. The controller was setting MCPServerRegistration Ready only after HTTP-polling the broker's
/statusendpoint, which was unreliable: tool counts went stale, protocol validation lagged behind reality. Now the controller sets Ready once the config secret is written. Runtime status lives in the broker's/statusendpoint where it belongs.Removing the polling exposed two pre-existing bugs that were hiding behind the polling delay:
ext_proc
ValuevsRawValue: Since Envoy 1.27,send_header_raw_valueis enabled by default. Envoy readsRawValue(bytes) and ignoresValue(string). The router'sWithImmediateResponseandWithImmediateJSONRPCResponsewere usingValue, producing empty Content-Type and mcp-session-id headers on all error responses. TheHeadersBuilderused by non-error paths already usedRawValuecorrectly. This was latent because the old polling delay meant tools were always registered by the time a call arrived, so the error paths never actually ran in practice.Redundant config writes:
UpsertMCPServerwrote the Secret on every reconcile even when content was unchanged, bumpingresourceVersioneach time. Each bump triggered kubelet volume sync and broker config reload, which could restart MCPManagers and temporarily empty the tool registry. Added aConfigChangedguard to skip no-op writes, and fixed the guard to include URL comparison -- without it, changes tospec.pathor backend port would silently produce stale config.What changed
Controller (-280 lines):
ServerValidator(HTTP polling to broker)updateStatus: removedtoolCountparam, Ready on config writeDiscoveredToolsfrom CRD status, regenerated CRDsRouter (bug fix):
WithImmediateResponse/WithImmediateJSONRPCResponse:ValuetoRawValuemcp-session-idheaders:ValuetoRawValueContent-Type: text/plaintoWithImmediateResponseConfig writer:
UpsertMCPServerskips write when server config unchangedConfigChangedincludes URL to prevent stale config on path/port changesE2e tests:
Conformance workflow:
sleep 10with broker/statuspolling to wait for tool discovery after kubelet volume syncDocs: removed
discoveredToolsfrom API ref, clarified status ownership in design docTest evidence
make test-unit: passmake test-controller-integration: 41/41Summary by CodeRabbit
New Features
Bug Fixes
Documentation