docs: add design doc for operator/operand repo split#1078
docs: add design doc for operator/operand repo split#1078Patryk-Stefanski wants to merge 6 commits into
Conversation
Covers repository boundary, operator-operand runtime contracts (config secret schema, status endpoint, deployment contract), CI/CD pipeline split, OLM/release pipeline considerations, phased migration plan, and full coupling analysis. Relates-to: #1020 Signed-off-by: Patryk Stefanski <pstefans@redhat.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughNew design document formalizing splitting the mcp-gateway monorepo into operand (mcp-gateway) and operator (mcp-gateway-operator) repos. It defines repo boundaries, the ChangesRepository Split Design
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Suggested labels
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/design/repo-split/repo-split-design.md (1)
350-350:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winEnsure the file ends with a trailing newline.
Line 350 appears to be EOF without a final newline; please add one to satisfy repo-wide file rules.
As per coding guidelines: "Files must end with a newline."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/repo-split/repo-split-design.md` at line 350, The file docs/design/repo-split/repo-split-design.md is missing a trailing newline at EOF; open the file and add a single newline character at the end so the file ends with a newline (ensure your editor/save settings preserve it), then commit the change so the file complies with the "Files must end with a newline" rule.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design/repo-split/repo-split-design.md`:
- Around line 41-59: The fenced code blocks containing repository tree/layout
diagrams (the triple-backtick blocks showing "mcp-gateway/ ... go.mod" and the
other similar blocks around the same sections) are missing a language identifier
and should be updated to include one (e.g., add "text" after the opening ```);
locate the three markdown fenced blocks referenced in the diff (the tree/layout
blocks at the earlier block and the other two blocks noted in the comment) and
change their opening fences from ``` to ```text so the linter MD040 is
satisfied.
- Around line 7-349: The design doc docs/design/repo-split/repo-split-design.md
is missing mandated sections and subsections; add top-level sections "Problem",
"Goals/Non-Goals", "Job Stories", "Security Considerations", "Relationship to
Existing Approaches", "Future Considerations", and "Execution", and within the
existing "Design" section add subsections "Prerequisites", "Flow", "Component
Responsibilities", "API Changes", and "Data storage". Insert concise content
matching the repo-split context (referencing symbols like mcp-gateway,
mcp-gateway-operator, Config Secret `mcp-gateway-config`,
StatusResponse/ServerStatus structs, and the three-phase Migration Plan) so each
new section clearly states the issue, objectives, responsibilities, expected
API/format changes (e.g., configVersion and /status contract), storage impacts,
security considerations, and an explicit Execution plan aligned to the existing
Phase 1–3 migration steps.
- Around line 20-218: Add a mermaid flow diagram illustrating the
operator↔operand interactions (deployments, config secret write/read, status
call to GET /status, image/flag/env var/port contracts) and insert it into the
Design section (near "Operator-Operand Contract" or right after "Repository
Structure"); also add persona-based job stories (one-liners using the required
template "When <situation>, <actor> <context> wants <action/outcome> so that
<benefit>") covering the listed personas — Platform Engineer, MCP client user,
MCP Developer, non-interactive agent — and include both happy-path and at least
one edge-case story for each (e.g., unknown configVersion, disabled server,
missing env var, protocol version mismatch), ensuring each story references the
relevant contract pieces like the config secret (mcp-gateway-config), GET
/status, and deployment contract items.
---
Outside diff comments:
In `@docs/design/repo-split/repo-split-design.md`:
- Line 350: The file docs/design/repo-split/repo-split-design.md is missing a
trailing newline at EOF; open the file and add a single newline character at the
end so the file ends with a newline (ensure your editor/save settings preserve
it), then commit the change so the file complies with the "Files must end with a
newline" rule.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1c955975-7668-490c-ab20-346b152b0417
📒 Files selected for processing (1)
docs/design/repo-split/repo-split-design.md
Signed-off-by: Patryk Stefanski <pstefans@redhat.com>
|
|
||
| #### Cross-Repo E2E | ||
|
|
||
| The operator repo's e2e tests pull the operand image by tag: |
There was a problem hiding this comment.
The issue is we want a bunch of e2e tests to execute on changes to the gateway not just on changes to the operator
There was a problem hiding this comment.
What if the operand repo's CI builds a temporary image from the PR, deploys the latest released operator (via Helm), and runs the operator repo's e2e suite against that image.
The e2e test definitions stay in the operator repo as the single source of truth — the operand CI pulls them at test time (git checkout or similar). The operator repo continues running the same suite on its own PRs against a released operand image.
This gives full e2e coverage on operand PRs without duplicating the test suite or adding cross-repo dispatch latency.
There was a problem hiding this comment.
Well yes that works in many cases. But it doesn't work when the operator and the operand change together.
This is not that un-common, by nature they are tightly coupled components, this is why the single monorepo works so well.
Without the mono repo you end up In a chicken and egg situation where a new flag is added or env var you end up with a chicken and egg situation. (operator needs new image of the operand to properly test , operand needs a new image of the operator to properly test).
To do this properly we probably need the following:
-
Helm based deployment in the operand repo that deploys the gateway (without operator)
-
Gateway focused e2e test suite in the operand repo that executes against functionality of the gateway (creates the config secret with expected config, sets the needed env vars / flags to enable the features etc under test , uses the test servers). Main functionality it tested here.
-
Helm based deployment in the operator repo that deploys the operator
-
Integration (controller envtest suite )in the operator repo that tests it created the deployment and secret etc correctly based on MCPServerReg , MCPGatewayExt etc...
-
Manually triggered job that deploys version x of the operand with version y of the operator. (could execute nightly also, and must be executed on release).
So to summerise:
Operand repo — tests the gateway as a gateway. Direct config secret, direct flags, no operator in the loop. This is where most feature work lands and where most e2e coverage lives.
Operator repo — tests the controller as a controller. envtest proves it produces correct Deployments/Secrets/RBAC from CRs. No real gateway needed.
Cross-repo job — the integration point. Manually triggered or nightly, catches version skew. Required gate before release.
Or we just keep it as a mono repo :)
There was a problem hiding this comment.
I added the e2e suggestion to the design doc, keeping it a mono repo I think is a wider discussion 😆
|
|
||
| ### OLM and Release Pipeline | ||
|
|
||
| CI is GitHub Actions-based — no Tekton pipelines to split. The OLM packaging artifacts (`bundle/`, `catalog/`, `bundle.Dockerfile`, `build/olm.mk`, `utils/generate-catalog.sh`) and AppStudio ReleasePlan resources all move to the operator repo. The operand repo has no OLM footprint. |
There was a problem hiding this comment.
Old name for konflux, will update.
…oduct build repos Signed-off-by: Patryk Stefanski <pstefans@redhat.com>
Signed-off-by: Patryk Stefanski <pstefans@redhat.com>
| │ ├── mcp-system/ # K8s deployment manifests | ||
| │ ├── test-servers/ # K8s manifests for test servers | ||
| │ └── samples/ # example manifests | ||
| ├── tests/e2e/ # full e2e tests (pulls released operand image) |
| - The operator must not remove a field without a deprecation period spanning at least one operand release, preventing older operands from receiving a config missing a field they expect. | ||
| - New required fields must have sensible defaults so older operators that don't set them still produce valid config. | ||
| - Breaking changes require a coordinated release with both repos tagged. | ||
| - The config includes a top-level `configVersion: 1` field. The operand checks this on load — known versions are processed normally; unknown versions log a warning with the expected and actual values. This enables detection of operator/operand version mismatches without hard failures. |
There was a problem hiding this comment.
what is deciding the configVersion? Is that based on something like the operator version?
There was a problem hiding this comment.
It's a schema version, not tied to the operator release version. Renamed to schemaVersion to make that clearer.
It only bumps when the config secret format has a breaking structural change (field renames, type changes, removed fields). Additive changes like new optional fields don't bump it. This way the operand just checks "do I understand this config shape?" — a simple integer comparison — without needing to know anything about operator release versions. The operator can ship many releases without the schema version changing at all.
| | Health | HTTP | `GET /readyz` on port 8080 | | ||
| | Config mount | Volume | `/config/config.yaml` | | ||
|
|
||
| ### MCP Spec Considerations |
There was a problem hiding this comment.
I think we cab remove this section. It has no bearing on the the actual design imo
|
|
||
| A workflow in the operator repo that deploys both components together and runs full integration tests: | ||
|
|
||
| - **Nightly:** Latest released operator + latest released operand. Catches drift. |
There was a problem hiding this comment.
I think this should run main against main so build a "nigthly" or something like that tag and execute the :nightly operator with the :nightly operand
| A workflow in the operator repo that deploys both components together and runs full integration tests: | ||
|
|
||
| - **Nightly:** Latest released operator + latest released operand. Catches drift. | ||
| - **Manual:** Accepts image overrides for both components (e.g., operand PR image + operator PR image). Used for coordinated changes that touch both repos. |
There was a problem hiding this comment.
This manual one allows us to do something custom. Like run rc1 of the operand with v1.5 of the operator cause there were no changes to the operator code
|
|
||
| - **Operator repo** gets a new `Makefile` with controller-specific targets: `generate`, `manifests`, `docker-build`, `test-unit`, `test-controller-integration`, `lint`, `helm-*`, and CRD generation (`controller-gen`). | ||
| - **Operand repo** retains the existing `Makefile` but removes controller-specific targets (`docker-build-controller`, `generate`, `manifests`, CRD generation). Keeps `build`, `docker-build`, `test-unit`, `lint`, and performance test targets. | ||
|
|
There was a problem hiding this comment.
We need to have an operator make command that will deploy the operator with a sepecific version of the operand. So from the operand we could run make load-image and then `make deploy-operator OPERAND_IMAGE=xxxx:y
|
|
||
| ### Test Server Images | ||
|
|
||
| Test server source code (`tests/servers/`) stays in the operand repo and produces test server container images. The operator repo's e2e tests reference these images. The operator repo contains the Kubernetes manifests (`config/test-servers/`) that deploy them but does not build them — it pulls pre-built images from the operand repo's CI. |
There was a problem hiding this comment.
Do we still need tests/servers in the operator repo? Perhaps we just need one so we can test MCPServerRegistration?
maleck13
left a comment
There was a problem hiding this comment.
few more minor things but getting there
- Update operator repo structure (catalog, rbac, minimal test server) - Clarify configVersion is a schema version, not release version - Remove MCP Spec Considerations section (no bearing on design) - Nightly cross-repo job builds from main, not released images - Add make deploy-operator OPERAND_IMAGE= target to Makefile section - Reduce operator test servers to one minimal server for MCPServerRegistration Signed-off-by: Patryk Stefanski <pstefans@redhat.com>
Signed-off-by: Patryk Stefanski <pstefans@redhat.com>
|
The doc covers which code moves where, but the user-facing docs boundary could be stated more explicitly. The "What Moves" table lists A few places where this could be clearer:
|
|
Let's hold this, pending alignment with Kuadrant/architecture#179 |
|
Yes it would be worth thinking about. Rather than splitting out the operator. That said Kuadrant is not currently a multi-tenant deployment but MCP-Gateway is. If we could put the effort we were going to put into this work into the umbrella operator approach that would likely be a better use of time |
|
cc @mikenairn |
|
/closing for now , superseded by #1198 |
Summary
mcp-gatewayinto two repos:mcp-gateway(operand) andmcp-gateway-operator(control-plane)Relates-to: #1020
Test plan
grep -rn 'api/v1alpha1' internal/broker/ cmd/mcp-broker-router/)internal/config/types.goSummary by CodeRabbit