Skip to content

docs: add design doc for operator/operand repo split#1078

Closed
Patryk-Stefanski wants to merge 6 commits into
mainfrom
worktree-1020-split-design
Closed

docs: add design doc for operator/operand repo split#1078
Patryk-Stefanski wants to merge 6 commits into
mainfrom
worktree-1020-split-design

Conversation

@Patryk-Stefanski

@Patryk-Stefanski Patryk-Stefanski commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Design doc for splitting mcp-gateway into two repos: mcp-gateway (operand) and mcp-gateway-operator (control-plane)
  • Covers repository boundary, runtime contracts (config secret, status endpoint, deployment), CI/CD split, OLM/release pipeline, and phased migration plan
  • Full coupling analysis: 10 compile-time coupling points (5 controller→operand, 5 operand→CRD) to resolve in Phase 1

Relates-to: #1020

Test plan

  • Review design doc for technical accuracy
  • Verify coupling analysis matches current codebase (grep -rn 'api/v1alpha1' internal/broker/ cmd/mcp-broker-router/)
  • Verify config secret schema matches internal/config/types.go

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive design proposing splitting the monorepo into separate operand and operator repositories. Defines clear repo boundaries, runtime and config contracts between operator and operand, deployment and health-check expectations, CI/CD and release-flow guidance, packaging/artifact ownership, cross-repo compatibility/testing workflow, and a three-phase migration plan with compatibility and migration guidance.

Covers repository boundary, operator-operand runtime contracts
(config secret schema, status endpoint, deployment contract),
CI/CD pipeline split, OLM/release pipeline considerations,
phased migration plan, and full coupling analysis.

Relates-to: #1020
Signed-off-by: Patryk Stefanski <pstefans@redhat.com>
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

New design document formalizing splitting the mcp-gateway monorepo into operand (mcp-gateway) and operator (mcp-gateway-operator) repos. It defines repo boundaries, the mcp-gateway-config Secret schema, operator /status and deployment contracts, MCP dual-spec expectations, CI/CD and OLM packaging split, and a three-phase migration and artifact ownership plan.

Changes

Repository Split Design

Layer / File(s) Summary
Problem statement, boundaries, and repository structure
docs/design/repo-split/repo-split-design.md
Introduces motivation for separating control-plane operator and data-plane operand into independently released repos with no compile-time Go dependencies; specifies directory layout, cmd entrypoints, and package groupings for each repo post-split.
Configuration Secret and /status contract
docs/design/repo-split/repo-split-design.md
Defines the mcp-gateway-config Secret schema and config.yaml shape, operand ignore-unknown-fields behavior and configVersion warning semantics; specifies operator /status polling contract and operator-controlled deployment requirements (image selection, stable CLI/env vars, fixed ports, /readyz, and config mount path).
MCP transition, CI/CD split, and OLM movement
docs/design/repo-split/repo-split-design.md
Documents operand dual-spec requirement during MCP transition; splits CI/CD with cross-repo compatibility jobs, operand Helm e2e and operator envtest, release sequencing with operator pinning operand versions; moves OLM/packaging artifacts to the operator repo.
Phased migration plan, artifact mapping, and post-split expectations
docs/design/repo-split/repo-split-design.md
Details three-phase migration (decouple compile-time imports, create operator repo with new go.mod, clean operand repo), includes "what moves/what stays" ownership tables, and post-split Makefile/docs/test-server responsibilities.
Compile-time coupling analysis and precedent
docs/design/repo-split/repo-split-design.md
Enumerates compile-time coupling points to resolve in Phase 1 and compares the proposal to Kuadrant precedents (authorino/limitador).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Suggested labels

review-effort/large

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding a design document for splitting the monorepo into operator and operand repositories.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch worktree-1020-split-design

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added the review-effort/medium Medium review effort (3): few files, moderate logic label Jun 4, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/design/repo-split/repo-split-design.md (1)

350-350: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Ensure the file ends with a trailing newline.

Line 350 appears to be EOF without a final newline; please add one to satisfy repo-wide file rules.

As per coding guidelines: "Files must end with a newline."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/repo-split/repo-split-design.md` at line 350, The file
docs/design/repo-split/repo-split-design.md is missing a trailing newline at
EOF; open the file and add a single newline character at the end so the file
ends with a newline (ensure your editor/save settings preserve it), then commit
the change so the file complies with the "Files must end with a newline" rule.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/repo-split/repo-split-design.md`:
- Around line 41-59: The fenced code blocks containing repository tree/layout
diagrams (the triple-backtick blocks showing "mcp-gateway/ ... go.mod" and the
other similar blocks around the same sections) are missing a language identifier
and should be updated to include one (e.g., add "text" after the opening ```);
locate the three markdown fenced blocks referenced in the diff (the tree/layout
blocks at the earlier block and the other two blocks noted in the comment) and
change their opening fences from ``` to ```text so the linter MD040 is
satisfied.
- Around line 7-349: The design doc docs/design/repo-split/repo-split-design.md
is missing mandated sections and subsections; add top-level sections "Problem",
"Goals/Non-Goals", "Job Stories", "Security Considerations", "Relationship to
Existing Approaches", "Future Considerations", and "Execution", and within the
existing "Design" section add subsections "Prerequisites", "Flow", "Component
Responsibilities", "API Changes", and "Data storage". Insert concise content
matching the repo-split context (referencing symbols like mcp-gateway,
mcp-gateway-operator, Config Secret `mcp-gateway-config`,
StatusResponse/ServerStatus structs, and the three-phase Migration Plan) so each
new section clearly states the issue, objectives, responsibilities, expected
API/format changes (e.g., configVersion and /status contract), storage impacts,
security considerations, and an explicit Execution plan aligned to the existing
Phase 1–3 migration steps.
- Around line 20-218: Add a mermaid flow diagram illustrating the
operator↔operand interactions (deployments, config secret write/read, status
call to GET /status, image/flag/env var/port contracts) and insert it into the
Design section (near "Operator-Operand Contract" or right after "Repository
Structure"); also add persona-based job stories (one-liners using the required
template "When <situation>, <actor> <context> wants <action/outcome> so that
<benefit>") covering the listed personas — Platform Engineer, MCP client user,
MCP Developer, non-interactive agent — and include both happy-path and at least
one edge-case story for each (e.g., unknown configVersion, disabled server,
missing env var, protocol version mismatch), ensuring each story references the
relevant contract pieces like the config secret (mcp-gateway-config), GET
/status, and deployment contract items.

---

Outside diff comments:
In `@docs/design/repo-split/repo-split-design.md`:
- Line 350: The file docs/design/repo-split/repo-split-design.md is missing a
trailing newline at EOF; open the file and add a single newline character at the
end so the file ends with a newline (ensure your editor/save settings preserve
it), then commit the change so the file complies with the "Files must end with a
newline" rule.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1c955975-7668-490c-ab20-346b152b0417

📥 Commits

Reviewing files that changed from the base of the PR and between 322a175 and 0289a54.

📒 Files selected for processing (1)
  • docs/design/repo-split/repo-split-design.md

Comment thread docs/design/repo-split/repo-split-design.md
Comment thread docs/design/repo-split/repo-split-design.md
Comment thread docs/design/repo-split/repo-split-design.md Outdated
Signed-off-by: Patryk Stefanski <pstefans@redhat.com>
Comment thread docs/design/repo-split/repo-split-design.md
Comment thread docs/design/repo-split/repo-split-design.md

#### Cross-Repo E2E

The operator repo's e2e tests pull the operand image by tag:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is we want a bunch of e2e tests to execute on changes to the gateway not just on changes to the operator

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the operand repo's CI builds a temporary image from the PR, deploys the latest released operator (via Helm), and runs the operator repo's e2e suite against that image.

The e2e test definitions stay in the operator repo as the single source of truth — the operand CI pulls them at test time (git checkout or similar). The operator repo continues running the same suite on its own PRs against a released operand image.

This gives full e2e coverage on operand PRs without duplicating the test suite or adding cross-repo dispatch latency.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well yes that works in many cases. But it doesn't work when the operator and the operand change together.
This is not that un-common, by nature they are tightly coupled components, this is why the single monorepo works so well.

Without the mono repo you end up In a chicken and egg situation where a new flag is added or env var you end up with a chicken and egg situation. (operator needs new image of the operand to properly test , operand needs a new image of the operator to properly test).

To do this properly we probably need the following:

  • Helm based deployment in the operand repo that deploys the gateway (without operator)

  • Gateway focused e2e test suite in the operand repo that executes against functionality of the gateway (creates the config secret with expected config, sets the needed env vars / flags to enable the features etc under test , uses the test servers). Main functionality it tested here.

  • Helm based deployment in the operator repo that deploys the operator

  • Integration (controller envtest suite )in the operator repo that tests it created the deployment and secret etc correctly based on MCPServerReg , MCPGatewayExt etc...

  • Manually triggered job that deploys version x of the operand with version y of the operator. (could execute nightly also, and must be executed on release).

So to summerise:

Operand repo — tests the gateway as a gateway. Direct config secret, direct flags, no operator in the loop. This is where most feature work lands and where most e2e coverage lives.

Operator repo — tests the controller as a controller. envtest proves it produces correct Deployments/Secrets/RBAC from CRs. No real gateway needed.

Cross-repo job — the integration point. Manually triggered or nightly, catches version skew. Required gate before release.

Or we just keep it as a mono repo :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the e2e suggestion to the design doc, keeping it a mono repo I think is a wider discussion 😆


### OLM and Release Pipeline

CI is GitHub Actions-based — no Tekton pipelines to split. The OLM packaging artifacts (`bundle/`, `catalog/`, `bundle.Dockerfile`, `build/olm.mk`, `utils/generate-catalog.sh`) and AppStudio ReleasePlan resources all move to the operator repo. The operand repo has no OLM footprint.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AppStudio?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old name for konflux, will update.

…oduct build repos

Signed-off-by: Patryk Stefanski <pstefans@redhat.com>
@coderabbitai coderabbitai Bot added review-effort/small Low review effort (1-2): straightforward, single file, config/docs and removed review-effort/medium Medium review effort (3): few files, moderate logic labels Jun 4, 2026
Signed-off-by: Patryk Stefanski <pstefans@redhat.com>
@coderabbitai coderabbitai Bot added review-effort/large High review effort (4-5): many files, complex, cross-cutting and removed review-effort/small Low review effort (1-2): straightforward, single file, config/docs labels Jun 5, 2026
│ ├── mcp-system/ # K8s deployment manifests
│ ├── test-servers/ # K8s manifests for test servers
│ └── samples/ # example manifests
├── tests/e2e/ # full e2e tests (pulls released operand image)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is out of date now

- The operator must not remove a field without a deprecation period spanning at least one operand release, preventing older operands from receiving a config missing a field they expect.
- New required fields must have sensible defaults so older operators that don't set them still produce valid config.
- Breaking changes require a coordinated release with both repos tagged.
- The config includes a top-level `configVersion: 1` field. The operand checks this on load — known versions are processed normally; unknown versions log a warning with the expected and actual values. This enables detection of operator/operand version mismatches without hard failures.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is deciding the configVersion? Is that based on something like the operator version?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a schema version, not tied to the operator release version. Renamed to schemaVersion to make that clearer.

It only bumps when the config secret format has a breaking structural change (field renames, type changes, removed fields). Additive changes like new optional fields don't bump it. This way the operand just checks "do I understand this config shape?" — a simple integer comparison — without needing to know anything about operator release versions. The operator can ship many releases without the schema version changing at all.

| Health | HTTP | `GET /readyz` on port 8080 |
| Config mount | Volume | `/config/config.yaml` |

### MCP Spec Considerations

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we cab remove this section. It has no bearing on the the actual design imo


A workflow in the operator repo that deploys both components together and runs full integration tests:

- **Nightly:** Latest released operator + latest released operand. Catches drift.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should run main against main so build a "nigthly" or something like that tag and execute the :nightly operator with the :nightly operand

A workflow in the operator repo that deploys both components together and runs full integration tests:

- **Nightly:** Latest released operator + latest released operand. Catches drift.
- **Manual:** Accepts image overrides for both components (e.g., operand PR image + operator PR image). Used for coordinated changes that touch both repos.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This manual one allows us to do something custom. Like run rc1 of the operand with v1.5 of the operator cause there were no changes to the operator code


- **Operator repo** gets a new `Makefile` with controller-specific targets: `generate`, `manifests`, `docker-build`, `test-unit`, `test-controller-integration`, `lint`, `helm-*`, and CRD generation (`controller-gen`).
- **Operand repo** retains the existing `Makefile` but removes controller-specific targets (`docker-build-controller`, `generate`, `manifests`, CRD generation). Keeps `build`, `docker-build`, `test-unit`, `lint`, and performance test targets.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to have an operator make command that will deploy the operator with a sepecific version of the operand. So from the operand we could run make load-image and then `make deploy-operator OPERAND_IMAGE=xxxx:y


### Test Server Images

Test server source code (`tests/servers/`) stays in the operand repo and produces test server container images. The operator repo's e2e tests reference these images. The operator repo contains the Kubernetes manifests (`config/test-servers/`) that deploy them but does not build them — it pulls pre-built images from the operand repo's CI.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need tests/servers in the operator repo? Perhaps we just need one so we can test MCPServerRegistration?

@maleck13 maleck13 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few more minor things but getting there

- Update operator repo structure (catalog, rbac, minimal test server)
- Clarify configVersion is a schema version, not release version
- Remove MCP Spec Considerations section (no bearing on design)
- Nightly cross-repo job builds from main, not released images
- Add make deploy-operator OPERAND_IMAGE= target to Makefile section
- Reduce operator test servers to one minimal server for MCPServerRegistration

Signed-off-by: Patryk Stefanski <pstefans@redhat.com>
Signed-off-by: Patryk Stefanski <pstefans@redhat.com>
@david-martin

Copy link
Copy Markdown
Member

The doc covers which code moves where, but the user-facing docs boundary could be stated more explicitly. The "What Moves" table lists docs/guides/ and docs/reference/, and the docs.kuadrant.io section mentions updating mkdocs.yml, but a reader (or agent/claude) could miss that all user-facing documentation lives in the operator repo post-split.

A few places where this could be clearer:

  • Repository Boundary table: no mention of docs ownership. A row for user-facing docs (or a note that the operand repo has none) would make the boundary obvious upfront.
  • Repository Structure trees: neither tree shows docs/guides/ or docs/reference/. The operator tree should reflect that these live there.
  • docs.kuadrant.io section: says to update mkdocs.yml references but doesn't explicitly state the principle: all user-facing docs live in the operator repo.
  • Phase 3 cleanup: lists code to remove from the operand but doesn't mention removing docs/guides/ and docs/reference/.

@david-martin

Copy link
Copy Markdown
Member

Let's hold this, pending alignment with Kuadrant/architecture#179
@maleck13 I know you had some thoughts on how including mcp gateway in the umbrella operator work could substitute for this work

@maleck13

Copy link
Copy Markdown
Contributor

Yes it would be worth thinking about. Rather than splitting out the operator. That said Kuadrant is not currently a multi-tenant deployment but MCP-Gateway is. If we could put the effort we were going to put into this work into the umbrella operator approach that would likely be a better use of time

@maleck13

Copy link
Copy Markdown
Contributor

cc @mikenairn

@Patryk-Stefanski

Copy link
Copy Markdown
Contributor Author

/closing for now , superseded by #1198

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review-effort/large High review effort (4-5): many files, complex, cross-cutting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants