NE-2732: Design proposal for Gateway API CRD management mode#2023
NE-2732: Design proposal for Gateway API CRD management mode#2023rikatz wants to merge 11 commits into
Conversation
|
I will help review |
| There is a desire to backport this feature all the way back to | ||
| OpenShift OCP 4.19. The motivation is to allow customers who are |
There was a problem hiding this comment.
Or even OCP 4.18, which would enable a smooth upgrade from 4.18 to 4.19.
There was a problem hiding this comment.
can we go that back on an API backport?
| This enhancement is fully applicable to OKE. Since the | ||
| [gateway-api-without-olm](gateway-api-without-olm.md) enhancement | ||
| enables Gateway API on OKE by eliminating OSSM licensing concerns, | ||
| all three CRD management modes are available on OKE clusters. |
There was a problem hiding this comment.
However, as noted in the "Backport to OCP 4.19" section, we might backport the API to versions of OKE that do rely on an OLM subscription for OSSM.
There was a problem hiding this comment.
but this is not a problem, right?
There was a problem hiding this comment.
The fact that we install OSSM through an OLM subscription shouldn't be a problem. My point is that "the
gateway-api-without-olm enhancement enables Gateway API on OKE by eliminating OSSM licensing concerns" is not relevant when we backport this feature to OCP versions that still depend on OLM for the gateway controller.
There was a problem hiding this comment.
I see. Well, I will add some better wording here given we are again discussing about backporting noOLM
|
|
||
| ## Proposal | ||
|
|
||
| Add a new `gatewayAPI` struct field to the `IngressControllerSpec` |
There was a problem hiding this comment.
A better place for the gatewayAPI field would be ingresses.config.openshift.io or even a new ingresses.operator.openshift.io resource. We previously proposed adding a ingresses.operator.openshift.io resource as a home for OperatorLogLevel in #1314. Whether we can backport the addition of a new resource is an open question.
There was a problem hiding this comment.
I prefer going to what is less friction at this point. If adding a new resource and backporting is a huge friction, I would not do it. As this subject is already "hot" whatever we can make to get less friction I prefer :)
There was a problem hiding this comment.
If we are going for least friction - should we do just do it as an annotation or magical GatewayClass name rather than putting it into an API that isn't quite right? 😄 Playing devils advocate
I wonder if we should have a hybrid approach. A "not so good way, but backports nicely" like tell users to create a "openshift-opt-out-crds" controller-name gatewayclass or an annotation on the the ingresses.config.openshift.io object (yea that's ugly I know). We'd start with and backport this solution.
Then a "proper api" for 5.0 moving forward: ingress.operator.openshift.io. We could delay that design for later, and possibly implement some migration logic, or at least a forced migration (Upgradeable=false) in 5.0 to make sure we can deprecate the "not so good way, but backports nicely".
Definitely more work for us, but does simplify the backporting.
There was a problem hiding this comment.
IIUC API reviewers are fine with us moving with a new API definition, I will go with this path
|
/assign |
|
@rikatz: This pull request references NE-2732 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
everettraven
left a comment
There was a problem hiding this comment.
Overall, this enhancement looks pretty good. I've only got a few comments on the API side of things.
| ### API Extensions | ||
|
|
||
| This enhancement introduces a **new CRD** in the | ||
| `operator.openshift.io/v1` group. The new resource follows the |
There was a problem hiding this comment.
Just a note, net-new APIs must be introduced in the v1alpha1 group so that we never accidentally ship them by default until the feature is GA.
There was a problem hiding this comment.
@Miciah you told me we could do v1 directly, so leaving the discussion here for you :)
happy to update the EP with the conclusion
There was a problem hiding this comment.
There was a problem hiding this comment.
sounds good! I will change here
| // crdManagementMode specifies how the Cluster Ingress | ||
| // Operator manages Gateway API Custom Resource Definitions | ||
| // (CRDs) and the associated Gateway controller stack. | ||
| // | ||
| // When set to "Managed" (the default), CIO installs, owns, | ||
| // and upgrades the Gateway API CRDs, protects them with a | ||
| // Validating Admission Policy, and deploys the full Gateway | ||
| // controller stack (the Istio instance deployed by CIO, | ||
| // GatewayClass, Gateway resources). This is the only fully | ||
| // supported configuration. | ||
| // | ||
| // When set to "Unmanaged", CIO does not install or manage | ||
| // Gateway API CRDs and does not deploy the Gateway controller | ||
| // stack. The cluster administrator or a third-party product | ||
| // is responsible for providing their own CRDs and Gateway | ||
| // controller. CIO reports observational status only. | ||
| // | ||
| // When set to "ExternalCRDs", CIO does not manage Gateway | ||
| // API CRDs but does deploy the OpenShift Gateway controller | ||
| // stack. The cluster administrator brings their own CRDs. | ||
| // This is an unsupported configuration intended for | ||
| // development and testing. |
There was a problem hiding this comment.
My understanding of the Gateway API is very rudimentary because that isn't my wheelhouse, but the scope of management that the field name implies and what these settings actually align with seems to be a bit at odds.
I wonder if a more generic managementMode is more appropriate here? something like:
apiVersion: operator.openshift.io/v1
kind: Ingress
metadata:
name: cluster
spec:
gatewayAPI:
managementMode: Managed | Unmanaged | ControllersOnly(naming is arbitrary here, feel free to use different mode names).
There was a problem hiding this comment.
no, I actually liked the proposal! Will change here, thanks!
| // When set to "ExternalCRDs", CIO does not manage Gateway | ||
| // API CRDs but does deploy the OpenShift Gateway controller | ||
| // stack. The cluster administrator brings their own CRDs. | ||
| // This is an unsupported configuration intended for | ||
| // development and testing. |
There was a problem hiding this comment.
I'm a bit hesitant to include this mode of operation. It isn't really clear to me what the use case for an end-user setting this mode would be such that they would get some kind of value out of it.
The only thing I can think of is to test if an existing version of the default OpenShift controllers can handle a changed API, but if we are documenting what the supported pairing is by shipping, testing, and vetting a fully-managed option I'm not sure why this would be something an end-user of OpenShift would care to put their cluster in an unsupported state for?
If you folks feel strongly that this should be an option for configuration, I'd prefer we make it abundantly clear via the enum value that this is unsupported with a naming pattern like UnsupportedCRDsOnlyUnmanaged
There was a problem hiding this comment.
this one is mostly about:
- layered products (MCP Gateway, RHOAI) wanting to test new CRDs with the current Istio version - eg.: some fields are promoted on a new version of Gateway API that we didn't vetted yet, and users want to test it
- users that previously deployed their CRD and are willing to upgrade and move toward ours but the process of upgrade means going through the upgrade path
btw on the original proposal from @knobunc this is exactly how it is worded and I am fine making the "Unsupported" word explicit part of the name:
[Possibly] No CRDs, but unsupported Gateway: If you install GWAPI and configure the OpenShift GatewayClass and Gateway, we would run our gateway, but this would be an unsupported configuration
| // deploys the full Gateway controller stack (the Istio | ||
| // instance deployed by CIO, GatewayClass, Gateway). This is | ||
| // the default mode and the only fully supported configuration. | ||
| ManagedGatewayAPICRDs GatewayAPIManagementMode = "Managed" |
There was a problem hiding this comment.
General convention is to use the pattern {alias}{representation}, so:
| ManagedGatewayAPICRDs GatewayAPIManagementMode = "Managed" | |
| GatewayAPIManagementModeManaged GatewayAPIManagementMode = "Managed" |
| // GatewayClass, Gateway). The customer brings their own | ||
| // CRDs. This is an UNSUPPORTED configuration useful for | ||
| // development, testing, or advanced users. | ||
| ControllersOnlyGatewayAPICRDs GatewayAPIManagementMode = "ControllersOnly" |
There was a problem hiding this comment.
If we are going to stick with allowing this mode to be configured, and it is truly an unsupported configuration, please make it abundantly clear by prefixing this mode with Unsupported.
It makes it abundantly clear to any future readers of the configuration that they are in an unsupported configuration mode and it makes users think twice because they have to type out the word Unsupported before flipping to that mode.
| ControllersOnlyGatewayAPICRDs GatewayAPIManagementMode = "ControllersOnly" | |
| GatewayAPIManagementModeUnsupportedControllersOnly GatewayAPIManagementMode = "UnsupportedControllersOnly" |
| The following conditions are set within | ||
| `status.gatewayAPI.conditions`: | ||
|
|
||
| | Condition Type | Status | Reason | Description | | ||
| |---|---|---|---| | ||
| | `CRDsManaged` | `True` | `ManagedByCIO` | CIO is actively managing CRDs | | ||
| | `CRDsManaged` | `False` | `Unmanaged` | Administrator chose Unmanaged mode | | ||
| | `CRDsManaged` | `False` | `ControllersOnly` | Administrator chose ControllersOnly mode (unsupported) | | ||
| | `CRDsPresent` | `True` | `CRDsFound` | Gateway API CRDs are present on the cluster | | ||
| | `CRDsPresent` | `False` | `CRDsNotFound` | Gateway API CRDs are not present on the cluster | | ||
| | `CRDsCompliant` | `True` | `VersionMatch` | Installed CRDs match the expected version | | ||
| | `CRDsCompliant` | `False` | `VersionMismatch` | Installed CRDs do not match the expected version. Message includes the expected and actual versions. | | ||
| | `CRDsCompliant` | `Unknown` | `NotApplicable` | Compliance check is not applicable (e.g., Unmanaged mode with no CRDs present) | |
There was a problem hiding this comment.
Just a note, it would be helpful to include this information in some form in the GoDoc for the conditions field of the status so that end-users can have this same information even when using something like kubectl explain ... to get the field documentation of the CR
|
|
||
| // GatewayAPIIngressStatus holds status information for Gateway | ||
| // API integration managed by the Cluster Ingress Operator. | ||
| type GatewayAPIIngressStatus struct { |
There was a problem hiding this comment.
Do we suspect that this Gateway API specific status blob is going to need to contain more than standard status conditions in the near future?
If not, it might be worth it to drop this in favor of the standard operator conditions, just prefixed with GatewayAPI (i.e GatewayAPICRDsManaged)
There was a problem hiding this comment.
yes, I am considering that for instance, we can extend this API in the future like:
- Do OSSM update per patch or minor release (there is an ongoing effort for it) so we will reproduce what version is being configured on status
- Allow cluster admin to pre-define GatewayClasses configurations, we may want to define if these classes were accepted and configured, mismatched, etc per class
There was a problem hiding this comment.
(or even non-gateway api CIO configuration as well)
There was a problem hiding this comment.
Very nice and thorough. EP - thanks for driving this forward. Hope my comments help.
There are mostly nits for the EP. I generally don't have any major concerns about the implementations. I think ingress.operator.openshif.io makes sense right now - but maybe it's worth talking more about future potential config that could be added to this.
| b. If CRDs are present and match the expected version, CIO takes | ||
| ownership (adds labels, deploys VAP). | ||
| c. If CRDs are present but do not match the expected version, CIO | ||
| sets `CRDsCompliant=False` (in `status.gatewayAPI.conditions`) |
There was a problem hiding this comment.
Would CIO set CRDsManaged to False or True in this case? The CRDs aren't compliant, so CIO isn't actually managing them — but it wants to manage them. I'd lean toward False since this is status, not intent, right?
More importantly, will CIO actually run the Gateway API controller logic (install OSSM) while CRDs are non-compliant upon returning to Managed? I could see arguments for both:
Run the OpenShift Gateway controller with non-compliant CRDs: This would let users with third-party or newer CRDs ease their way back into the managed Gateway API controller — they could test OpenShift Gateway alongside their existing setup before fully transitioning.
Refuse to start the Gateway controller while CRDs are non-compliant: Probably the safer move, given the risks of Istio reconciling fields based on CRD schema presence.
I think this should be covered in the EP somewhere either way.
There was a problem hiding this comment.
And lastly, should we go degraded=true or upgradeable=false when CRDsCompliant=false and we are trying to manage them?
There was a problem hiding this comment.
I don't think we should block upgrades, but I think that when CRDsCompliant=false we should also not Manage them, and add a status condition reason on why.
For the former question, I think the CRDsManaged should be False, and for now given this is unsupported and we have discussed this formerly on this EP, it is either "Managed = all in, Unmanaged = nothing" (we have dropped actually the UnsupportedControllerOnly knob that was originally proposed).
Maybe we can soften this restriction later, but I would prefer to start strict and soft, than start soft and figure out things will break
| // Ingress holds cluster-wide configuration for Gateway API | ||
| // integration managed by the Cluster Ingress Operator. The | ||
| // canonical name is `cluster`. |
There was a problem hiding this comment.
nit we can take this up in the API PR - but it does more than holds cluster-wide configuration for Gateway API integration. I think we'd want to reword that to be generic operator configuration (i.e. we might want to add #1314).
There was a problem hiding this comment.
good call, I have asked claude when writing to make it generic and missed this description. I will take the original proposal from logLevel EP
|
|
||
| ## Open Questions | ||
|
|
||
| 1. **CRD cleanup on Unmanaged transition**: Should CIO offer an |
There was a problem hiding this comment.
AKA self destruct button 😸 I vote no.
There was a problem hiding this comment.
lol I vote no as well, I will remove this open question actually
| override" mechanism to bypass CRD succession checks. Should | ||
| `managementMode` replace this mechanism or coexist with it? | ||
|
|
||
| 3. **Telemetry**: What telemetry should be collected? At minimum |
There was a problem hiding this comment.
PM discussion - but I wonder if it would be valuable to have another CRDsVersion status condition and report that to Telemetry so we understand versioning needs. Could be future goals.
There was a problem hiding this comment.
agreed. I think I will add the CRDVersion also as an open question, doesn't need to be explicit goal right now, more like "we know this may be needed"
| would delete all Gateway/HTTPRoute resources. The current | ||
| proposal preserves CRDs during transitions. | ||
|
|
||
| 2. **Interaction with `unsupportedConfigOverrides`**: The CRD |
There was a problem hiding this comment.
I'm not sure what this is referring to, something in https://github.com/openshift/enhancements/blob/master/enhancements/ingress/gateway-api-crd-life-cycle-management.md? I don't see that have any unsupportedConfigOverrides for CRD management yet.
| 5. **Kind name collision**: The proposed Kind `Ingress` in | ||
| `operator.openshift.io/v1alpha1` shares a name with the well-known | ||
| `networking.k8s.io/v1` Ingress. While API groups disambiguate, | ||
| this may cause user confusion with `oc get ingress`. The short |
There was a problem hiding this comment.
this current exists for dns.operator.openshift.io and dns.config.openshift.io and yes it's pretty annoying 😄 but not a deal breaker.
There was a problem hiding this comment.
I think this is a filler, removed
| compliance and mode transitions? | ||
|
|
||
| 4. **Singleton creation**: Should the `cluster` singleton instance | ||
| be created by CVO (via a manifest in the release payload) or |
There was a problem hiding this comment.
Don't quote me on it, but I think CVO is the right answer. You can place it in the manifest directory in cluster-ingress-operator, and the CVO installs it.
There was a problem hiding this comment.
I think CVO does not create it, per the comment here, but I don't know as well. Deferring to @Miciah
| customers on 4.19 with third-party Gateway API CRDs face CRD | ||
| succession conflicts when upgrading, because CIO enforces | ||
| ownership with no opt-out. Backporting lets customers set | ||
| `Unmanaged` before upgrading. |
There was a problem hiding this comment.
nit I get what you are saying, but this is confusing.
I think it really means 4.18 when it talks about "succession conflicts when upgrading". The benefit for 4.19 is the same as 4.20, 4.21, 4.22, and 5.0: new 3rd party implementation requiring newer CRDs can be enabled.
The upgrade problem is solved by the 4.18 backport, not by the 4.19 backport.
everettraven
left a comment
There was a problem hiding this comment.
LGTM from an API perspective.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: everettraven The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
gcs278
left a comment
There was a problem hiding this comment.
I have only nits and non-actionable comments left. The EP is well written and seems straightforward. The placement of this API makes sense in ingress.operator.openshift.io, but I do wonder if all of our future Gateway API knobs will make sense here. I think it's fair to delay that conversation for later.
Thanks!
/lgtm
| 4. CIO sets the following conditions in `status.conditions`: | ||
| - `GatewayAPICRDsManaged=False` (reason: `Unmanaged`) | ||
| - `GatewayAPICRDsPresent=True/False` (observational) | ||
| - `GatewayAPICRDsCompliant=Unknown` (not applicable in this mode) |
There was a problem hiding this comment.
nit Why exactly is it not applicable? Couldn't we still report whether the CRDs are compliant even though we are in Unmanaged?
It'd give the admin a early signal to understand if they are ready to go back to Managed (without having to push that button to get the GatewayAPICRDsCompliant result). We do have the GatewayAPICRDsCompliant logic written either way - might as well keep it running as a signal?
it also is a bit inconsistent with the go docs for GatewayAPICRDsCompliant
// - status: Unknown, reason: "NotApplicable" — compliance
// check is not applicable (e.g., Unmanaged mode with no
// CRDs present).
So should this also specify that it's Unknown only when there aren't CRDs, or is it always Unknown while in Unmanaged?
There was a problem hiding this comment.
I agree with you, good call. I think we should add the condition regardless of Unmanaged or Managed, to signal earlier that the admin can go back to Managed
| // chose Unmanaged mode; CIO does not manage CRDs or the | ||
| // Gateway controller stack. |
There was a problem hiding this comment.
nit just pointing out that GatewayAPICRDsManaged is a bit overloaded for describing these two things, but since they are always coupled, I don't think it's a big deal - we can decompose later if it's a problem. Just an FYI.
There was a problem hiding this comment.
yeah, I think the name is fine for now tbh. CRDsManaged is more like "are we managing them or not" so in the future we can have something like ControllersManaged, etc.
| type GatewayAPIManagementMode string | ||
|
|
||
| const ( | ||
| // GatewayAPIManagementModeManaged means CIO installs, owns, |
There was a problem hiding this comment.
nit when you get to the api PR - I think we should spell out CIO, as ingress operator. I don't think we use CIO anywhere else in user facing docs.
There was a problem hiding this comment.
commented on https://github.com/openshift/api/pull/2890/changes#r3437544164
is it the case where the condition should also be like
status: True, reason: "ManagedByIngressOperator"
instead of ManagedByCIO? ManagedByCIO is just to keep the consistency with noOLM: https://github.com/openshift/cluster-ingress-operator/blob/bf13f251347308b80b170582de455c59cb0510be/pkg/operator/controller/gatewayclass/status.go#L24
There was a problem hiding this comment.
Oh yea good question - I'd spell it out just to be safe.
We are actually losing ManagedByCIO in openshift/cluster-ingress-operator#1456 - the recent upstream changes to sail library collapses it into just a Ready state, so that might be a thing of the past.
|
Thanks for the updates! |
|
@rikatz: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| gateways.gateway.networking.k8s.io \ | ||
| httproutes.gateway.networking.k8s.io \ | ||
| referencegrants.gateway.networking.k8s.io \ | ||
| grpcroutes.gateway.networking.k8s.io |
This proposal establishes a new knob to allow users to manage their Gateway API CRDs within an OCP cluster, establishing 2 modes: Managed and Unmanaged
This proposal intends to provide users with a knob that allows them to manage the Gateway API CRDs without fully losing support of an OCP cluster, and to rollback on their decision in case they want back OCP Gateway API, and also allow the usage of external third party controllers.