From 59e72598be33517e9915d1ed0cacd586594674b1 Mon Sep 17 00:00:00 2001
From: Jim Fitzpatrick
Date: Tue, 8 Apr 2025 15:05:46 +0100
Subject: [PATCH 1/3] ADD: RFC Resilient Deployment

Start the RFC for Resilient Deployment

Signed-off-by: Jim Fitzpatrick
---
 rfcs/0000-resilient_deployment.md | 182 ++++++++++++++++++++++++++++++
 1 file changed, 182 insertions(+)
 create mode 100644 rfcs/0000-resilient_deployment.md

diff --git a/rfcs/0000-resilient_deployment.md b/rfcs/0000-resilient_deployment.md
new file mode 100644
index 0000000..e330c0d
--- /dev/null
+++ b/rfcs/0000-resilient_deployment.md
@@ -0,0 +1,182 @@
+# RFC Template
+
+- Feature Name: Resilient Deployment
+- Start Date: 2025-04-07
+- RFC PR: [Kuadrant/architecture#0000](https://github.com/Kuadrant/architecture/pull/0000)
+- Issue tracking: [Kuadrant/architecture#117](https://github.com/Kuadrant/architecture/issues/117)
+
+# Summary
+[summary]: #summary
+
+
+
+# Motivation
+[motivation]: #motivation
+
+
+As our user move to deploying kuadrant into production environments there is a want to create more resilient deployments.
+This includes deploying multiply replicas of the data plane components, Authorino & Limitador.
+In the case of Limitador, this also includes being able to persist counters.
+
+As Kuadrant we want to provide a user experience that makes deploying a more resilient product possible.
+
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+
+
+
+
+
+
+
+At the core, there are two different areas of focus required here.
+There is the resilience of the deployments, Authorino & Limitador, and there is the resilience of the counters used within Limitador.
+These two areas address the resiliency as whole.
+
+For the deployments there are a number of configurations required.
+- TopologyConstraints
+- PodDisruptionBudget
+- Resource Limits
+- Replicas
+With Authoring and Limitador there are different ways of configuring some of the features.
+
+The counters in Limitador can be persisted with external storage.
+- [Counter storages](https://github.com/Kuadrant/limitador/blob/main/doc/server/configuration.md#counter-storages)
+- [Limitador API](https://github.com/Kuadrant/limitador-operator/blob/main/api/v1alpha1/limitador_types.go#L203-L211)
+
+Configuration of the feature will be done in the Kuadrant CR.
+This is to make it simpler for the user.
+**API Design (WIP)**
+```yaml
+apiVersion: kuadrant.io/v1beta1
+kind: Kuadrant
+spec:
+  resilience:
+    authorization: True
+    rateLimiting: True
+    counterStorage: {} # lifts storage struct from the limitador CR.
+```
+
+Equally important, if not more important will be relaying information back to the user on the status of the configuration.
+What the status block looks like is unknown currently.
+
+
+## Behavior
+It is important that we understand what the behavior of this feature should be, and convey that behavior to the user clearly (documentation).
+There are a number of uses that needs to be covered in the first iteration of this feature.
+- New installations
+- Existing installations with zero custom configuration
+- Existing installations with custom configuration
+- Removal of configuration
+- SRE reaction events
+- The user wants more
+
+### New installations
+This is by far the simplest configuration.
+By default, the all resilience configuration will be blank.
+Thus, the feature will be disabled.
+One reason for this choice is area the counter storage requiring configuration.
+
+### Existing installations with zero custom configuration
+This act very much like a new installation.
+The resilience configuration will be blank, thus, the feature is disabled.
+
+Once the feature is enabled the all the relevant resources will be created, with the kuadrant-operator having ownership where possible.
+
+### Existing installations with custom configuration
+In the case the installation has custom configuration nothing changes till the users defines the resilience feature in the Kuadrant CR.
+
+Once the user configures the resilience feature the kuadrant-operator will create any missing configurations, and take ownership of those configurations.
+The kuadrant-operator will not modify, extend, update, or take ownership of any existing configuration[^1].
+However, the status will reflect there is configuration that the kuadrant-operator doesn't owner, or there is configuration outside what the kuadrant-operator expects.
+
+[^1]: This true for any resource that is not already managed by the kuadrant-operator.
+
+### Removal of configuration.
+In this case the user has configured the resilience features in the past, and is turning them off.
+Where the features are turned off, the kuadrant-operator will clean resources it owns.
+Even if the user has modified that configuration.
+
+This needs to be called out very clearly in the documentation.
+Why we do this is to simplify the reconciliation of configuration.
+
+### SRE reaction events
+From time to time SRE teams would need to change configurations for other to address issues.
+We should allow the user configure the deployment to a state that is we regard as not resilient.
+The one example that comes to mind is scaling replicas to zero for address some issue.
+The user should be allowed to that, but the Kuadrant CR should state the configuration is below expected spec.
+The below spec would be regarded as an error.
+
+### The user wants more
+The user knows more about their deployment than we ever can.
+We need to allow the user to modify the configuration to suit their needs.
+When a user modifies the configuration, the kuadrant-operator should not reconcile back the changes.
+
+If the user changes are below the minimum spec that we state, this should be reflected in the status.
+If the user changes the configuration at all it should be reflected in the status.
+However, when the spec is more than what we regard as minimum the status should be a highlight not warning or error.
+How this is reflected in the status of the Kuadrant CR is unknown currently.
+
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+
+
+
+
+
+
+
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+
+
+
+
+# Prior art
+[prior-art]: #prior-art
+
+
+
+
+
+
+
+
+
+
+
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+
+
+
+
+Should resilience be enabled by default?
+While the simple answer would be yes, there is the issue of a configuration being required for counter persistence's.
+
+While the kuadrant-operator will not take ownership of existing configuration.
+What the sub operators would do is unknown at this stage if the feature is enabled via their CRs.
+
+# Future possibilities
+[future-possibilities]: #future-possibilities
+
+
+
+
+
+

From 7157a8f50bfbb72eb2318e036575d31907bbc43c Mon Sep 17 00:00:00 2001
From: Jim Fitzpatrick
Date: Fri, 11 Apr 2025 09:42:52 +0100
Subject: [PATCH 2/3] UPDATE: Expand the content.

Signed-off-by: Jim Fitzpatrick
---
 rfcs/0000-resilient_deployment.md | 159 +++++++++++++++++++++++++++++-
 1 file changed, 158 insertions(+), 1 deletion(-)

diff --git a/rfcs/0000-resilient_deployment.md b/rfcs/0000-resilient_deployment.md
index e330c0d..c544da1 100644
--- a/rfcs/0000-resilient_deployment.md
+++ b/rfcs/0000-resilient_deployment.md
@@ -2,7 +2,7 @@

 - Feature Name: Resilient Deployment
 - Start Date: 2025-04-07
-- RFC PR: [Kuadrant/architecture#0000](https://github.com/Kuadrant/architecture/pull/0000)
+- RFC PR: [Kuadrant/architecture#119](https://github.com/Kuadrant/architecture/pull/119)
 - Issue tracking: [Kuadrant/architecture#117](https://github.com/Kuadrant/architecture/issues/117)

 # Summary
@@ -132,6 +132,152 @@ How this is reflected in the status of the Kuadrant CR is unknown currently.



+Given the spec below, these are the steps that would be required.
+```yaml
+apiVersion: kuadrant.io/v1beta1
+kind: Kuadrant
+spec:
+  resilience:
+    authorization: True
+    rateLimiting: True
+    counterStorage: {} # lifts storage struct from the limitador CR.
+```
+
+## spec.resilience.authorization
+For `spec.resilience.authorization` we can make use of some features within the authorino CR, but it will require updating the authorino deployment and creating a PodDisruptionBudget CR.
+
+The authorino CR allows the setting of replicas `spec.replicas`.
+This should be set to 2.
+The user should have the power to modify this number to what they want.
+If the number is less than the minimum we recommend (2), the kuadrant CR should report an error in the status.
+If the user uses a number greater than what we recommend, the kuadrant CR should report an information status.
+```yaml
+kind: Authorino
+metadata:
+  name: authorino
+spec:
+  replicas: 2
+```
+
+The deployment for authorino needs to be modified in the following was.
+`resources.requests` and `topologySpreadConstraints` need to be added.
+The kuadrant status is easier to manage with the `resources.requests` as it can be less than what is recommended.
+Values for the `resources.requests` need to be discovered.
+For the `topologySpreadConstraints`, on the other hand, it is not as easy to state whether they are out of spec in a good way (higher spec) or a bad way (lesser spec).
+Still, the status in the kuadrant CR needs to reflect these resources being out of spec.
+```yaml
+kind: Deployment
+metadata:
+  name: authorino
+spec:
+  template:
+    spec:
+      containers:
+        - name: authorino
+          resources:
+            requests:
+              cpu: 10m
+              memory: 10Mi
+      topologySpreadConstraints:
+        - maxSkew: 1
+          topologyKey: kubernetes.io/hostname
+          whenUnsatisfiable: ScheduleAnyway
+          labelSelector:
+            matchLabels:
+              authorino-resource: authorino
+        - maxSkew: 1
+          topologyKey: kubernetes.io/zone
+          whenUnsatisfiable: ScheduleAnyway
+          labelSelector:
+            matchLabels:
+              authorino-resource: authorino
+```
+
+The PodDisruptionBudget is the next resource that needs to be created.
+This resource does not exist in a normal deployment of kuadrant.
+The kuadrant-operator will therefore be required to take ownership of the resource.
+```yaml
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: authorino
+spec:
+  maxUnavailable: 1
+  selector:
+    matchLabels:
+      authorino-resource: authorino
+```
+
+
+## spec.resilience.rateLimiting
+For `spec.resilience.rateLimiting` we can make use of most features within the limitador CR, but will be requiring modifying the limitador deployment for the `topologySpreadConstraints`.
+
+The limitador CR allows the setting of the `pdb`(PodDisruptionBudget), `resourceRequirements.requests`, and `replicas`.
+These resources follow the same user updating feature.
+When the feature is active, the status of each section should be reported when it is out of spec.
+```yaml
+apiVersion: limitador.kuadrant.io/v1alpha1
+kind: Limitador
+metadata:
+  name: limitador
+spec:
+  pdb:
+    maxUnavailable: 1
+  replicas: 2
+  resourceRequirements:
+    requests:
+      cpu: 10m
+      memory: 10Mi
+```
+
+The `topologySpreadConstraints` need to be configured within the limitador deployment.
+```yaml
+# patches (merge) the limitador-operator owned Deployment with this partial resource
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: limitador-limitador
+spec:
+  template:
+    spec:
+      topologySpreadConstraints:
+        - maxSkew: 1
+          topologyKey: kubernetes.io/hostname
+          whenUnsatisfiable: ScheduleAnyway
+          labelSelector:
+            matchLabels:
+              limitador-resource: limitador
+        - maxSkew: 1
+          topologyKey: kubernetes.io/zone
+          whenUnsatisfiable: ScheduleAnyway
+          labelSelector:
+            matchLabels:
+              limitador-resource: limitador
+```
+
+Limitador does require persisted storage.
+This can be configured in the limitador CR under `spec.storage`.
+If this section is missing from the limitador CR, and `spec.resilience.rateLimiting` is enabled, the kuadrant CR needs to raise a warning in the status.
+The status is only a warning as it could be possible the user wants in-memory counters.
+
+## spec.resilience.counterStorage
+For `spec.resilience.counterStorage` the configuration will be added to the limitador CR under the `spec.storage` section.
+
+In the kuadrant CR this will be an object that matches the structure of the limitador [spec.storage](https://github.com/Kuadrant/limitador-operator/blob/626341d2aff5f6b8028317dc0e7d1bb27eb8d3d4/api/v1alpha1/limitador_types.go#L203-L212).
+Unlike the other resilience features, once configured the kuadrant-operator needs to maintain the configuration within the limitador CR.
+
+This will introduce an upgrade issue.
+If the user has configured the storage option within the limitador CR before this feature is added, their configuration will be removed.
+
+## Kuadrant CR Status
+The kuadrant CR Status block will be used to tell the user the state of the different features that have been enabled.
+Each feature can raise warnings and errors independent of each other.
+
+## docs.kuadrant.io guides
+On [docs.kuadrant.io](https://docs.kuadrant.io) there will be a section that describes the features of `spec.resilience`.
+This will show examples of the configuration that each feature will configure, and explain how to modify the configuration.
+
+What happens when the feature is disabled will be outlined, so the user can make an informed decision when disabling the feature.

 # Drawbacks
 [drawbacks]: #drawbacks
@@ -159,6 +305,15 @@ How this is reflected in the status of the Kuadrant CR is unknown currently.



+There has been some exploratory work done, and guides to show a user how to set this up.
+- [github.com/kuadrant/deployment](https://github.com/kuadrant/deployment)
+- [Resilient Deployment of data plane components (docs guide)](https://docs.kuadrant.io/dev/install-olm/#resilient-deployment-of-data-plane-components)
+
+There are a number of other RFCs relating to the introduction of features into the kuadrant CR.
+- [RFC: Observability API](https://github.com/Kuadrant/architecture/pull/97)
+- [RFC for mTLS](https://github.com/Kuadrant/architecture/pull/110)
+- [RFC: Standardize the Kuadrant Spec](https://github.com/Kuadrant/architecture/pull/112)
+
 # Unresolved questions
 [unresolved-questions]: #unresolved-questions

@@ -172,6 +327,8 @@ While the simple answer would be yes, there is the issue of a configuration bein
 While the kuadrant-operator will not take ownership of existing configuration.
 What the sub operators would do is unknown at this stage if the feature is enabled via their CRs.

+How to manage counterStorage configuration on upgrades? More so, how to manage existing storage configurations when this feature is introduced?
+
 # Future possibilities
 [future-possibilities]: #future-possibilities


From bc1b94c5cacbaa3b1f5eac2f163fd6021d7f6334 Mon Sep 17 00:00:00 2001
From: Jim Fitzpatrick
Date: Tue, 22 Apr 2025 09:45:41 +0100
Subject: [PATCH 3/3] UPDATE: Language and Details

Address a number of issues raised during the review process

Signed-off-by: Jim Fitzpatrick
---
 rfcs/0000-resilient_deployment.md | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/rfcs/0000-resilient_deployment.md b/rfcs/0000-resilient_deployment.md
index c544da1..6ec3d60 100644
--- a/rfcs/0000-resilient_deployment.md
+++ b/rfcs/0000-resilient_deployment.md
@@ -19,6 +19,7 @@ This includes deploying multiply replicas of the data plane components, Authorin
 In the case of Limitador, this also includes being able to persist counters.

 As Kuadrant we want to provide a user experience that makes deploying a more resilient product possible.
+Kuadrant will provide an opinionated data plane configuration based on our understanding of the components within kuadrant, and common topology setups.


 # Guide-level explanation
@@ -41,7 +42,7 @@ For the deployments there are a number of configurations required.
 - PodDisruptionBudget
 - Resource Limits
 - Replicas
-With Authoring and Limitador there are different ways of configuring some of the features.
+With Authorino and Limitador there are different ways of configuring some of the features.

 The counters in Limitador can be persisted with external storage.
 - [Counter storages](https://github.com/Kuadrant/limitador/blob/main/doc/server/configuration.md#counter-storages)
@@ -49,7 +50,7 @@ The counters in Limitador can be persisted with external storage.

 Configuration of the feature will be done in the Kuadrant CR.
 This is to make it simpler for the user.
-**API Design (WIP)**
+**API Design**
 ```yaml
 apiVersion: kuadrant.io/v1beta1
 kind: Kuadrant
@@ -78,20 +79,21 @@ There are a number of uses that needs to be covered in the first iteration of th
 This is by far the simplest configuration.
 By default, the all resilience configuration will be blank.
 Thus, the feature will be disabled.
-One reason for this choice is area the counter storage requiring configuration.
+Counter storage configuration requires external resources, so it is not possible for kuadrant to prepopulate the configuration.
+This means the operator cannot enable resilience for limitador by default.

 ### Existing installations with zero custom configuration
 This act very much like a new installation.
 The resilience configuration will be blank, thus, the feature is disabled.

-Once the feature is enabled the all the relevant resources will be created, with the kuadrant-operator having ownership where possible.
+Once the feature is enabled, all relevant resources will be created, with the kuadrant-operator having ownership where possible.

 ### Existing installations with custom configuration
 In the case the installation has custom configuration nothing changes till the users defines the resilience feature in the Kuadrant CR.

-Once the user configures the resilience feature the kuadrant-operator will create any missing configurations, and take ownership of those configurations.
+Once the user configures the resilience feature, the kuadrant-operator will create any missing configuration, and take ownership of it.
 The kuadrant-operator will not modify, extend, update, or take ownership of any existing configuration[^1].
-However, the status will reflect there is configuration that the kuadrant-operator doesn't owner, or there is configuration outside what the kuadrant-operator expects.
+However, the status will reflect there is configuration that the kuadrant-operator doesn't own, or there is configuration outside what the kuadrant-operator expects.

 [^1]: This true for any resource that is not already managed by the kuadrant-operator.

@@ -104,10 +106,10 @@ This needs to be called out very clearly in the documentation.
 Why we do this is to simplify the reconciliation of configuration.

 ### SRE reaction events
-From time to time SRE teams would need to change configurations for other to address issues.
+From time to time SRE teams may need to change some configurations to address specific issues.
 We should allow the user configure the deployment to a state that is we regard as not resilient.
-The one example that comes to mind is scaling replicas to zero for address some issue.
-The user should be allowed to that, but the Kuadrant CR should state the configuration is below expected spec.
+The one example that comes to mind is scaling replicas to zero to address some issue.
+The user should be allowed to do that, but the Kuadrant CR should state the configuration is below expected spec.
 The below spec would be regarded as an error.

 ### The user wants more
@@ -115,6 +117,10 @@ The user knows more about their deployment than we ever can.
 We need to allow the user to modify the configuration to suit their needs.
 When a user modifies the configuration, the kuadrant-operator should not reconcile back the changes.

+However, this is only true for configuration within a resource.
+If the user removes the resource CR completely the kuadrant-operator should recreate that resource.
+When the kuadrant-operator recreates the resource, the operator will use the default value of the base configuration, and not try to recreate the resource with any previous configuration the user may have added to the resource.
+
 If the user changes are below the minimum spec that we state, this should be reflected in the status.
 If the user changes the configuration at all it should be reflected in the status.
 However, when the spec is more than what we regard as minimum the status should be a highlight not warning or error.
@@ -159,7 +165,7 @@ spec:
   replicas: 2
 ```

-The deployment for authorino needs to be modified in the following was.
+The deployment for authorino needs to be modified in the following ways.
 `resources.requests` and `topologySpreadConstraints` need to be added.
 The kuadrant status is easier to manage with the `resources.requests` as it can be less than what is recommended.
 Values for the `resources.requests` need to be discovered.
@@ -210,7 +216,7 @@ spec:


 ## spec.resilience.rateLimiting
-For `spec.resilience.rateLimiting` we can make use of most features within the limitador CR, but will be requiring modifying the limitador deployment for the `topologySpreadConstraints`.
+For `spec.resilience.rateLimiting` we can make use of most features within the limitador CR, but it will be required to modify the limitador deployment for the `topologySpreadConstraints`.

 The limitador CR allows the setting of the `pdb`(PodDisruptionBudget), `resourceRequirements.requests`, and `replicas`.
 These resources follow the same user updating feature.
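
Reviewer sketch (not part of the patch series): the "same user updating feature" referenced in the last hunk is the replica rule from `spec.resilience.authorization`, where a value below the recommended minimum of 2 is reported as an error and a value above it as an informational status. The RFC leaves the status shape undefined, so the following Go is only a minimal sketch of that rule. The names `Severity`, `ClassifyReplicas`, and `recommendedReplicas` are hypothetical and do not come from the kuadrant-operator code base.

```go
package main

import "fmt"

// Severity is a hypothetical status level. The RFC states that the shape of
// the Kuadrant CR status block is still unknown, so this is an assumption.
type Severity string

const (
	SeverityOK    Severity = "OK"
	SeverityInfo  Severity = "Info"
	SeverityError Severity = "Error"
)

// recommendedReplicas is the minimum the RFC recommends for Authorino.
const recommendedReplicas = 2

// ClassifyReplicas applies the rule from the RFC text: fewer replicas than
// recommended is an error, more than recommended is informational, and the
// recommended value itself needs no report.
func ClassifyReplicas(replicas int) Severity {
	switch {
	case replicas < recommendedReplicas:
		return SeverityError
	case replicas > recommendedReplicas:
		return SeverityInfo
	default:
		return SeverityOK
	}
}

func main() {
	// 0 models the "SRE reaction event" of scaling to zero.
	for _, r := range []int{0, 2, 5} {
		fmt.Printf("replicas=%d -> %s\n", r, ClassifyReplicas(r))
	}
}
```

Keeping the classification separate from reconciliation matches the RFC's intent that the operator reports out-of-spec values in the status without reverting the user's changes.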