Proposal for GA version of automated etcd backups#2035
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/cc @hasbro17 |
|
Supersedes #1646 |
8d307ef to
04ea6e2
Compare
|
|
||
| - Is there any need for `EtcdBackupPolicy` to be generalized for components other than etcd? In the original design doc, the `Backup` API had an etcd field so it could support additional types later. In my opinion this API is cleaner when streamlined for the specific use-case we are designing for, and if a new type of backup was needed in future that should be its own separate API. | ||
|
|
||
| - Should retention rules also include MinQuantity that enforces a minimum N backups are retained when combined with other rules (e.g. MaxAge, MaxSize). |
There was a problem hiding this comment.
I think for starters, you should just introduce the minimum quantity as the only retention option and cull by age.
We had a lot of trouble getting the retention to work correctly due to:
- ordering issues: you need to run retention before taking a backup to free space - but that means you may go over the max size and count limit after running the backups (IIRC we have funny assertions in our integration tests for this)
- unknown backup size: we don't have the information about how big the incoming backup snapshot will be, so it's very difficult to keep it within MaxSize. This is particularly annoying when a customer defines a PV with 10 gigs of space, sets the max size to 10 gigs but then still runs out of space.
Creating such a big test matrix with different rules chained together is probably confusing for our customers and very difficult to get right and properly test. Imagine you have to constantly take backups and prune them as well.
There was a problem hiding this comment.
The design here is more of a threshold to trigger pruning than a hard cap. Because yes it gets tricky when you try to preemptively prune for the snapshot that hasn't been taken yet. My idea was to do all prune operations at the API level, and then reconcile that with the state of files in storage. Totally separate from snapshotting schedule, no promises made about order of operations.
But I'm fine with simplifying retention options here so long as the API design allows for extensions in the future. I think that makes sense, and will certainly make testing much easier.
There was a problem hiding this comment.
I leave this up to you guys, just wanted to share what a PITA it was to be smart about it before ;)
| - Should retention rules also include MinQuantity that enforces a minimum N backups are retained when combined with other rules (e.g. MaxAge, MaxSize). | ||
| - Would prevent cases where new backups have failed and all the existing backups get aged out | ||
|
|
||
| - Creating an EtcdBackup could allow a user to escape their namespace and privileges. This feature requires an auth check to only allow cluster-admins to interact with the backups. |
There was a problem hiding this comment.
is that so? you'd need to have admin rbac to create those, won't you?
There was a problem hiding this comment.
This was a note from @atiratree. A user granted RBAC permission to create an EtcdBackup would effectively be able to read anything in the cluster by backing up to a volume they control.
| - Use an existing open source backup solution such as Velero | ||
| - To be discussed further with the team that made the original enhancement |
There was a problem hiding this comment.
I know we're going to discuss this on Friday, but unless you find the last three PMs that were involved in this - I don't think we'll really find an answer.
There was a problem hiding this comment.
Figured, that's fine. Included here just in case anyone had something to say about it before we have our meeting.
04ea6e2 to
1dda643
Compare
|
@bhperry: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Enhancement building on the tech preview version of automated etcd backups. Standardizes CRD names and updates them to be more flexible for different use-cases, including support for on-install backup schedules saved on local storage on each master node without using a separate backup mechanism.
CC @dusk125 @atiratree