User Experience
Today, in developer preview, the user can fail over between clusters by running CLI commands that update the active groups TXT record. However, group transitions can stall silently because of a premature reconcile bug, registry entries may be inaccurate after transitions, there is no way to preview the impact of removing a group before executing it, and no metrics exist to confirm that failover completed. The user is left polling DNSRecord statuses and inspecting DNS directly.
In tech preview, these gaps are resolved: group status changes propagate immediately, registry entries stay accurate throughout transitions, and the CLI previews affected endpoints before a group removal is confirmed. Prometheus metrics expose group state and cleanup operations, giving the user clear observability into failover progress. The result is that when a cluster goes down, the user runs a single CLI command to activate the standby group, watches metrics and record statuses converge, and has confidence the failover completed cleanly.
User Stories
1. Reliable group transitions with E2E validation
As a cluster operator, I want group status changes to propagate immediately, registry entries to stay accurate across active/inactive transitions, and the CLI to preview impact before removing a group, so that failover is reliable, predictable, and safe — validated by E2E tests against real DNS providers.
Acceptance Criteria: Premature reconcile check no longer blocks group/status updates (#664). Inactive controllers always write accurate groupID to provider registry (#637). remove-active-group CLI shows affected endpoints before execution (#663, #620). E2E tests cover group assignment, transitions, and cleanup against real providers (#665).
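To make the impact-preview criterion concrete, here is a minimal sketch of the selection step: given the published endpoints and the group that owns each one, list what a group removal would touch without changing anything. It is not the remove-active-group implementation. The Endpoint type, its fields, and both function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical view of one published DNS endpoint and its owning group."""
    dns_name: str
    record_type: str
    targets: tuple
    group: str


def preview_group_removal(endpoints, group):
    """Return the endpoints owned by `group`, sorted by name; read-only."""
    affected = [e for e in endpoints if e.group == group]
    return sorted(affected, key=lambda e: (e.dns_name, e.record_type))


def format_preview(affected, group):
    """Render the preview as text an operator could review before confirming."""
    if not affected:
        return f"No endpoints are owned by group {group!r}."
    lines = [f"Removing group {group!r} would affect {len(affected)} endpoint(s):"]
    for e in affected:
        lines.append(f"  {e.dns_name} {e.record_type} -> {', '.join(e.targets)}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative data only; the names and addresses are made up.
    published = [
        Endpoint("api.example.com", "A", ("192.0.2.10",), "group-a"),
        Endpoint("api.example.com", "A", ("192.0.2.20",), "group-b"),
    ]
    print(format_preview(preview_group_removal(published, "group-a"), "group-a"))
```

Keeping the preview a pure function of the current endpoint list means the same result can back both the confirmation prompt and an E2E assertion.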
Tasks:
2. Group-aware observability
As an SRE, I want Prometheus metrics exposing each record's group assignment, active/inactive state, and inactive group cleanup operations, so that I can observe failover progress and build alerts.
Acceptance Criteria: Metric for current group assignment per DNSRecord. Metric for active/inactive state per DNSRecord. Metric for inactive group cleanup operations. Metrics documented with example Grafana queries or recording rules.
3. Update documentation for tech preview
As a cluster operator, I want the failover docs to reflect the resolved known issues, updated reconcile behaviour, and available metrics, so that the documentation accurately represents the tech preview state of the feature.
Acceptance Criteria: Remove known issue #637 from migration docs once resolved. Update "dev preview" references to "tech preview". Document new metrics and observability options. Document CLI impact preview capability.