DNS Failover via Groups — Tech Preview #776

Description

@philbrookes

User Experience

In developer preview, the user can fail over between clusters by running CLI commands that update the active groups TXT record. Several gaps remain: group transitions can stall silently because of a premature reconcile bug, registry entries may be inaccurate after transitions, there is no way to preview the impact of removing a group before executing it, and no metrics exist to confirm that failover completed. The user is left polling DNSRecord statuses and inspecting DNS directly.

In tech preview, these gaps are resolved: group status changes propagate immediately, registry entries stay accurate throughout transitions, and the CLI previews affected endpoints before a group removal is confirmed. Prometheus metrics expose group state and cleanup operations, giving the user clear observability into failover progress. The result is that when a cluster goes down, the user runs a single CLI command to activate the standby group, watches metrics and record statuses converge, and has confidence the failover completed cleanly.

User Stories

1. Reliable group transitions with E2E validation

As a cluster operator, I want group status changes to propagate immediately, registry entries to stay accurate across active/inactive transitions, and the CLI to preview impact before removing a group, so that failover is reliable, predictable, and safe — validated by E2E tests against real DNS providers.

Acceptance Criteria:
- Premature reconcile check no longer blocks group/status updates (#664).
- Inactive controllers always write accurate groupID to provider registry (#637).
- remove-active-group CLI shows affected endpoints before execution (#663, #620).
- E2E tests cover group assignment, transitions, and cleanup against real providers (#665).
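To make the impact-preview criterion concrete, here is a minimal sketch of the idea in Python: given the endpoints currently published and the list of active groups, compute which endpoints a group removal would affect, without changing anything. The `Endpoint` shape, its `group` field, and both function names are assumptions made for illustration; they are not the CLI's actual data model or output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical view of one published endpoint (not the operator's real type)."""
    dns_name: str  # hostname published in the zone
    target: str    # address or CNAME target
    group: str     # group ID of the controller that owns the endpoint


def preview_group_removal(endpoints, active_groups, group_to_remove):
    """Return (remaining_active_groups, affected_endpoints); nothing is modified."""
    if group_to_remove not in active_groups:
        raise ValueError(f"group {group_to_remove!r} is not currently active")
    remaining = [g for g in active_groups if g != group_to_remove]
    # Endpoints owned by the removed group are the ones the change would touch.
    affected = sorted(
        (e for e in endpoints if e.group == group_to_remove),
        key=lambda e: (e.dns_name, e.target),
    )
    return remaining, affected


def format_preview(group_to_remove, remaining, affected):
    """Render a human-readable summary to show before asking for confirmation."""
    lines = [
        f"Removing active group {group_to_remove!r} would affect "
        f"{len(affected)} endpoint(s):"
    ]
    lines += [f"  {e.dns_name} -> {e.target}" for e in affected]
    lines.append("Remaining active groups: " + (", ".join(remaining) or "(none)"))
    return "\n".join(lines)
```

Keeping the preview as a pure function separate from the formatting step means the same result could back both the confirmation prompt and an E2E assertion.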

Tasks:

2. Group-aware observability

As an SRE, I want Prometheus metrics exposing each record's group assignment, active/inactive state, and inactive group cleanup operations, so that I can observe failover progress and build alerts.

Acceptance Criteria:
- Metric for current group assignment per DNSRecord.
- Metric for active/inactive state per DNSRecord.
- Metric for inactive group cleanup operations.
- Metrics documented with example Grafana queries or recording rules.
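As an illustration of what the first two metrics might look like when scraped, the sketch below renders per-record gauges in the Prometheus text exposition format using only the Python standard library. The metric names `dnsrecord_group_info` and `dnsrecord_group_active`, the label set, and the record dictionaries are all hypothetical; the real names and labels are still to be decided as part of this story. The cleanup-operations metric would most likely be a counter and is left out here.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge family; samples is a list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


def group_metrics(records, active_groups):
    """Build exposition text for group assignment and active state per record.

    records: dicts with 'name', 'namespace', and 'group' keys (hypothetical shape).
    active_groups: set of group IDs currently active.
    """
    assignment = [
        ({"name": r["name"], "namespace": r["namespace"], "group": r["group"]}, 1)
        for r in records
    ]
    active = [
        (
            {"name": r["name"], "namespace": r["namespace"]},
            1 if r["group"] in active_groups else 0,
        )
        for r in records
    ]
    return render_gauge(
        "dnsrecord_group_info",  # hypothetical metric name
        "Group assignment of each DNSRecord (value is always 1).",
        assignment,
    ) + render_gauge(
        "dnsrecord_group_active",  # hypothetical metric name
        "1 if the DNSRecord's group is active, 0 otherwise.",
        active,
    )
```

With gauges shaped like this, an alert on stalled failover could compare the active-state series across records, although the exact query depends on the names and labels finally chosen.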

3. Update documentation for tech preview

As a cluster operator, I want the failover docs to reflect the resolved known issues, updated reconcile behaviour, and available metrics, so that the documentation accurately represents the tech preview state of the feature.

Acceptance Criteria:
- Remove known issue #637 from migration docs once resolved.
- Update "dev preview" references to "tech preview".
- Document new metrics and observability options.
- Document CLI impact preview capability.