DNS Failover via Groups — Tech Preview #776

## User Experience

Today in developer preview, the user can fail over between clusters by running CLI commands to update the active groups TXT record. However, four gaps remain: group transitions can stall silently due to a premature reconcile bug; registry entries may be inaccurate after transitions; there is no way to preview the impact of removing a group before executing it; and no metrics exist to confirm that failover completed. As a result, the user has to poll DNSRecord statuses and inspect DNS directly.

In tech preview, these gaps are resolved: group status changes propagate immediately, registry entries stay accurate throughout transitions, and the CLI previews affected endpoints before a group removal is confirmed. Prometheus metrics expose group state and cleanup operations, giving the user clear observability into failover progress. The result is that when a cluster goes down, the user runs a single CLI command to activate the standby group, watches metrics and record statuses converge, and has confidence the failover completed cleanly.

## User Stories

### 1. Reliable group transitions with E2E validation

As a cluster operator, I want group status changes to propagate immediately, registry entries to stay accurate across active/inactive transitions, and the CLI to preview impact before removing a group, so that failover is reliable, predictable, and safe — validated by E2E tests against real DNS providers.

Acceptance Criteria:
- Premature reconcile check no longer blocks group/status updates (#664).
- Inactive controllers always write accurate groupID to provider registry (#637).
- `remove-active-group` CLI shows affected endpoints before execution (#663, #620).
- E2E tests cover group assignment, transitions, and cleanup against real providers (#665).

Tasks:
- [ ] #777 — Fix delegation bug hidden by premature reconcile check (5 points)
- [ ] #778 — Rework premature reconcile check to allow group/status propagation (3 points)
- [ ] #779 — Always write groupID to provider registry for inactive controllers (3 points)
- [ ] #780 — Integration tests for groupID registry write behaviour (2 points)
- [ ] #781 — Parse zone TXT registry records into group-aware data structure (3 points)
- [ ] #782 — CLI dry-run mode for remove-active-group (3 points)
- [ ] #783 — E2E tests for full group failover lifecycle (5 points)
- [ ] #784 — Validate group E2E tests across all supported providers (2 points)
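
The parsing and dry-run tasks (#781, #782) could be sketched roughly as follows. This is a minimal illustration in Python, not the operator's actual code. The TXT payload format (comma-separated `key=value` pairs with `owner`, `group`, and `targets` keys), the function names, and the sample record names are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical registry payload format, assumed for illustration only:
#   "owner=<id>,group=<groupID>,targets=<t1>;<t2>"


def parse_registry_txt(records):
    """Group endpoints by the group ID found in each TXT registry payload.

    `records` maps a DNS name to its TXT payload string.
    Returns {group_id: {dns_name: [targets...]}}.
    """
    by_group = defaultdict(dict)
    for dns_name, payload in records.items():
        fields = dict(
            part.split("=", 1) for part in payload.split(",") if "=" in part
        )
        group = fields.get("group", "")  # empty string means "no group recorded"
        targets = [t for t in fields.get("targets", "").split(";") if t]
        by_group[group][dns_name] = targets
    return dict(by_group)


def preview_group_removal(by_group, group_id):
    """Return the endpoints that removing `group_id` would affect.

    Read-only: nothing is modified, which is the point of a dry run.
    """
    affected = by_group.get(group_id, {})
    return sorted((name, tuple(targets)) for name, targets in affected.items())


# Made-up sample data using documentation-reserved names and addresses.
sample = {
    "api.example.com": "owner=c1,group=primary,targets=192.0.2.10",
    "web.example.com": "owner=c1,group=primary,targets=192.0.2.11;192.0.2.12",
    "api-standby.example.com": "owner=c2,group=standby,targets=198.51.100.10",
}
registry = parse_registry_txt(sample)
for name, targets in preview_group_removal(registry, "primary"):
    print(f"would remove {name} -> {', '.join(targets)}")
```

Separating the parse step from the preview step means the same group-aware structure could back both the dry-run output and the real removal, so the preview cannot drift from what is actually executed.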

### 2. Group-aware observability

As an SRE, I want Prometheus metrics exposing each record's group assignment, active/inactive state, and inactive group cleanup operations, so that I can observe failover progress and build alerts.

Acceptance Criteria:
- Metric for current group assignment per DNSRecord.
- Metric for active/inactive state per DNSRecord.
- Metric for inactive group cleanup operations.
- Metrics documented with example Grafana queries or recording rules.
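
As a rough picture of what the three metrics might look like when scraped, the sketch below renders sample values as Prometheus text exposition lines. The metric names (`dns_record_group_info`, `dns_record_group_active`, `dns_inactive_group_cleanups_total`), the label names, and the sample records are placeholders invented for this example, not names this issue commits to.

```python
def render_sample(name, labels, value):
    """Render one sample line in Prometheus text exposition format.

    Label values are assumed not to contain quotes, backslashes, or
    newlines, so no escaping is applied in this sketch.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"


def render_group_metrics(records, cleanups_by_group):
    """Build sample lines for the three metric kinds in the criteria."""
    lines = []
    for rec in records:
        base = {"namespace": rec["namespace"], "name": rec["name"]}
        # Group assignment: an "info"-style gauge that is always 1.
        lines.append(
            render_sample("dns_record_group_info", {**base, "group": rec["group"]}, 1)
        )
        # Active/inactive state: 1 when the record's group is active.
        lines.append(
            render_sample("dns_record_group_active", base, 1 if rec["active"] else 0)
        )
    # Cleanup operations: a counter per inactive group.
    for group, count in sorted(cleanups_by_group.items()):
        lines.append(
            render_sample("dns_inactive_group_cleanups_total", {"group": group}, count)
        )
    return lines


lines = render_group_metrics(
    records=[
        {"namespace": "prod", "name": "api", "group": "primary", "active": True},
        {"namespace": "prod", "name": "api-standby", "group": "standby", "active": False},
    ],
    cleanups_by_group={"standby": 3},
)
print("\n".join(lines))
```

With metrics shaped like this, an alert could for instance fire when a record's active gauge stays at 0 for longer than expected after the standby group is activated. The actual names and label sets are still to be decided as part of this story.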

### 3. Update documentation for tech preview

As a cluster operator, I want the failover docs to reflect the resolved known issues, updated reconcile behaviour, and available metrics, so that the documentation accurately represents the tech preview state of the feature.

Acceptance Criteria:
- Remove known issue #637 from migration docs once resolved.
- Update "dev preview" references to "tech preview".
- Document new metrics and observability options.
- Document CLI impact preview capability.
