Fix etcd client failover when an endpoint host crashes hard by jleal52 · Pull Request #3 · nixn/pdns-etcd3

jleal52 · 2026-04-25T06:53:24Z

Problem

When one of the configured ETCD endpoints becomes unreachable through a
hard failure (datacenter power loss, network partition without RST/FIN,
firewall drop), pdns-etcd3 can stop serving DNS even though a quorum
majority of the cluster is still healthy.

I hit this in production on a 3-node ETCD cluster spanning three DCs:
DC1 went down by complete power-off, leaving a healthy 2/3 quorum on
DC3+DC4. etcdctl confirmed both surviving endpoints reported
healthy=true and the leader was elected. Yet the PowerDNS authoritative
server backed by pdns-etcd3 kept returning SERVFAIL for every record
in the served zone until the dead endpoint was manually removed from the
backend config and the process restarted.

Root cause

The explicit clientv3.Config constructed in src/etcd.go only sets
DialTimeout and Endpoints:

cfg := clientv3.Config{
    DialTimeout: *args.DialTimeout,
    Endpoints:   strings.Split(*args.Endpoints, `|`),
}

This leaves three fields at their zero values, with the following
consequences when an endpoint host disappears without sending TCP
teardown:

DialKeepAliveTime / DialKeepAliveTimeout = 0 — gRPC never sends
HTTP/2 PINGs over the multiplexed session, so a connection stuck in
ESTABLISHED to a now-dead host is not detected as broken until the
kernel's tcp_retries2 expires (~13–15 min on Linux). The gRPC
balancer cannot mark the endpoint unhealthy until then, so RPCs keep
being scheduled on the blackholed connection.
PermitWithoutStream = false — even if keepalives were enabled,
they would only fire while an active RPC stream existed. For a backend
whose ETCD interaction is mostly an idle long-lived Watch plus
occasional Gets, this matters.
AutoSyncInterval = 0 — the client never refreshes the cluster
member list, so removed members stay in the rotation forever.

DialTimeout does not save us: it only guards the initial Dial, not
subsequent RPCs over the already-established session.

The pipe-mode failure mode amplifies the symptom into permanent
SERVFAIL: when populateData times out against the dead endpoint it
panics, recoverPanics does os.Exit(1), PowerDNS respawns the binary,
and the cycle repeats on every request.

Fix

Set sensible defaults on the explicit Config:

Field	Value	Why
`DialKeepAliveTime`	10s	gRPC pings every 10s on idle connections.
`DialKeepAliveTimeout`	5s	Window to declare a connection dead.
`PermitWithoutStream`	true	Pings fire even with no active RPC, important for the long-lived Watch case.
`AutoSyncInterval`	60s	Refresh member list from the cluster.

These match the values widely recommended in the etcd ecosystem (e.g.
the etcd operator and most production deployments).

The config-file branch (clientv3yaml.NewConfig) is intentionally left
untouched: those users declare these fields explicitly in their YAML and
should not be overridden.

Reproduction

3-node ETCD cluster, configure pdns-etcd3 with all three endpoints.
Power-off the host serving the first endpoint in the list (no
graceful shutdown, no RST).
Issue a DNS query backed by pdns-etcd3.

Without the fix: queries hang or return SERVFAIL for many minutes
until the kernel finally tears the dead TCP session down. With the fix:
the dead endpoint is detected within DialKeepAliveTime + DialKeepAliveTimeout
(~15s) and the balancer rotates to a healthy peer.

Notes

No new CLI flags. If you'd prefer them parameterized (e.g.
-dial-keepalive-time, -auto-sync-interval) I'm happy to follow up
with a separate PR matching the existing flag pattern.
Unit tests pass (go test -tags unit ./src). The behavior under
endpoint failure is hard to exercise in testcontainers without
network-namespace plumbing; I verified the production scenario
manually.
Upstream etcd v3.5.x recommends these settings explicitly in the
client docs.

When an endpoint host crashes hard (power loss, network partition without RST/FIN) the gRPC connection stays in TCP ESTABLISHED until the kernel times out (~13 min on Linux), so the balancer never marks the endpoint unhealthy and Get/Watch can hang against the dead peer. DialTimeout only guards the initial Dial, not subsequent RPCs over the multiplexed HTTP/2 session. Set sensible defaults on the explicit Config: - DialKeepAliveTime = 10s - DialKeepAliveTimeout = 5s - PermitWithoutStream = true (so keepalives fire even with no active RPC, important for the watch) - AutoSyncInterval = 60s (refresh member list from the cluster) Users supplying an etcd config file are unaffected: those values come from the YAML and the file branch is left untouched.

jleal52 · 2026-04-25T06:53:35Z

On a separate note: I'm sorry I never got around to finishing the ALIAS
implementation properly when I opened PR #2 — life got in the way and I
couldn't give it the attention it deserved. Thank you for picking it up
and implementing it cleanly in 5192689; the result is much better than
what I had drafted. Really appreciate the work you put into this
project.

nixn · 2026-04-25T09:02:41Z

Thank you for this PR and the investigation of the problem!

Setting the defaults looks good to me, but there is no point to define a defaultSomething without being able to override it by some runtime flag (not speaking of the config file). Therefore please add the appropriate CLI flags directly into this PR, they just belong here. It also enables to reproduce and test other settings (also the currently effective ones) after applying the PR.

For PermitWithoutStream you left out the default setting variable, but it should be handled the same way (together with a CLI flag).

It would be nice to have an integration test on this, thinking of just spinning up three ETCD containers about the same way how it is done currently (but with clustering set up), then killing the first configured one (as you described in the problem scenario). But it does not need to be part of this PR, just should be done at some time.

On the life thing, I know this kind of annoying circumstances, they constantly keep popping up and interrupting the developers' lives...

Follow-up to the previous commit: keep the same defaults but make every new field overridable both via a CLI flag (standalone mode) and via the remote-connection-string in pipe mode, matching the pattern already used by -timeout / timeout=. New parameters: -dial-keep-alive-time=<duration> (default 10s, 0 disables) -dial-keep-alive-timeout=<duration> (default 5s) -auto-sync-interval=<duration> (default 60s, 0 disables) -permit-without-stream=<bool> (default true) Also lift the inline PermitWithoutStream=true to a defaultPermitWithoutStream constant for consistency with the other defaults.

jleal52 · 2026-04-25T10:12:40Z

Pushed 28c8329 with the requested changes:

Added CLI flags -dial-keep-alive-time, -dial-keep-alive-timeout, -auto-sync-interval, -permit-without-stream, mirroring the existing pattern used for -timeout. They are also accepted via the pipe-mode connection string (dial-keep-alive-time=… etc.) by extending readParameters.
PermitWithoutStream now goes through a defaultPermitWithoutStream = true constant alongside the others, no more inline literal.
Defaults unchanged from the previous commit.
0 is accepted on the duration flags so the keepalive / auto-sync paths can be turned off explicitly to A/B-test settings, as you mentioned.

Quick verification: go build, go vet, gofmt -l and go test -tags unit ./src all clean locally.

On the integration test: agreed it would be nice to spin up three etcd containers in a real cluster and hard-kill the first endpoint to assert recovery time. Happy to do that as a follow-up PR rather than holding this one — let me know if you'd prefer otherwise.

The previous commit added new pointer fields to programArgs that are populated through flag.* in main(), so the standalone path is fine. The TestPipeRequests literal builds programArgs by hand and was missing the new fields, leading to a nil pointer dereference in etcdClient.Setup when the test bypasses main().

nixn · 2026-04-27T10:13:40Z

Please add documentation to README.md for the new parameters, too.

jleal52 · 2026-04-28T10:40:01Z

Pushed ece4808 with the README documentation:

Updated the pipe-mode remote-connection-string example to include dial-keep-alive-time, dial-keep-alive-timeout, auto-sync-interval and permit-without-stream.
Added an entry per parameter under the Parameters section, in the same style as the existing timeout entry: short description of what the knob does, why it matters (endpoint failover / cluster member discovery / idle-detection), how to disable it where applicable, and the default value.

Let me know if you'd like the wording or placement tweaked.

documented new etcd client parameters in README

ece4808

nixn merged commit f83ab4d into nixn:master Apr 28, 2026
24 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix etcd client failover when an endpoint host crashes hard#3

Fix etcd client failover when an endpoint host crashes hard#3
nixn merged 4 commits into
nixn:masterfrom
jleal52:fix/etcd-client-failover

jleal52 commented Apr 25, 2026

Uh oh!

jleal52 commented Apr 25, 2026

Uh oh!

nixn commented Apr 25, 2026

Uh oh!

jleal52 commented Apr 25, 2026

Uh oh!

nixn commented Apr 27, 2026

Uh oh!

jleal52 commented Apr 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jleal52 commented Apr 25, 2026

Problem

Root cause

Fix

Reproduction

Notes

Uh oh!

jleal52 commented Apr 25, 2026

Uh oh!

nixn commented Apr 25, 2026

Uh oh!

jleal52 commented Apr 25, 2026

Uh oh!

nixn commented Apr 27, 2026

Uh oh!

jleal52 commented Apr 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants