Fix etcd client failover when an endpoint host crashes hard #3
When an endpoint host crashes hard (power loss, network partition without
RST/FIN) the gRPC connection stays in TCP ESTABLISHED until the kernel
times out (~13 min on Linux), so the balancer never marks the endpoint
unhealthy and Get/Watch can hang against the dead peer. DialTimeout only
guards the initial Dial, not subsequent RPCs over the multiplexed HTTP/2
session.
Set sensible defaults on the explicit Config:
- DialKeepAliveTime = 10s
- DialKeepAliveTimeout = 5s
- PermitWithoutStream = true (so keepalives fire even with no
active RPC, important for the watch)
- AutoSyncInterval = 60s (refresh member list from the cluster)
Users supplying an etcd config file are unaffected: those values come
from the YAML and the file branch is left untouched.

On a separate note: I'm sorry I never got around to finishing the …

Thank you for this PR and the investigation of the problem! Setting the defaults looks good to me, but there is no point to define a […] For […]

It would be nice to have an integration test for this. I am thinking of spinning up three ETCD containers in much the same way as is done currently (but with clustering set up), then killing the first configured one (as you described in the problem scenario). It does not need to be part of this PR, but it should be done at some point.

On the life thing, I know these kinds of annoying circumstances; they constantly keep popping up and interrupting developers' lives...

Follow-up to the previous commit: keep the same defaults but make every
new field overridable both via a CLI flag (standalone mode) and via the
remote-connection-string in pipe mode, matching the pattern already used
by -timeout / timeout=.
New parameters:
- -dial-keep-alive-time=<duration> (default 10s, 0 disables)
- -dial-keep-alive-timeout=<duration> (default 5s)
- -auto-sync-interval=<duration> (default 60s, 0 disables)
- -permit-without-stream=<bool> (default true)
Also lift the inline PermitWithoutStream=true to a
defaultPermitWithoutStream constant for consistency with the other
defaults.

Pushed 28c8329 with the requested changes:
Quick verification:
On the integration test: agreed it would be nice to spin up three etcd containers in a real cluster and hard-kill the first endpoint to assert recovery time. Happy to do that as a follow-up PR rather than holding this one; let me know if you'd prefer otherwise.

The previous commit added new pointer fields to programArgs that are populated through flag.* in main(), so the standalone path is fine. The TestPipeRequests literal builds programArgs by hand and was missing the new fields, leading to a nil pointer dereference in etcdClient.Setup when the test bypasses main().

Please add documentation to README.md for the new parameters, too.

Pushed ece4808 with the README documentation:
Let me know if you'd like the wording or placement tweaked.

Problem
When one of the configured ETCD endpoints becomes unreachable through a
hard failure (datacenter power loss, network partition without RST/FIN,
firewall drop), pdns-etcd3 can stop serving DNS even though a quorum
(majority) of the cluster is still healthy.
I hit this in production on a 3-node ETCD cluster spanning three DCs:
DC1 went down by complete power-off, leaving a healthy 2/3 quorum on
DC3+DC4.
etcdctl confirmed both surviving endpoints reported healthy=true and the
leader was elected. Yet the PowerDNS authoritative server backed by
pdns-etcd3 kept returning SERVFAIL for every record in the served zone
until the dead endpoint was manually removed from the backend config and
the process restarted.
Root cause
The explicit clientv3.Config constructed in src/etcd.go only sets
DialTimeout and Endpoints. This leaves the following fields at their zero
values, with the following consequences when an endpoint host disappears
without sending TCP teardown:
- DialKeepAliveTime / DialKeepAliveTimeout = 0: gRPC never sends
  HTTP/2 PINGs over the multiplexed session, so a connection stuck in
  ESTABLISHED to a now-dead host is not detected as broken until the
  kernel's tcp_retries2 expires (~13–15 min on Linux). The gRPC
  balancer cannot mark the endpoint unhealthy until then, so RPCs keep
  being scheduled on the blackholed connection.
- PermitWithoutStream = false: even if keepalives were enabled,
  they would only fire while an active RPC stream existed. For a backend
  whose ETCD interaction is mostly an idle long-lived Watch plus
  occasional Gets, this matters.
- AutoSyncInterval = 0: the client never refreshes the cluster
  member list, so removed members stay in the rotation forever.
DialTimeout does not save us: it only guards the initial Dial, not
subsequent RPCs over the already-established session.
In pipe mode the symptom is amplified into permanent SERVFAIL: when
populateData times out against the dead endpoint it panics,
recoverPanics does os.Exit(1), PowerDNS respawns the binary, and the
cycle repeats on every request.
Fix
Set sensible defaults on the explicit Config:
- DialKeepAliveTime = 10s
- DialKeepAliveTimeout = 5s
- PermitWithoutStream = true
- AutoSyncInterval = 60s
These match the values widely recommended in the etcd ecosystem (e.g.
the etcd operator and most production deployments).
The config-file branch (clientv3yaml.NewConfig) is intentionally left
untouched: those users declare these fields explicitly in their YAML, and
their values should not be overridden.
Reproduction
[…] pdns-etcd3 with all three endpoints.
[…] graceful shutdown, no RST).
[…] pdns-etcd3.
Without the fix: queries hang or return SERVFAIL for many minutes
until the kernel finally tears the dead TCP session down. With the fix:
the dead endpoint is detected within
DialKeepAliveTime + DialKeepAliveTimeout (~15s) and the balancer rotates
to a healthy peer.
Notes
- […] -dial-keepalive-time, -auto-sync-interval) I'm happy to follow up
  with a separate PR matching the existing flag pattern.
- […] go test -tags unit ./src). The behavior under
  endpoint failure is hard to exercise in testcontainers without
  network-namespace plumbing; I verified the production scenario
  manually.
- […] client docs.