Skip to content

Fix etcd client failover when an endpoint host crashes hard#3

Merged
nixn merged 4 commits into
nixn:masterfrom
jleal52:fix/etcd-client-failover
Apr 28, 2026
Merged

Fix etcd client failover when an endpoint host crashes hard#3
nixn merged 4 commits into
nixn:masterfrom
jleal52:fix/etcd-client-failover

Conversation

@jleal52

@jleal52 jleal52 commented Apr 25, 2026

Copy link
Copy Markdown
Contributor

Problem

When one of the configured ETCD endpoints becomes unreachable through a
hard failure (datacenter power loss, network partition without RST/FIN,
firewall drop), pdns-etcd3 can stop serving DNS even though a quorum
majority of the cluster is still healthy.

I hit this in production on a 3-node ETCD cluster spanning three DCs:
DC1 went down by complete power-off, leaving a healthy 2/3 quorum on
DC3+DC4. etcdctl confirmed both surviving endpoints reported
healthy=true and the leader was elected. Yet the PowerDNS authoritative
server backed by pdns-etcd3 kept returning SERVFAIL for every record
in the served zone until the dead endpoint was manually removed from the
backend config and the process restarted.

Root cause

The explicit clientv3.Config constructed in src/etcd.go only sets
DialTimeout and Endpoints:

cfg := clientv3.Config{
    DialTimeout: *args.DialTimeout,
    Endpoints:   strings.Split(*args.Endpoints, `|`),
}

This leaves three fields at their zero values, with the following
consequences when an endpoint host disappears without sending TCP
teardown:

  • DialKeepAliveTime / DialKeepAliveTimeout = 0 — gRPC never sends
    HTTP/2 PINGs over the multiplexed session, so a connection stuck in
    ESTABLISHED to a now-dead host is not detected as broken until the
    kernel's tcp_retries2 expires (~13–15 min on Linux). The gRPC
    balancer cannot mark the endpoint unhealthy until then, so RPCs keep
    being scheduled on the blackholed connection.
  • PermitWithoutStream = false — even if keepalives were enabled,
    they would only fire while an active RPC stream existed. For a backend
    whose ETCD interaction is mostly an idle long-lived Watch plus
    occasional Gets, this matters.
  • AutoSyncInterval = 0 — the client never refreshes the cluster
    member list, so removed members stay in the rotation forever.

DialTimeout does not save us: it only guards the initial Dial, not
subsequent RPCs over the already-established session.

The pipe-mode failure mode amplifies the symptom into permanent
SERVFAIL: when populateData times out against the dead endpoint it
panics, recoverPanics does os.Exit(1), PowerDNS respawns the binary,
and the cycle repeats on every request.

Fix

Set sensible defaults on the explicit Config:

Field Value Why
DialKeepAliveTime 10s gRPC pings every 10s on idle connections.
DialKeepAliveTimeout 5s Window to declare a connection dead.
PermitWithoutStream true Pings fire even with no active RPC, important for the long-lived Watch case.
AutoSyncInterval 60s Refresh member list from the cluster.

These match the values widely recommended in the etcd ecosystem (e.g.
the etcd operator and most production deployments).

The config-file branch (clientv3yaml.NewConfig) is intentionally left
untouched: those users declare these fields explicitly in their YAML and
should not be overridden.

Reproduction

  1. 3-node ETCD cluster, configure pdns-etcd3 with all three endpoints.
  2. Power-off the host serving the first endpoint in the list (no
    graceful shutdown, no RST).
  3. Issue a DNS query backed by pdns-etcd3.

Without the fix: queries hang or return SERVFAIL for many minutes
until the kernel finally tears the dead TCP session down. With the fix:
the dead endpoint is detected within DialKeepAliveTime + DialKeepAliveTimeout
(~15s) and the balancer rotates to a healthy peer.

Notes

  • No new CLI flags. If you'd prefer them parameterized (e.g.
    -dial-keepalive-time, -auto-sync-interval) I'm happy to follow up
    with a separate PR matching the existing flag pattern.
  • Unit tests pass (go test -tags unit ./src). The behavior under
    endpoint failure is hard to exercise in testcontainers without
    network-namespace plumbing; I verified the production scenario
    manually.
  • Upstream etcd v3.5.x recommends these settings explicitly in the
    client docs.

When an endpoint host crashes hard (power loss, network partition without
RST/FIN) the gRPC connection stays in TCP ESTABLISHED until the kernel
times out (~13 min on Linux), so the balancer never marks the endpoint
unhealthy and Get/Watch can hang against the dead peer. DialTimeout only
guards the initial Dial, not subsequent RPCs over the multiplexed HTTP/2
session.

Set sensible defaults on the explicit Config:
  - DialKeepAliveTime    = 10s
  - DialKeepAliveTimeout = 5s
  - PermitWithoutStream  = true   (so keepalives fire even with no
                                   active RPC, important for the watch)
  - AutoSyncInterval     = 60s    (refresh member list from the cluster)

Users supplying an etcd config file are unaffected: those values come
from the YAML and the file branch is left untouched.
@jleal52

jleal52 commented Apr 25, 2026

Copy link
Copy Markdown
Contributor Author

On a separate note: I'm sorry I never got around to finishing the ALIAS
implementation properly when I opened PR #2 — life got in the way and I
couldn't give it the attention it deserved. Thank you for picking it up
and implementing it cleanly in 5192689; the result is much better than
what I had drafted. Really appreciate the work you put into this
project.

@nixn

nixn commented Apr 25, 2026

Copy link
Copy Markdown
Owner

Thank you for this PR and the investigation of the problem!

Setting the defaults looks good to me, but there is no point to define a defaultSomething without being able to override it by some runtime flag (not speaking of the config file). Therefore please add the appropriate CLI flags directly into this PR, they just belong here. It also enables to reproduce and test other settings (also the currently effective ones) after applying the PR.

For PermitWithoutStream you left out the default setting variable, but it should be handled the same way (together with a CLI flag).

It would be nice to have an integration test on this, thinking of just spinning up three ETCD containers about the same way how it is done currently (but with clustering set up), then killing the first configured one (as you described in the problem scenario). But it does not need to be part of this PR, just should be done at some time.

On the life thing, I know this kind of annoying circumstances, they constantly keep popping up and interrupting the developers' lives...

Follow-up to the previous commit: keep the same defaults but make every
new field overridable both via a CLI flag (standalone mode) and via the
remote-connection-string in pipe mode, matching the pattern already used
by -timeout / timeout=.

New parameters:
  -dial-keep-alive-time=<duration>      (default 10s, 0 disables)
  -dial-keep-alive-timeout=<duration>   (default 5s)
  -auto-sync-interval=<duration>        (default 60s, 0 disables)
  -permit-without-stream=<bool>         (default true)

Also lift the inline PermitWithoutStream=true to a defaultPermitWithoutStream
constant for consistency with the other defaults.
@jleal52

jleal52 commented Apr 25, 2026

Copy link
Copy Markdown
Contributor Author

Pushed 28c8329 with the requested changes:

  • Added CLI flags -dial-keep-alive-time, -dial-keep-alive-timeout, -auto-sync-interval, -permit-without-stream, mirroring the existing pattern used for -timeout. They are also accepted via the pipe-mode connection string (dial-keep-alive-time=… etc.) by extending readParameters.
  • PermitWithoutStream now goes through a defaultPermitWithoutStream = true constant alongside the others, no more inline literal.
  • Defaults unchanged from the previous commit.
  • 0 is accepted on the duration flags so the keepalive / auto-sync paths can be turned off explicitly to A/B-test settings, as you mentioned.

Quick verification: go build, go vet, gofmt -l and go test -tags unit ./src all clean locally.

On the integration test: agreed it would be nice to spin up three etcd containers in a real cluster and hard-kill the first endpoint to assert recovery time. Happy to do that as a follow-up PR rather than holding this one — let me know if you'd prefer otherwise.

The previous commit added new pointer fields to programArgs that are
populated through flag.* in main(), so the standalone path is fine. The
TestPipeRequests literal builds programArgs by hand and was missing the
new fields, leading to a nil pointer dereference in etcdClient.Setup
when the test bypasses main().
@nixn

nixn commented Apr 27, 2026

Copy link
Copy Markdown
Owner

Please add documentation to README.md for the new parameters, too.

@jleal52

jleal52 commented Apr 28, 2026

Copy link
Copy Markdown
Contributor Author

Pushed ece4808 with the README documentation:

  • Updated the pipe-mode remote-connection-string example to include dial-keep-alive-time, dial-keep-alive-timeout, auto-sync-interval and permit-without-stream.
  • Added an entry per parameter under the Parameters section, in the same style as the existing timeout entry: short description of what the knob does, why it matters (endpoint failover / cluster member discovery / idle-detection), how to disable it where applicable, and the default value.

Let me know if you'd like the wording or placement tweaked.

@nixn nixn merged commit f83ab4d into nixn:master Apr 28, 2026
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants