Leader election fails on managed Kubernetes clusters - need configurable timeouts #95

@cmer81

Hello :)

Description

I'm experiencing frequent crashes of the Doppler operator across 5 different OVH managed Kubernetes clusters. The operator loses leader election and restarts every few hours due to API server timeout issues.

Error Message

E1122 12:01:41.275002       1 leaderelection.go:361] Failed to update lock: Put "https://10.3.0.1:443/api/v1/namespaces/doppler-operator-system/configmaps/f39fa519.doppler.com": context deadline exceeded
I1122 12:01:41.275125       1 leaderelection.go:278] failed to renew lease doppler-operator-system/xxx.doppler.com: timed out waiting for the condition
2025-11-22T12:01:41.275Z    ERROR   setup   problem running manager {"error": "leader election lost"}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/log/deleg.go:144
main.main
    /workspace/main.go:103
runtime.main
    /usr/local/go/src/runtime/proc.go:250

Environment

  • Doppler Operator version: 1.5.1
  • Kubernetes version: 1.30.14
  • Cluster type: OVH Managed Kubernetes (affecting 5 different production clusters)
  • Pod Resources: 100m CPU / 256Mi RAM

What I've Tried

  • ✅ Reduced API server load by optimizing other operators (Velero sync periods from 1m → 10m)
  • ✅ Verified the operator has sufficient CPU/memory resources
  • ❌ Tried to increase leader election timeouts but the flags aren't exposed

Impact

The operator restarts every few hours, causing:

  • Brief interruptions in secret synchronization
  • Alert noise and operational overhead
  • Concerns about reliability in production

Suggested Fix

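Could you expose controller-runtime's leader election timings as command-line flags? The manager already accepts them through manager.Options (LeaseDuration, RenewDeadline, RetryPeriod), so flags like the following would let me test whether increasing the timeouts resolves the issue on managed Kubernetes platforms: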

--leader-elect-lease-duration (default: 15s)
--leader-elect-renew-deadline (default: 10s)
--leader-elect-retry-period (default: 2s)

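The current 10s renew deadline seems too aggressive for managed clusters where API server latency can occasionally spike: a single slow lock update, like the one in the log above, is enough to lose the lease and restart the operator. Being able to configure these values (e.g., 30s/20s/5s) would help determine whether this is just a timing issue or a deeper problem.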

Additional Context

This issue appears specific to managed Kubernetes environments where we don't control the API server performance.

Happy to provide more logs or help test a fix if needed!
