Hello :)
Description
I'm experiencing frequent crashes of the Doppler operator across 5 different OVH managed Kubernetes clusters. Every few hours the operator fails to renew its leader election lock because the request to the API server times out, so it loses leader election and restarts.
Error Message
E1122 12:01:41.275002 1 leaderelection.go:361] Failed to update lock: Put "https://10.3.0.1:443/api/v1/namespaces/doppler-operator-system/configmaps/f39fa519.doppler.com": context deadline exceeded
I1122 12:01:41.275125 1 leaderelection.go:278] failed to renew lease doppler-operator-system/xxx.doppler.com: timed out waiting for the condition
2025-11-22T12:01:41.275Z ERROR setup problem running manager {"error": "leader election lost"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/log/deleg.go:144
main.main
/workspace/main.go:103
runtime.main
/usr/local/go/src/runtime/proc.go:250
Environment
- Doppler Operator version: 1.5.1
- Kubernetes version: 1.30.14
- Cluster type: OVH Managed Kubernetes (affecting 5 different production clusters)
- Pod Resources: 100m CPU / 256Mi RAM
What I've Tried
- ✅ Reduced API server load by optimizing other operators (Velero sync periods from 1m → 10m)
- ✅ Verified the operator has sufficient CPU/memory resources
- ❌ Tried to increase leader election timeouts but the flags aren't exposed
Impact
The operator restarts every few hours, causing:
- Brief interruptions in secret synchronization
- Alert noise and operational overhead
- Concerns about reliability in production
Suggested Fix
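Could you expose controller-runtime's leader election timings as flags? The manager options already accept LeaseDuration, RenewDeadline, and RetryPeriod, so this would let me test whether increasing the timeouts resolves the issue on managed Kubernetes platforms. For example: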
--leader-elect-lease-duration (default: 15s)
--leader-elect-renew-deadline (default: 10s)
--leader-elect-retry-period (default: 2s)
The current 10s deadline seems too aggressive for managed clusters where API server latency can occasionally spike. Being able to configure these values (e.g., 30s/20s/5s) would help determine if this is just a timing issue or a deeper problem.
Additional Context
This issue appears specific to managed Kubernetes environments where we don't control the API server performance.
Happy to provide more logs or help test a fix if needed!