kubeoos

Kubernetes controller that automatically adds the node.kubernetes.io/out-of-service taint to NotReady nodes, enabling immediate pod evacuation and volume detachment.

Why

When a Kubernetes node goes down, pods are marked for deletion but remain stuck in Terminating because the kubelet is unavailable to process the deletion. Volumes stay attached. Stateful workloads can't restart elsewhere.

Kubernetes 1.28+ supports the node.kubernetes.io/out-of-service taint which, when applied to a node, tells Kubernetes to force-detach volumes and allow pods to be deleted — but no built-in controller applies this taint automatically.

kubeoos fills this gap: it watches for NotReady nodes and, after a configurable timeout, applies the out-of-service taint. When the node recovers, the taint is removed.

How It Works

Node goes NotReady
       │
       ▼
  Wait timeout (default 5m)
       │
       ▼
  Safety check: are too many nodes already tainted?
       │ no
       ▼
  Add node.kubernetes.io/out-of-service:NoExecute taint
       │
       ▼
  Kubernetes force-detaches volumes, deletes pods
       │
       ▼
  Pods reschedule on healthy nodes
       │
       ▼
  Node comes back Ready → taint removed automatically

Features

Configurable timeout — wait N minutes before declaring a node out of service (default: 5m)
Safety threshold — won't taint more than N% of nodes (default: 49%), preventing cluster-wide evacuation
Automatic recovery — removes the taint when a node returns to Ready
Leader election — run multiple replicas for high availability
Control plane aware — skips control plane nodes
Minimal footprint — single binary, ~10m CPU / 32Mi memory

Install

Helm

helm install kubeoos oci://ghcr.io/migetapp/kubeoos/charts/kubeoos \
  --namespace kube-system

Helm with custom values

helm install kubeoos oci://ghcr.io/migetapp/kubeoos/charts/kubeoos \
  --namespace kube-system \
  --set args.notReadyTimeout=3m \
  --set args.maxUnhealthyPercent=30

Configuration

Parameter	Default	Description
`args.notReadyTimeout`	`5m`	How long a node must be NotReady before tainting
`args.maxUnhealthyPercent`	`49`	Max % of worker nodes that can be tainted (safety)
`args.checkInterval`	`30s`	Reconciliation interval
`args.logLevel`	`2`	klog verbosity (0-5)
`replicaCount`	`2`	Number of replicas (leader election ensures only one is active)
`leaderElection.enabled`	`true`	Enable leader election for HA

Comparison

Tool	Approach	Reboots node?	Needs agent on failing node?	Needs IPMI/BMC?
kubeoos	Adds out-of-service taint from healthy node	No	No	No
SNR (medik8s)	Self-remediation via watchdog/reboot	Yes	Yes	No
FAR (medik8s)	Fence agent power-cycles node	Yes	No	Yes
kube-fencing	STONITH via fence agents	Yes	No	Yes
draino	Graceful drain (cooperative)	No	No	No

Requirements

Kubernetes 1.28+ (for NodeOutOfServiceVolumeDetach feature gate, GA since 1.28)

License

Apache License 2.0

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.github/workflows		.github/workflows
charts/kubeoos		charts/kubeoos
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
go.mod		go.mod
go.sum		go.sum
main.go		main.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

kubeoos

Why

How It Works

Features

Install

Helm

Helm with custom values

Configuration

Comparison

Requirements

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

kubeoos

Why

How It Works

Features

Install

Helm

Helm with custom values

Configuration

Comparison

Requirements

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages