Kubernetes: ensuring High Availability for Pods
Setting up High Availability for Kubernetes Pods with Deployment replicas, Pod Topology Spread Constraints, PodDisruptionBudget and annotations for Karpenter
We have a Kubernetes cluster where WorkerNodes are scaled by Karpenter, and Karpenter's NodePool has the disruption.consolidationPolicy=WhenUnderutilized
parameter set. This means that Karpenter will try to "consolidate" the placement of Pods on Nodes in order to maximize the use of CPU and Memory resources.
In general, everything works, but it means that WorkerNodes are sometimes recreated, which causes our Pods to be "migrated" to other Nodes.
So, the task now is to make sure that scaling and the consolidation process do not cause interruptions in the operation of our services.
Actually, this topic is not so much about Karpenter itself as it is about ensuring the stability of Pods in Kubernetes in general. But I faced this during Karpenter use, so we will talk a little about it as well.
Karpenter Disruption Flow
To better understand what’s happening with our Pods, let’s take a quick look at how Karpenter removes a WorkerNode from the pool. See Termination Controller.
After Karpenter has discovered that there are Nodes that need to be terminated, it:
adds a finalizer on the Kubernetes WorkerNode (see the check after this list)
adds the karpenter.sh/disruption:NoSchedule taint on such a Node so that Kubernetes does not schedule new Pods on this Node
if necessary, creates a new Node to which it will move the Pods from the Node that will be taken out of service (or uses an existing Node if it can accept additional Pods according to their requests)
performs Pod Eviction of the Pods from the Node (see Safely Drain a Node and API-initiated Eviction)
after all Pods except DaemonSets are removed from the Node, Karpenter deletes the corresponding NodeClaim
removes the finalizer from the Node, which allows Kubernetes to delete the Node
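If you catch a Node while Karpenter is disrupting it, both the taint and the finalizer are visible directly on the Node object. A minimal check, assuming the Node name from the examples below (the exact taint value and finalizer name may differ slightly between Karpenter versions):
# the NoSchedule taint that prevents new Pods from being scheduled on the Node
$ kubectl describe node ip-10-1-54-144.ec2.internal | grep 'Taints:' -A 3
# the finalizer that keeps the Node object until Karpenter finishes the cleanup
$ kubectl get node ip-10-1-54-144.ec2.internal -o jsonpath='{.metadata.finalizers}'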
Kubernetes Pod Eviction Flow
And briefly, how Kubernetes itself performs the Pod Eviction:
the API Server receives an Eviction request and checks whether this Pod can be evicted (for example, whether its eviction would violate the restrictions of a PodDisruptionBudget; we will talk about PodDisruptionBudgets later in this post)
marks the resource of this Pod for deletion
kubelet starts the graceful shutdown process, that is, sends the SIGTERM signal
Kubernetes removes the IP of this Pod from the list of endpoints
if the Pod has not stopped within the specified time (see the example below), kubelet sends a SIGKILL signal to kill the process immediately
kubelet sends a signal to the API Server that the Pod can be removed from the list of objects
the API Server removes the Pod from the database
See How API-initiated eviction works and Pod Lifecycle — Termination of Pods.
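The "specified time" between the SIGTERM and the SIGKILL is controlled by the terminationGracePeriodSeconds field of the Pod spec, 30 seconds by default. A minimal sketch based on the nginx-demo Deployment used below; the values here are just examples:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
    spec:
      # give the process up to 60 seconds to finish after SIGTERM
      # before kubelet sends SIGKILL (the default is 30 seconds)
      terminationGracePeriodSeconds: 60
      containers:
        - name: nginx-demo-container
          image: nginx:latest
          lifecycle:
            # optional: preStop runs before SIGTERM is sent, which gives
            # load balancers time to remove the Pod from their endpoints
            preStop:
              exec:
                command: ["sleep", "5"]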
Kubernetes Pod High Availability Options
So, what can we do with Pods to make our service work without interruption, regardless of Karpenter's activities?
have at least 2 Pods on critical services
have Pod Topology Spread Constraints so that Pods are placed on different WorkerNodes: then if the Node with the first Pod is killed, the second Pod on another Node will stay alive
have a PodDisruptionBudget so that at least 1 Pod is always alive — this will prevent Karpenter from evicting all the Pods at once, because it monitors compliance with the PDB
and to guarantee that Pod Eviction will not be performed at all, we can set the karpenter.sh/do-not-disrupt Pod annotation: then Karpenter will ignore this Pod (and, accordingly, the Node on which such a Pod is running)
Let’s take a look at these options in more detail.
Kubernetes Deployment replicas
The simplest and most obvious solution is to have at least 2 simultaneously working Pods.
Although this does not guarantee that Kubernetes will not evict them at the same time, it is a minimum condition for further actions.
So either run kubectl scale deployment nginx-demo-deployment --replicas=2 manually, or update the replicas field in a Deployment/StatefulSet/ReplicaSet (see Workload Resources):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
    spec:
      containers:
        - name: nginx-demo-container
          image: nginx:latest
          ports:
            - containerPort: 80
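After applying the manifest, it is worth checking that both replicas are actually up; a quick check for the Deployment from the example above (the AGE value is just an example):
$ kubectl get deployment nginx-demo-deployment
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
nginx-demo-deployment   2/2     2            2           2m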
Pod Topology Spread Constraints
I already wrote about this in the Pod Topology Spread Constraints post, but in short, we can set rules for placing Kubernetes Pods so that they are on different WorkerNodes. This way, when Karpenter wants to take one Node out of service, we will still have a Pod on another Node.
However, no one can prevent Karpenter from draining both Nodes at the same time, so this is not a 100% guarantee, but it is the second condition for ensuring the stability of our service.
In addition, with the Pod Topology Spread Constraints, we can specify the placement of Pods in different Availability Zones, which is a must-have option when building a High-Availability architecture.
So we add topologySpreadConstraints to our Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
    spec:
      containers:
        - name: nginx-demo-container
          image: nginx:latest
          ports:
            - containerPort: 80
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: nginx-demo
And now the two Pods will be scheduled on different WorkerNodes:
$ kk get pod -l app=nginx-demo -o json | jq '.items[].spec.nodeName'
"ip-10-1-54-144.ec2.internal"
"ip-10-1-45-7.ec2.internal"
Kubernetes PodDisruptionBudget
With the PodDisruptionBudget, we can set a rule for the minimum number of available or the maximum number of unavailable Pods. The value can be either a number or a percentage of the total number of Pods in the replicas field of a Deployment/StatefulSet/ReplicaSet.
In the case of a Deployment that has two Pods spread over different WorkerNodes with topologySpreadConstraints, this ensures that Karpenter will not perform a Node Drain on both WorkerNodes at the same time. Instead, it will "relocate" one Pod first, kill its Node, and only then repeat the process for the other Node.
See Specifying a Disruption Budget for your Application.
Let's create a PDB for our Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
    spec:
      containers:
        - name: nginx-demo-container
          image: nginx:latest
          ports:
            - containerPort: 80
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: nginx-demo
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-demo-pdb
spec:
  minAvailable: 50%
  selector:
    matchLabels:
      app: nginx-demo
Deploy and check:
$ kk get pdb
NAME             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nginx-demo-pdb   50%             N/A               1                     21s
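To see the PDB in action, you can try to drain both Nodes manually: with minAvailable: 50% and two Pods, only one of them can be evicted at a time, and the second eviction keeps being retried until the first Pod is Running again on another Node. A sketch, assuming the Node names from the output above (Karpenter uses the same Eviction API, so it hits the same limit during consolidation):
$ kubectl drain ip-10-1-54-144.ec2.internal ip-10-1-45-7.ec2.internal --ignore-daemonsets --delete-emptydir-data
...
error when evicting pods/"nginx-demo-deployment-..." -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.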
The karpenter.sh/do-not-disrupt annotation
In addition to the settings on the Kubernetes side, we can explicitly prohibit the deletion of a Pod by Karpenter itself by adding the karpenter.sh/do-not-disrupt annotation (previously, before the Beta, these were the karpenter.sh/do-not-evict and karpenter.sh/do-not-consolidate annotations).
This may be necessary, for example, for Pods that run as a single instance (like a VictoriaMetrics VMSingle instance) and that you do not want to be stopped.
To do this, add the annotation to the template of this Pod:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
      annotations:
        karpenter.sh/do-not-disrupt: "true"
    spec:
      containers:
        - name: nginx-demo-container
          image: nginx:latest
          ports:
            - containerPort: 80
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: nginx-demo
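Since Karpenter checks the annotation on the Pod objects themselves, it is worth confirming that it actually propagated from the Deployment template to the running Pod:
$ kk get pod -l app=nginx-demo -o yaml | grep 'do-not-disrupt'
      karpenter.sh/do-not-disrupt: "true"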
See Pod-Level Controls. In general, these seem to be the main solutions that help ensure the continuous operation of the Pods.
Originally published at RTFM: Linux, DevOps, and system administration.