Kubernetes: Liveness and Readiness Probes — Best practices
Tips for using livenessProbe, readinessProbe, and startupProbe in Kubernetes
Some useful tips on using Liveness and Readiness Probes in Kubernetes — the difference between them, and how to properly configure these checks.
To put it very briefly:
livenessProbe: is used by Kubernetes to know when to perform a Pod restart
readinessProbe: is used by Kubernetes to know when a container is ready to receive traffic, that is, when the corresponding Kubernetes Service can add this Pod to its routes
startupProbe: is used by Kubernetes to know when a container has started; the livenessProbe and readinessProbe checks will start executing only after a successful startupProbe check
So, livenessProbe is used to determine whether the process in the Pod is alive, readinessProbe is used to determine whether the service in the Pod is ready to receive traffic, and startupProbe is used to determine when to start executing the livenessProbe and readinessProbe checks.
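As a minimal sketch, all three probes can live side by side in a container spec. The image name, the /healthz and /ready paths, and port 8080 here are assumptions for illustration:

```yaml
# Hypothetical container spec fragment: image, paths, and ports are examples
containers:
  - name: app
    image: my-app:latest
    ports:
      - containerPort: 8080
    # delays the liveness/readiness checks until the app reports it has started
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 30     # allow up to 30 * 5s = 150s for startup
    # restarts the container if the process itself gets stuck
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    # removes the Pod from Service endpoints while it cannot serve traffic
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 10
```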
This post is based on three materials that I once saved and still use quite often:
Kubernetes production best practices: useful tips not only about Probes, but about Kubernetes in general
Kubernetes Liveness and Readiness Probes: How to Avoid Shooting Yourself in the Foot: good examples of creating Probes and how to avoid mistakes when working with them
Kubernetes: Best Practices for Liveness Probes: some nuances when working with Probes
And there are a few more links at the end of this post.
livenessProbe
The livenessProbe is needed when, for example, a process is stuck in a deadlock and cannot perform its tasks. Another example is a process that has entered an infinite loop and is using 100% of the CPU, unable to process requests from clients while still being connected to the Kubernetes network.
If you have a readinessProbe but no livenessProbe, then such a Pod will be disconnected from traffic, but will remain in the Running status and will continue to occupy CPU/Memory resources.
Requests to the livenessProbe are executed by the kubelet process on the same WorkerNode where the container is running, and after a restart, the Pod will be created on the same WorkerNode.
A process in a container should stop with an error code
The livenessProbe should not be a tool for responding to service errors: instead, the process should finish its execution with an error code, which will stop the container/Pod and create a new one. The livenessProbe is used only to check the status of the process itself in the container.
Split Liveness and Readiness Probes
It is a common practice to use one endpoint for the livenessProbe and readinessProbe, but to set a higher failureThreshold value for the livenessProbe: that is, to disconnect traffic from clients earlier, and only restart if things are really bad.
But these Probes have different purposes, and therefore, although it is acceptable to use the same endpoint, it is better to have different checks. In addition, if both checks fail, Kubernetes will restart a Pod and disconnect it from the network at the same time, which can lead to 502 errors for clients.
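The "same endpoint, different thresholds" approach can be sketched like this (the paths, port, and threshold values are illustrative assumptions, not recommendations):

```yaml
# Hypothetical fragment: liveness tolerates more failures than readiness
livenessProbe:
  httpGet:
    path: /live          # checks only that the process responds at all
    port: 8080
  periodSeconds: 10
  failureThreshold: 6    # restart only after ~60s of continuous failures
readinessProbe:
  httpGet:
    path: /ready         # may also verify the app can actually serve requests
    port: 8080
  periodSeconds: 10
  failureThreshold: 2    # cut traffic much earlier, after ~20s
```

With separate /live and /ready endpoints, the readiness check can fail (and stop traffic) without triggering a restart of a process that is otherwise alive.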
Avoid dependencies
Pods should not refer to each other or to external services when running the livenessProbe: your container should not perform database server availability checks, because if the database server is down, restarting your Pod will not help solve this problem.
Instead, you can create a separate endpoint for the monitoring system and perform such checks there - for alerts and dashboards in Grafana.
In addition, a process in a container should not crash if it cannot access a service it depends on. Instead, it should retry the connection, because Kubernetes expects that Pods can be started in any order.
Correct processing of the SIGTERM signal
A process in a container must correctly handle the SIGTERM signal: it is sent from the kubelet to containers when they need to be restarted, for example when the livenessProbe fails. If there is no response to the SIGTERM (because the process is "hanging"), then a SIGKILL will be sent.
Or the process may treat SIGTERM like SIGKILL and stop immediately without closing open TCP connections; see Kubernetes: NGINX/PHP-FPM graceful shutdown and 502 errors.
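One common mitigation is to give the process time to drain connections before SIGTERM escalates to SIGKILL. A sketch, where the sleep duration and grace period are assumptions to be tuned per application:

```yaml
# Hypothetical Pod spec fragment
spec:
  terminationGracePeriodSeconds: 60   # default is 30s; SIGKILL is sent after this
  containers:
    - name: app
      image: my-app:latest
      lifecycle:
        preStop:
          exec:
            # gives the Endpoints update time to propagate, so new requests
            # stop arriving before the container receives SIGTERM
            command: ["sleep", "5"]
```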
readinessProbe
The readinessProbe is needed so that requests are not sent to Pods that are still spinning up and are not ready to handle requests from users.

For example, if your Pod's startup takes 2 minutes (some kind of bootstrap, especially with a JVM, or loading a cache into memory), and you do not have a readinessProbe, then Kubernetes will start sending requests as soon as the Pod enters the Running status, and they will fail.
Checking dependencies
Unlike the livenessProbe, in the readinessProbe it may make sense to check the availability of services on which the Pod depends, because if the service cannot fulfill a client's request due to a lost database connection, there is no point in allowing traffic to this Pod.
However, keep in mind that the readinessProbe is executed continuously (every 10 seconds by default), and a separate database query will be executed for each such check.
But in general, it depends on your application: for example, if the database server is down but you can return responses from a local cache, then the app can continue to work and return a 503 error only for requests with write operations.
startupProbe
Since the startupProbe is executed only at the start of the Pod, this is where you can check connections to external services or cache access.
For example, it can be useful to check the database connection when you deploy a new version of the Helm Chart and have a Kubernetes Deployment with Rolling Update, but the new version has an error in the URL or password to the database server.
Also, the startupProbe can be useful to avoid increasing initialDelaySeconds for the livenessProbe and readinessProbe, and instead delay their start until the startupProbe has finished: if the livenessProbe does not have time to complete successfully when the container starts, Kubernetes will restart the Pod, even though it is still "warming up".
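For example, instead of setting a large initialDelaySeconds on the livenessProbe, a startupProbe can tolerate a slow warm-up. The numbers here are illustrative assumptions for a slow-starting (e.g. JVM) app:

```yaml
# Hypothetical app that may need up to 5 minutes to start
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 30 * 10s = up to 300s before the Pod is restarted
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10      # starts running only after the startupProbe succeeds
```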
Types of checks
In each Probe, we can use checks by:
exec: execute a command inside the container
httpGet: execute an HTTP GET request
tcpSocket: open a TCP connection to a port
grpc: make a gRPC request to a TCP port
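Each check type looks like this in a Probe definition; the fragments below are alternative examples (the commands, paths, and port numbers are assumptions):

```yaml
# exec: the check passes if the command exits with code 0
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]

# httpGet: the check passes on a 2xx/3xx HTTP response
# readinessProbe:
#   httpGet:
#     path: /ready
#     port: 8080

# tcpSocket: the check passes if the TCP connection can be opened
# livenessProbe:
#   tcpSocket:
#     port: 3306

# grpc: the check uses the gRPC Health Checking Protocol on the given port
# livenessProbe:
#   grpc:
#     port: 9090
```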
Parameters for Probes
All Probes have parameters that allow you to fine-tune the time of the checks:
initialDelaySeconds: delay between container start and the start of checks
periodSeconds: how often, after initialDelaySeconds, to make status-check requests
timeoutSeconds: how long to wait for a response to a request
failureThreshold: how many failed responses must be received to consider the check failed (that is, how many times to repeat the check before restarting the Pod or disconnecting it from the network)
successThreshold: similarly, but to consider the check passed
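Putting the parameters together in one probe, with illustrative values (the endpoint and timings are assumptions, not defaults to copy):

```yaml
# Hypothetical readinessProbe with all timing parameters annotated
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5   # wait 5s after container start
  periodSeconds: 10        # then check every 10s
  timeoutSeconds: 1        # each request must respond within 1s
  failureThreshold: 3      # 3 failures in a row: Pod removed from endpoints
  successThreshold: 1      # 1 success: Pod added back to endpoints
```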
Useful links
Configure Liveness, Readiness and Startup Probes — official documentation
Originally published at RTFM: Linux, DevOps, and system administration.