Kubernetes: Liveness and Readiness Probes — Best practices

Tips for using livenessProbe, readinessProbe, and startupProbe in Kubernetes

Some useful tips on using Liveness and Readiness Probes in Kubernetes — the difference between them, and how to properly configure these checks.

To put it very briefly:

  • livenessProbe: used by Kubernetes to know when to restart a container in a Pod

  • readinessProbe: used by Kubernetes to know when a container is ready to receive traffic, that is, when the corresponding Kubernetes Service can add this Pod to its routes

  • startupProbe: used by Kubernetes to know when a container has started, so that the livenessProbe and readinessProbe checks can begin

  • livenessProbe and readinessProbe will start executing only after a successful startupProbe check

So, livenessProbe is used to determine whether the process in the Pod is alive, readinessProbe is used to determine whether the service in the Pod is ready to receive traffic, and startupProbe is used to determine when to start executing the livenessProbe and readinessProbe.
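As a minimal sketch, a Pod spec with all three Probes might look like this (the image name and the /healthz and /ready endpoints here are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: example/app:latest    # placeholder image
      ports:
        - containerPort: 8080
      # runs first; the other two Probes start only after it succeeds
      startupProbe:
        httpGet:
          path: /healthz           # hypothetical endpoint
          port: 8080
        failureThreshold: 30
        periodSeconds: 10
      # restart the container if the process is stuck
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
      # remove the Pod from Service endpoints while it is not ready
      readinessProbe:
        httpGet:
          path: /ready             # hypothetical endpoint
          port: 8080
        periodSeconds: 10
```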

This post is based on three materials that I saved a while ago and still use quite often.

There are also a few more links at the end of this post.

livenessProbe

The livenessProbe is needed when, for example, a process is stuck in a deadlock and cannot perform its tasks. Another example is a process that has entered an infinite loop and is using 100% of the CPU, unable to process requests from clients, while the Pod is still connected to the Kubernetes network and keeps receiving traffic.

If you have a readinessProbe but no livenessProbe, then such a Pod will be disconnected from traffic, but will remain in the Running status, and will continue to occupy CPU/Memory resources.

Requests to the livenessProbe are executed by the kubelet process on the same WorkerNode where the container is running, and after a restart, the container will be re-created on the same WorkerNode.

A process in a container should stop with an error code

livenessProbe should not be a tool for reacting to errors in the service: instead, the process itself should exit with an error code, which will stop the container, and Kubernetes will create a new one.

livenessProbe is used only to check the status of the process itself in the container.

Split Liveness and Readiness Probes

It is a common practice to use one endpoint for the livenessProbe and readinessProbe, but to set a higher failureThreshold for the livenessProbe: that way, the Pod is disconnected from traffic and clients earlier, and restarted only if things are really bad.

But these Probes have different purposes, so although using the same endpoint is acceptable, it is better to have separate checks. In addition, if both checks fail at the same time, Kubernetes will restart the container and disconnect the Pod from the network simultaneously, which can lead to 502 errors for clients.
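A sketch of such split checks, assuming the app exposes two hypothetical endpoints: /healthz/live checks only that the process is alive, while /healthz/ready also verifies that the app can actually serve traffic:

```yaml
livenessProbe:
  httpGet:
    path: /healthz/live      # checks only the process itself
    port: 8080
  periodSeconds: 10
  failureThreshold: 6        # restart only if things are really bad
readinessProbe:
  httpGet:
    path: /healthz/ready     # also checks the ability to serve traffic
    port: 8080
  periodSeconds: 10
  failureThreshold: 3        # disconnect from traffic earlier
```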

Avoid dependencies

Pods should not check each other or external services in their livenessProbe: your container should not verify database server availability, because if the database server is down, restarting your Pod will not solve that problem.

Instead, you can create a separate endpoint for the monitoring system and perform such checks there, using the results for alerts and dashboards in Grafana.

In addition, a process in a container should not crash if it cannot reach a service it depends on. Instead, it should retry the connection, because Kubernetes expects that Pods can be started in any order.

Correct processing of the SIGTERM signal

A process in a container must correctly handle the SIGTERM signal: the kubelet sends it to containers when they need to be restarted, for example when the livenessProbe fails. If the process does not react to the SIGTERM (because it is "hanging"), a SIGKILL will be sent after the termination grace period.

Another failure mode is when a process treats SIGTERM like a SIGKILL and stops immediately, without closing open TCP connections; see Kubernetes: NGINX/PHP-FPM graceful shutdown and 502 errors.
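On the manifest side, a common pattern is to give the application enough time to shut down gracefully; the values below are illustrative:

```yaml
spec:
  # how long the kubelet waits between SIGTERM and SIGKILL (default: 30 seconds)
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      image: example/app:latest    # placeholder image
      lifecycle:
        preStop:
          exec:
            # the hook runs before SIGTERM is sent, giving time for the Pod
            # to be removed from Service endpoints before connections drop
            command: ["sleep", "5"]
```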

readinessProbe

The readinessProbe is needed to avoid sending requests to Pods that are still spinning up and are not yet ready to handle requests from users.

For example, if your Pod startup takes 2 minutes (some kind of bootstrap, especially if it is a JVM, or loading a cache into memory), and you do not have a readinessProbe, then Kubernetes will start sending requests as soon as the Pod enters the Running status, and those requests will fail.
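A minimal readinessProbe sketch for such a slow-starting service (the /ready endpoint and the timing values are assumptions; a startupProbe, described below, is often the cleaner way to cover the warm-up phase):

```yaml
readinessProbe:
  httpGet:
    path: /ready             # hypothetical endpoint
    port: 8080
  initialDelaySeconds: 60    # skip checks during the known warm-up phase
  periodSeconds: 10
  failureThreshold: 3
```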

Checking dependencies

Unlike the livenessProbe, in the readinessProbe it may make sense to check the availability of services that the Pod depends on: if the service cannot fulfill client requests because it has no connection to the database, there is no point in allowing traffic to this Pod.

However, keep in mind that the readinessProbe is executed continuously (every 10 seconds by default), and a separate database query will be executed for each such check.

But in general, it depends on your application. For example, if the database server is down but you can return responses from a local cache, then the app can continue to work and return a 503 error only for requests with write operations.

startupProbe

Since the startupProbe is executed only while the Pod is starting, this is the place where you can check connections to external services or cache access.

For example, it can be useful to check the database connection when you deploy a new version of a Helm chart with a Kubernetes Deployment using the RollingUpdate strategy: if the new version has an error in the database URL or password, the new Pods will never become ready, and the rollout will stop instead of replacing the healthy old Pods.

Also, a startupProbe can be useful to avoid increasing initialDelaySeconds for the livenessProbe and readinessProbe: instead, their start is simply delayed until the startupProbe succeeds. Without it, if the livenessProbe does not have time to pass while the container is starting, Kubernetes will restart the Pod even though it is still "warming up".
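A sketch of this pattern (the /healthz endpoint and port are placeholders): the application gets up to failureThreshold × periodSeconds = 300 seconds to start, and the livenessProbe begins only after the first successful startup check:

```yaml
startupProbe:
  httpGet:
    path: /healthz           # hypothetical endpoint
    port: 8080
  failureThreshold: 30       # up to 30 attempts...
  periodSeconds: 10          # ...every 10 seconds = up to 300 seconds to start
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10          # no large initialDelaySeconds needed
```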

Types of checks

In each Probe, we can use the following types of checks:

  • exec: execute a command inside the container (exit code 0 means success)

  • httpGet: perform an HTTP GET request (a response code in the 2xx or 3xx range means success)

  • tcpSocket: open a TCP connection to a port

  • grpc: make a gRPC health-check request to a TCP port
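Sketches of each check type; the command, paths, and ports are placeholders:

```yaml
# exec: run a command in the container; exit code 0 means success
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]
---
# httpGet: a response code in the 2xx/3xx range means success
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
---
# tcpSocket: success if a TCP connection to the port can be opened
livenessProbe:
  tcpSocket:
    port: 5432
---
# grpc: the application must implement the gRPC Health Checking Protocol
livenessProbe:
  grpc:
    port: 9090
```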

Parameters for Probes

All Probes have parameters that allow you to fine-tune the timing of the checks:

  • initialDelaySeconds: the delay between container start and the start of checks (default: 0)

  • periodSeconds: how often, after initialDelaySeconds, to make status-check requests (default: 10)

  • timeoutSeconds: how long to wait for a response to a request (default: 1)

  • failureThreshold: how many failed responses must be received in a row to consider the check failed, that is, how many times to repeat the check before restarting the container or disconnecting the Pod from the network (default: 3)

  • successThreshold: similar, but for how many consecutive successes are needed to consider the check passed (default: 1; must be 1 for livenessProbe and startupProbe)
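An annotated example with all of these parameters together (the endpoint and values are illustrative):

```yaml
readinessProbe:
  httpGet:
    path: /ready             # hypothetical endpoint
    port: 8080
  initialDelaySeconds: 5     # wait 5 seconds after container start
  periodSeconds: 10          # then check every 10 seconds
  timeoutSeconds: 3          # a check fails if there is no response within 3 seconds
  failureThreshold: 3        # 3 failures in a row: stop sending traffic to the Pod
  successThreshold: 1        # 1 successful check: start sending traffic again
# worst case, an unhealthy Pod keeps receiving traffic for about
# periodSeconds * failureThreshold = 30 seconds before it is disconnected
```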


Originally published at RTFM: Linux, DevOps, and system administration.