Horrendous health checks and their unhealthy hindrances

If your service is deployed on a cloud-native platform like Kubernetes or AWS ECS, it's probably tempting to bloat your health checks. After all, if your service can't contact its database then it's not healthy! Don't. That's basically the whole post: Don't. In this post we'll use the database as an example, but the same could apply to almost any dependency.

Here's why:

Our Super Simple Service™

[Diagram: happy users → load balancer → web instances → database]

It has:

  1. a load balancer
  2. some web instances to serve traffic
  3. a database
  4. happy users

Maybe you've got some fancy database sharding, maybe you've got fleets of background workers, maybe you've got several layers of reverse proxies and an edge provider - we're leaving all that aside for the time being. Those web instances are probably ECS tasks or k8s pods or whatever else you're using.

Dynamic service capacity

You want to save some money, and you also want to meet demand when it increases, so you probably have some form of horizontal scaling built in. For example: you know that a single instance running your service can handle 500 requests per minute, so when the request count per target metric reaches 400 the autoscaler adds another instance. The load balancer now distributes the load across more instances, each instance stays under that 500 requests per minute threshold, and there's enough headroom to absorb small point-load spikes.
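To make the arithmetic concrete, here's a minimal sketch of what a request-count-per-target policy is doing under the hood. The numbers are the ones from above; the method and constant names are made up for illustration and aren't any provider's actual API:

# Rough sketch of request-count-per-target scaling - illustrative only.
TARGET_RPM_PER_INSTANCE = 400 # scale out well before the 500 rpm per-instance ceiling

def desired_instance_count(total_requests_per_minute)
  # Keep the average per-instance request rate at or below the target.
  [(total_requests_per_minute.to_f / TARGET_RPM_PER_INSTANCE).ceil, 1].max
end

desired_instance_count(1_200) # => 3 (each instance sees ~400 rpm)
desired_instance_count(1_700) # => 5 (four instances would see ~425 rpm each)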

You probably also have automatic capacity replacement for resiliency. Your cloud provider will constantly look at the number of healthy instances you have and compare that metric to:

  1. your minimum healthy instance count
  2. your maximum healthy instance count
  3. your current request count per target levels

If they don't tally, the cloud provider will add or remove instances as needed until everything meets expectations again. Even if you have no autoscaling, you will almost certainly have a minimum healthy instance count.
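In pseudo-Ruby, that reconciliation loop looks roughly like this - a sketch of the idea only, with made-up helper methods (launch_instances, terminate_instances), not any provider's actual logic:

# Sketch of the capacity-replacement loop - illustrative only.
def reconcile(healthy_count, min_healthy, max_healthy, desired_by_autoscaling)
  # Whatever the autoscaler asks for is clamped to the configured bounds.
  desired = desired_by_autoscaling.clamp(min_healthy, max_healthy)

  if healthy_count < desired
    launch_instances(desired - healthy_count)     # replace lost or unhealthy capacity
  elsif healthy_count > desired
    terminate_instances(healthy_count - desired)  # scale in
  end
end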

The Health Checks

Let's assume for the time being that you haven't taken the advice in the first paragraph, and your health check for an instance includes checking that the instance can access the database. This was probably for what felt like a good reason: if the instance can't connect to the database then it can't properly serve user traffic. Quick! Stop the load balancer from sending traffic there and get rid of it so it can be replaced by the capacity replacement mentioned above! You probably implemented this as an HTTP endpoint in your service called something like /service/health, which does things like select(1) against the database to prove it can successfully connect.

require 'sinatra'
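# DB_CONNECTION_POOL, LOGGER and DatabaseConnectionError below stand in for
# whatever connection pool, logger and error class your application already uses.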

get '/service/health' do
  DB_CONNECTION_POOL.get_connection.query('select (1)')
  status 200
  'up'
rescue DatabaseConnectionError => e
  LOGGER.error("Cannot connect to DB: #{e.message}")
  status 500
  'down'
end

All is well. That's boring, so let's wake up our on-call engineer:

[Diagram: the same setup, but now the database is on fire 🔥🔥🔥]

Death by Health Check

Something has happened to our database and the application is experiencing errors connecting to it. Maybe it's a network misconfiguration, maybe an outage with the service provider, maybe even something as mundane as a failover causing connection issues for a few minutes. Either way, it's probably either easily fixable or it resolves itself in a few minutes. We'll do an incident post-mortem the next day but a few minutes of user disruption isn't the end of the world.

That should be the end of the story, but unfortunately the instance health checks are coupled to the database connection... So this happens instead:

[Diagram: web instances dead ☠️☠️☠️ behind the load balancer, database still on fire 🔥🔥🔥]

The issue was resolved quickly, why are our instances dead? What's happened?

  1. Our application experiences errors opening new database connections
  2. The application health checks fail during the small database outage - not all of them, but enough to do significant damage
  3. The scheduler takes those instances out of service
  4. Our database recovers, but the damage is already done
  5. Our capacity drops and, although the database is back, we no longer have enough instances to serve user traffic
  6. Users experience "Bad Gateway", "Gateway Timeout" errors and the like
  7. Users all spam refresh on the page, leading to a large point-load on the service
  8. The scheduler sees that there aren't enough instances to meet the minimum healthy instance count requirement and spins more up
  9. The few instances now available are too busy coping with all the extra load dumped on them to respond positively to health checks
  10. The scheduler sees failed health checks for these instances and takes them out of service too
  11. Some new instances come into service
  12. Go back to point 9

The service will stabilise eventually, probably because load drops off enough for the instances the scheduler spins up to cope, and traffic then builds back gradually - no point load, the average request count per target climbs slowly, and more instances slowly come into service. But that's a lot of grief, and recovery took far longer than it should have done considering the database itself recovered back in point 4. There will probably be a blameless incident post-mortem in which blame will probably be implied.

What's to be done?

The reality is that if instances are failing to connect to the database we still want to know about it. The odd failure will happen occasionally because networks are fickle beasts, but if lots happen at once we definitely want to know. We still want the on-call engineer to be dragged away from whatever AWS conference he's drinking himself silly at while representing the company.

Separate your health checks into different endpoints, often called a liveness check and a readiness check - the former checks whether the application process is running at all, the latter checks dependencies such as the database. If enough of either fail then PagerDuty should definitely be firing, but we don't want the scheduler to kill instances based on the readiness check - only failing the liveness check a few times in a row is worthy of capital punishment. If your setup is fancy enough, the load balancer can stop sending requests to an instance failing the readiness check until it succeeds again, but it shouldn't get trigger-happy.

require 'sinatra'

get '/service/health/live' do
  status 200
  'up'
end

get '/service/health/ready' do
  DB_CONNECTION_POOL.get_connection.query('select (1)')
  status 200
  'up'
rescue DatabaseConnectionError => e
  LOGGER.error("Cannot connect to DB: #{e.message}")
  status 500
  'down'
end
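
On the other side of the contract, the component with the power to kill instances should only act on repeated liveness failures, while readiness only controls routing and alerting. Here's a rough sketch of that policy - the helper methods and threshold are made up for illustration, since real schedulers configure this declaratively rather than in code:

# Sketch of how an orchestrator might treat the two checks - illustrative only.
LIVENESS_FAILURES_BEFORE_REPLACEMENT = 3

def check_instance(instance)
  if http_ok?(instance, '/service/health/live')
    instance.consecutive_liveness_failures = 0
  else
    instance.consecutive_liveness_failures += 1
    # Only a run of liveness failures is worthy of capital punishment.
    if instance.consecutive_liveness_failures >= LIVENESS_FAILURES_BEFORE_REPLACEMENT
      replace_instance(instance)
    end
  end

  # Readiness only controls routing (and should feed alerting) - never termination.
  if http_ok?(instance, '/service/health/ready')
    load_balancer.enable(instance)
  else
    load_balancer.drain(instance)
  end
end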