What should the HTTP Status Code of a Degraded Health Check Be?

I would be wary of splitting hairs like this on a healthcheck on the upstream server side. The service providing the healthcheck should be lightly (and concurrently) testing all its upstream dependencies based on its own set of policies or rules - request timeouts, connection failures and so on. In reality the healthcheck either works or it doesn't and the application shouldn't really need to be keeping track of the results of the healthcheck (other than capturing metrics about what happened). IMHO a stateful healthcheck is a recipe for disaster.

I typically use the following interface for application healthchecks:

204 - No Content, everything is working within tolerences

500 - Something failed, and here's some details in the response about what went wrong

Where it gets tricky depends on your architecture. You may have a VIP or reverse proxy that is interpreting this response and deciding if a given node is healthy or not, in which case it's going to either route the request to a healthy node or return the 503 Service Unavailable. This decision is going to made on some policy basis - x healthcheck requests failed over a y time period across z upstream services.

If you use a mesh then everyone can feed data back to the service registry to keep the health state up to date and it can be based on actual service calls rather than a healthcheck.

The client is perfectly placed to make a decision based on the health of services it depends on as they can keep track of the various responses from the service. Circuit breakers are an excellent way to handle that and can do it continuously on actual requests rather than just on the healthcheck. Circuit breaker libraries (such as resilience4j) will do this for you at the cost of setting up some policies about how many failed/slow requests constitute a bad service. Service Registrys like the netflix eureka can help with the discovery and ongoing monitoring.

The most suitable HTTP status code for a "Degraded" status response from a health endpoint is nothing other than 200 OK.

I say this because I can't find any better code in the official Hypertext Transfer Protocol (HTTP) Status Code Registry maintained by IANA, pointed to by [RFC7231] HTTP/1.1: Semantics and Content. Unofficial codes should be avoided, because they only make your API more difficult to understand.

You should design your APIs so that they become easy to use. Resource names, HTTP verbs, status codes, etc. should be more or less self-explanatory, so that people who already know "the REST language" can immediately understand how to use your API without having to decipher vague names or unusual status codes. Which brings me to the next part of my answer...

Other comments on your design

The most natural way to interpret a 5xx response to any request is that the operation in question failed.

So a 503 Service Unavailable response to a GET /status request means that the status checking operation itself failed. Such a response would only be useful if we can be certain that /status is a health endoint, as pointed out in the API Health Check draft referred to in Nkosi's answer:

A health endpoint is only meaningful in the context of the component it indicates the health of. It has no other meaning or purpose. As such, its health is a conduit to the health of the component. Clients SHOULD assume that the HTTP response code returned by the health endpoint is applicable to the entire component (e.g. a larger API or a microservice).

But with a URL path of just /status, it is not completely obvious that this really is a health endpoint. From looking at the URL, we only know that it returns information about the status of something, but we can't really be sure what that "something" is.

Since you're also telling us that yes, it is in fact a health endpoint, I must suggest that you change the name to health. I would also suggest placing it under some base path, e.g. /things/health, to make it more clear which component it indicates the health of.

If, on the other hand, /status was actually a resource of it own, i.e. something that represents the status of some other component/thing (like its name currently suggests), then 200 OK is the only reasonable status for successful invocations, even if the thing that it indicates the status of is "Unhealthy". In that case, a 5xx would mean that no status could be obtained, and details in the response payload would be assumed to be related to a failure in the /status service itself.

So be careful with how you name things and what status codes you use!

Consider returning a custom code within the 2xx Success range that is not already taken within the known/common status codes. Similar to some of the unofficial codes not supported by any standard.

For example 218 This is fine (Apache Web Server)

Used as a catch-all error condition for allowing response bodies to flow through Apache when ProxyErrorOverride is enabled. When ProxyErrorOverride is enabled in Apache, response bodies that contain a status code of 4xx or 5xx are automatically discarded by Apache in favor of a generic response or a custom response specified by the ErrorDocument directive

After doing some research I came across a draft

Health Check Response Format for HTTP APIs: draft-inadarei-api-health-check-03

Where they also made similar suggestions

In case of the “warn” status, endpoints MUST return HTTP status in the 2xx-3xx range, and additional information SHOULD be provided, utilizing optional fields of the response.

where the warn status in the draft is healthy, with some concerns, which I believe aligns closely to your desired model.

While not definitive, I believe it provides some ideas to help with the eventual design.

What should the HTTP Status Code of a Degraded Health Check Be?

Tags:

Http

Health Monitoring

Http Status Codes

Http Status Code 503

Kubernetes Health Check

Related

Recent Posts