So far we’ve talked about the internal workings of the queue. But in real-world systems, you have multiple services that generate workloads for one another, each with its own queue and variable range of performance. Ideally, services will avoid creating excess workloads for other services downstream from them.
The way this is done is known as “backpressure.” This is the process whereby downstream services notify upstream systems when their queues are full. The upstream service then notifies the services upstream from it, and so forth.
The TCP protocol, for instance, generates backpressure with code 503. Thread pools can implement backpressure by simply limiting the size of the pool. Blocking code has backpressure by default on a per-thread basis, which back-propagates naturally through the blocking chain.
But systems like Kafka don’t have backpressure by default. Luckily there are several open source projects that implement backpressure in Kafka, so that the webserver knows when there’s an overload in the system.
The question then is what to do about it. It’s basically a trade-off between latency and error rate. If we don’t return errors, or find an alternative way to shed the excess load, then latency will inevitably increase.
One approach, as we’ve seen, is to cap queue size and shed any job over the limit. This can be done by implementing a queue at the webserver and having the server return an HTTP 503 response for any request over the cap. Alternatively, the server could present a cached version of the results or send the request to a fallback service. However it’s handled, the goal is to remove excess requests from the overloaded queue.