The phrase "Failure is not an option" is tossed about with much bravado, as though sheer strength of will could make something work. But the fact remains: things eventually fail. Everything. How, then, do you handle the inevitable failure of your microservices? Well, by combining containers, Kubernetes, Red Hat OpenShift, and Istio, we can skip over-the-top displays of swagger, let the system handle things, and get some sleep at night.
Once again, Istio provides the basis of a popular and well-tested technology: the Circuit Breaker pattern.
Like a circuit breaker in an electrical circuit, the software version allows flow to a service to be shut off. The circuit opens when the endpoint is not functioning correctly. The endpoint may have failed or may just be too slow, but either way it represents the same problem: this container is not working.
Lagging performance is especially troublesome: Not only can the delay cascade back through any calling services and cause the entire system to lag, but retrying against an already-slow service just makes it worse.
[Note: this is part four of a ten-week series. Part three is available here.]
The circuit breaker is a proxy that controls flow to an endpoint. If the endpoint fails or is too slow (based on your configuration), the proxy will open the circuit to the container. In that case, traffic is routed to other containers because of load balancing. The circuit remains open for a preconfigured sleep window (let's say two minutes), after which the circuit is considered "half-open". The next request attempted will determine whether the circuit moves to "closed" (where everything is working again) or reverts to "open", in which case the sleep window starts again. Here's a simple State Transition Diagram for the circuit breaker:
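For readers who prefer code to diagrams, the three states can also be modeled in a few lines of Python. This is only a toy sketch of the pattern, not how Istio implements it: the class name, method names, and the injectable clock are all assumptions made for illustration, and the 120-second default simply mirrors the two-minute sleep window used as an example above.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # traffic flows normally
    OPEN = "open"            # traffic to the endpoint is shut off
    HALF_OPEN = "half-open"  # the next request decides what happens


class CircuitBreaker:
    """Toy model of the closed -> open -> half-open cycle (hypothetical API)."""

    def __init__(self, sleep_window=120.0, clock=time.monotonic):
        self.sleep_window = sleep_window  # seconds the circuit stays open
        self.clock = clock                # injectable so the sketch is testable
        self.state = State.CLOSED
        self.opened_at = None

    def allow_request(self):
        """Return True if a request may be sent to the endpoint."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.sleep_window:
                # Sleep window elapsed: let a trial request through.
                self.state = State.HALF_OPEN
            else:
                return False
        # Simplification: this sketch does not limit how many requests
        # pass while half-open; a real breaker would allow only a trial.
        return True

    def record_success(self):
        # A success (including the half-open trial) closes the circuit.
        self.state = State.CLOSED
        self.opened_at = None

    def record_failure(self):
        # A failure, or a response judged too slow, opens the circuit
        # and (re)starts the sleep window.
        self.state = State.OPEN
        self.opened_at = self.clock()
```

A caller would check `allow_request()` before each call and then report the outcome with `record_success()` or `record_failure()`. With Istio, the sidecar proxy does this bookkeeping for you, which is the whole point.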
It's important to note that this is all at the system architecture level, so to speak. At some point your application will need to account for the circuit breaker pattern; common responses include providing a default value or (if possible) ignoring the existence of the service. The bulkhead pattern addresses this, but it's outside the scope of this post.
The Istio Circuit Breaker in Action
To start, I've launched two versions of the "recommendation" microservice into OpenShift. Version 1 is running normally, while version 2 has a built-in delay to mimic a slow server. Using the tool siege, we can observe the results:
siege -r 2 -c 20 -v customer-tutorial..nip.io
Everything is working, but at what cost? While 100 percent availability may seem at first glance to be a win, look closer. The longest transaction took over 12 seconds. That's not exactly speedy. We need to somehow avoid this bottleneck.
We can use Istio's circuit breaker functionality to avoid these slow containers. Here's an example of a configuration file that will implement the circuit breaker:
The last line, "httpMaxRequestsPerConnection", means that if a second connection is attempted against a container that already has an existing connection, the circuit will open. Because we've purposely made our container mimic a slow service, it will occasionally hit this condition. When that happens, Istio will return a 503 error. Here's a screen capture from another run using siege:
The Circuit Is Broken; Now What?
Without changing our source code, we are able to implement the circuit breaker pattern. Combining this with last week's blog post (Istio Pool Ejection), we can eliminate slow containers until they recover. In this example, a container is ejected for two minutes (the "sleepWindow" setting) before being reconsidered.
Note that your application's ability to respond to a 503 error is still a function of your source code. There are many strategies for handling an open circuit; which one you choose depends on your particular situation.
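As one illustration of such a strategy, here is a minimal Python sketch that falls back to a default value when the call comes back with a 503. Everything named here is hypothetical: the function names, the default string, and the idea that the caller is written in Python are assumptions for the example, not part of the tutorial's services.

```python
import urllib.error
import urllib.request

# Hypothetical default returned while the circuit is open.
DEFAULT_RECOMMENDATION = "no recommendation available"


def http_get(url, timeout=2.0):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def get_recommendation(url, fetch=http_get, default=DEFAULT_RECOMMENDATION):
    """Return the service's response, or a default value on a 503."""
    try:
        return fetch(url)
    except urllib.error.HTTPError as err:
        if err.code == 503:
            # The proxy refused the call (open circuit): degrade gracefully.
            return default
        raise  # any other HTTP error is not ours to hide
```

Only the 503 is swallowed; other errors still propagate, so a genuine bug in the downstream service is not silently masked by the fallback.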