Synchronous communication - circuit breaker and fallback
The circuit breaker is a communication pattern that helps avoid cascading failures and gives dependent services time to recover. Combined with fallback values defined by the developer, it provides a pretty wide safety net when a communication channel is broken.
The circuit breaker is not an IT invention; it’s widely used in electric circuits. The idea is pretty simple: if everything is OK, the circuit is closed and everything works as it’s supposed to. When something (power, voltage, etc.) reaches the threshold, the circuit breaker opens the circuit and stops the electric flow. (An electric circuit is the simplest association with its open/closed states you can get :))
In software development, a circuit breaker works in a very similar way and is usually used as a communication pattern. While the threshold is not reached, the circuit breaker is closed and requests are sent to the destination. Once the threshold is reached, the circuit breaker opens and no further traffic goes to the destination system. In self-healing systems there is one additional state, called half-open. In this state some probing requests are sent to the destination to check whether it’s responsive again and normal traffic can resume.
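To make the three states concrete, here is a minimal sketch in plain Java. It is not how Hystrix or resilience4j implement the pattern: the class name SimpleCircuitBreaker, the consecutive-failure threshold (real libraries typically use an error rate), the single probing call in half-open state and the injectable clock are all my assumptions, chosen to keep the example short. It is also not thread-safe.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical, single-threaded sketch of the closed / open / half-open cycle. */
class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // assumption: consecutive failures that open the circuit
    private final long openMillis;      // how long calls are rejected before a probe is allowed
    private final LongSupplier clock;   // injectable time source (milliseconds)

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    State getState() {
        return state;
    }

    /** Runs the call if the circuit permits it; returns the fallback otherwise or on failure. */
    <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < openMillis) {
                return fallback.get();   // still open: the destination is not touched at all
            }
            state = State.HALF_OPEN;     // wait time elapsed: let one probing call through
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;
            state = State.CLOSED;        // a success (including a successful probe) closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;      // threshold reached or probe failed: reject further calls
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1_000, System::currentTimeMillis);
        Supplier<String> broken = () -> { throw new IllegalStateException("HTTP 500"); };
        for (int i = 0; i < 3; i++) {
            String value = breaker.call(broken, () -> "fallback");
            System.out.println(value + " / " + breaker.getState());
        }
    }
}
```

In the small main above the second failure opens the circuit, so the third call returns the fallback without invoking the broken supplier at all.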
As you can see, the idea is not so complicated, and once you understand that open is the bad state everything else is relatively simple ;). If you also bother to configure sane fallback values for certain responses, it can hide network glitches or deployments from the end user pretty well. The idea itself might be simple but, as usual, the devil is in the details. Luckily we don’t have to implement it on our own. All we have to do is get familiar with the configuration options available in existing libraries. In this post, I’ll check how the circuit breaker implementations provided by Hystrix and resilience4j work.
To check how a circuit breaker works in a more real-life but controlled scenario, I need some kind of service that will fail on demand. To achieve this I’ve used a very simple service, BreakableService, which returns a fixed response and whose behavior I can change on demand. Once the problem of controlling the dependent service is out of the way, we can focus on the circuit breaker itself.
There is quite a lot happening in this sample. Line by line:
line 10-16 hystrix configuration - more options.
line 20-25 fallback value configured for hystrix and feign duo.
line 27-33 configure service to fail with 500 error and hit it until circuit breaker opens.
line 35-36 verify that returned value is from fallback not actual server response.
line 39-44 make service healthy again and wait until circuit breaker closes.
line 45-46 verify if everything works as expected with circuit breaker closed.
I’ve prepared more samples related to the error threshold that triggers the circuit breaker state change:
When using the feign-hystrix library you should be aware that Hystrix is not only a circuit breaker. Hystrix offers a lot of functionality that helps to mitigate some of the risks related to network communication. This includes:
All Hystrix*Commands are executed in separate threads, grouped into pools which you define (pool_1 for service A, pool_2 for service B); it’s also possible to use semaphores instead
Timeouts for commands that run too long
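The two points above can be illustrated with nothing but java.util.concurrent. This is a rough sketch of the idea, not Hystrix code: the IsolatedCommand class, its constructor parameters and the bounded queue are assumptions I made for the example. Each dependency gets its own small pool, the caller waits for at most the configured timeout, and a full pool rejects work right away instead of piling it up.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: one dedicated, bounded thread pool per dependency plus a call timeout. */
class IsolatedCommand {

    private final ExecutorService pool;
    private final long timeoutMillis;

    IsolatedCommand(int poolSize, int queueCapacity, long timeoutMillis) {
        // Bounded pool and bounded queue: when both are full, submit() throws
        // RejectedExecutionException (the default AbortPolicy).
        this.pool = new ThreadPoolExecutor(poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity));
        this.timeoutMillis = timeoutMillis;
    }

    /** Runs the call on the dedicated pool; returns the fallback on rejection, timeout or failure. */
    <T> T execute(Callable<T> remoteCall, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(remoteCall);
        } catch (RejectedExecutionException e) {
            return fallback;             // pool and queue are full: fail fast
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);         // interrupt the worker; the caller is already free to go on
            return fallback;
        } catch (ExecutionException e) {
            return fallback;             // the remote call itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        IsolatedCommand serviceA = new IsolatedCommand(2, 10, 200);
        try {
            System.out.println(serviceA.execute(() -> "fresh value", "fallback"));
            System.out.println(serviceA.execute(() -> { Thread.sleep(5_000); return "too late"; }, "fallback"));
        } finally {
            serviceA.shutdown();
        }
    }
}
```

The second call in main sleeps far longer than the 200 ms timeout, so the caller gets the fallback while the worker thread is interrupted in the background.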
As you can see, Hystrix offers more than a simple circuit breaker implementation, and with thread pools in particular it might save your Tomcat threads from being busy doing nothing but waiting for IO. If you don’t need it all, are looking for something specific, or just want to use something shiny that is still under active development, you can try resilience4j, which offers more fine-grained control over which patterns you want to use and when.
Again going line by line:
line 5-11 circuit breaker configuration.
line 14 applied circuit breaker for retrofit client.
line 20-24 configure dependent service to be broken and hit it until circuit breaker opens.
line 26-27 verify if circuit breaker opens.
line 29-31 wait until circuit breaker becomes half-open (once it opens it rejects all requests for some time).
line 33-36 make service healthy again and call it until circuit breaker closes.
line 38 verify if circuit closed after defined number of requests.
Couple more examples of using resilience4j with retrofit:
No matter which library you use, you should always carefully consider how to configure network communication in your application. If some service is critical for the health of your system, then a threshold set at a 50% error rate might not be a good configuration. You should also know when it makes sense to use this pattern and when it will do nothing but introduce an unnecessary abstraction layer. For example, it will not help with a communication route that’s called very rarely. If the user requests something 20 times a day, it is not a very good candidate to be handled with a circuit breaker (resilience4j with the default configuration will not even warm up).
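The warm-up problem is easier to see in code. Below is a small count-based window sketched under my own assumptions: the FailureRateWindow class, the -1 "not enough data" value and the numbers in main are illustrative and not taken from any library. Until the minimum number of calls has been recorded, no error rate is calculated, so the breaker has nothing to react to.

```java
/** Hypothetical sketch of a count-based failure-rate window with a minimum-calls warm-up. */
class FailureRateWindow {

    private final boolean[] outcomes;   // ring buffer of the most recent calls, true = failure
    private final int minimumCalls;     // calls required before a rate is calculated at all
    private int recorded = 0;           // number of filled slots, capped at the window size
    private int next = 0;               // slot that will be overwritten by the next call
    private int failures = 0;           // failures currently inside the window

    FailureRateWindow(int windowSize, int minimumCalls) {
        this.outcomes = new boolean[windowSize];
        // A minimum larger than the window could never be reached, so cap it.
        this.minimumCalls = Math.min(minimumCalls, windowSize);
    }

    void record(boolean failed) {
        if (recorded == outcomes.length && outcomes[next]) {
            failures--;                 // the oldest call falls out of the window
        }
        outcomes[next] = failed;
        if (failed) {
            failures++;
        }
        next = (next + 1) % outcomes.length;
        if (recorded < outcomes.length) {
            recorded++;
        }
    }

    /** Failure rate in percent, or -1 while there is too little data to judge. */
    float failureRate() {
        if (recorded < minimumCalls) {
            return -1f;
        }
        return 100f * failures / recorded;
    }

    boolean shouldOpen(float thresholdPercent) {
        float rate = failureRate();
        return rate >= 0 && rate >= thresholdPercent;
    }

    public static void main(String[] args) {
        // Assumed numbers: a 100-call window, 100 calls minimum, 50% threshold.
        FailureRateWindow window = new FailureRateWindow(100, 100);
        for (int i = 0; i < 20; i++) {
            window.record(true);        // 20 calls in a day, every single one failing
        }
        System.out.println("rate=" + window.failureRate() + " open=" + window.shouldOpen(50f));
    }
}
```

With these assumed numbers, 20 failed calls out of 20 still leave the circuit closed, because the window has not collected enough calls to be evaluated.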
You should not apply this pattern blindly everywhere; carefully consider where it makes sense. If you are not sure whether you need it, there is a good chance you are not there yet. When you start a new project or piece of functionality and don’t know what to expect, you might get away with making sure there is some kind of extension point where you can hook in a circuit breaker later if needed. Don’t simply follow Netflix, because scale does matter and something that is a must-have for them might be overkill for you.
When I was looking for more information on the circuit breaker and other communication patterns, I found the Microsoft Azure documentation, which gathers all of them in a single place.
As always, all working samples can be found on my GitHub.
7 May 2020 #communication