I’m struggling to see why this should be an either/or. Use circuit breakers to prevent overloading or abusing individual instances, and use retries or backup requests to prevent customer impact.
I've found that the best practice tends to be retry with exponential backoff, then ultimately log a failure (raising a warning with Airbrake, Rollbar, etc.) and fail hard.
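A minimal sketch of that pattern (the function name and parameters are mine, not from the article; the error-reporter call is stubbed out as a log line):

```python
import logging
import random
import time

log = logging.getLogger(__name__)

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5):
    """Retry `operation` with exponential backoff plus jitter;
    log the failure and re-raise (fail hard) after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                # In production this is where you'd notify Airbrake/Rollbar.
                log.exception("operation failed after %d attempts", attempt)
                raise
            # 0.5s, 1s, 2s, ... doubling each time, with jitter so a
            # fleet of clients doesn't retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            time.sleep(delay + random.uniform(0, delay))
```

The jitter matters in practice: without it, every client that failed at the same moment retries at the same moment too.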
It’s pretty much a semaphore, except that the author ties it to a Request Volume Threshold that limits the number of attempts per unit of time as well as the total number, which is interesting to me: I have an application that goes from a phone app to a server, and I hadn’t considered the difference.
Ten connections spread over time is different from ten simultaneous connections.
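A sketch of that distinction, assuming a limiter that bounds concurrency (a semaphore) separately from the request rate over a sliding window (class and method names are illustrative, not from the article):

```python
import threading
import time
from collections import deque

class VolumeLimiter:
    """Caps in-flight requests (e.g. 10 at once) separately from
    requests per window (e.g. 10 per second) -- two different limits."""
    def __init__(self, max_concurrent=10, max_per_window=10, window_s=1.0):
        self._sem = threading.Semaphore(max_concurrent)
        self._times = deque()          # timestamps of recent admissions
        self._lock = threading.Lock()
        self._max_per_window = max_per_window
        self._window_s = window_s

    def try_acquire(self):
        now = time.monotonic()
        with self._lock:
            # Drop timestamps that have aged out of the window.
            while self._times and now - self._times[0] > self._window_s:
                self._times.popleft()
            if len(self._times) >= self._max_per_window:
                return False           # too many requests this window
            if not self._sem.acquire(blocking=False):
                return False           # too many in flight right now
            self._times.append(now)
            return True

    def release(self):
        self._sem.release()
```

With `max_concurrent=10` you can still admit hundreds of requests per second as long as each finishes quickly; with `max_per_window=10` you cannot, even if each request is instantaneous.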
Good discussion. My only critique is that any discussion of which kind of failure is best should start from the observation that different situations call for different failure modes.
A good basic intro. The "circuit breaker" is discussed as independent of the load balancer, but often those functions are combined. How does the load balancer detect that one of its resources has gone down, or is overloaded and just slow? And how do you keep the load balancer from being a single point of failure?
Not much mention of whether requests are idempotent. Some things you don't want to retry, like payments.
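One common mitigation is an idempotency key, so a retried payment gets deduplicated server-side. A sketch (in-memory and illustrative; real systems persist the key-to-result mapping):

```python
import uuid

class PaymentServer:
    """Deduplicates charges by idempotency key, so a client that retries
    after a lost response doesn't double-charge the customer."""
    def __init__(self):
        self._seen = {}   # idempotency key -> stored result

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._seen:
            # Replay the original result instead of charging again.
            return self._seen[idempotency_key]
        result = {"charged": amount, "txn_id": str(uuid.uuid4())}
        self._seen[idempotency_key] = result
        return result
```

The client generates the key once per logical payment and reuses it across retries; only a brand-new key produces a new charge.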
Are you retrying with different resources, or in hopes that the same resource will come back up? Those imply different strategies. Older telephony systems tried one retry on a different resource, then gave up: if a call had failed twice, it probably wasn't a transient problem.
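That telephony strategy might be sketched like this (hypothetical names; the point is exactly one retry on a *different* resource, then giving up):

```python
def call_with_failover(resources, attempt_call):
    """Try the primary resource, then exactly one alternate.
    Two failures on distinct resources suggest the problem
    isn't transient, so stop rather than retry further."""
    for resource in resources[:2]:   # primary plus one alternate, no more
        try:
            return attempt_call(resource)
        except ConnectionError:
            continue
    raise ConnectionError("failed on two distinct resources; giving up")
```

Contrast with the backoff approach: here a second failure is treated as evidence about the request itself, not about a momentarily busy resource.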
This matters especially in async message-based systems. ActiveMQ (and the JMS API) has built-in support for setting a message expiration upon sending, but you can also configure all of these policies on the broker itself.
If the system is down for a day, and reports are only valid for that day, there is no sense in "catching up". Throw the old work away and start with the fresh stuff.
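In JMS terms that's the producer's time-to-live; a broker-agnostic sketch of the same idea (function and parameter names are mine), treating each queued message as a (timestamp, payload) pair:

```python
import time

def consume_fresh(queue, handle, max_age_s=86400):
    """Process only messages younger than max_age_s; drop the rest
    instead of 'catching up' on work that is no longer valid."""
    now = time.time()
    for ts, payload in queue:
        if now - ts > max_age_s:
            continue          # expired: yesterday's report is worthless
        handle(payload)
```

Letting the broker expire messages (TTL on send) is usually better than this consumer-side check, since expired work never even reaches the consumer, but the consumer-side guard is a useful backstop.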