I’m struggling to see why this should be an either/or. Use circuit breakers to prevent overloading or abusing individual instances, and use retries or backup requests to prevent customer impact.
I've found that the best practice tends to be retry with exponential backoff, then ultimately log a failure (raising a warning with Airbrake, Rollbar, etc.) and fail hard.
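A minimal sketch of that pattern (the function name and parameters are mine, not from the article; the error-reporter call is stubbed out as a log line):

```python
import logging
import random
import time

log = logging.getLogger(__name__)

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5):
    """Retry `operation` with exponential backoff plus jitter;
    log the failure and re-raise (fail hard) after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                # In production this is where you'd notify Airbrake/Rollbar.
                log.exception("operation failed after %d attempts", attempt)
                raise
            # 0.5s, 1s, 2s, ... doubling each time, with jitter so a
            # fleet of clients doesn't retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            time.sleep(delay + random.uniform(0, delay))
```

The jitter matters in practice: without it, every client that failed at the same moment retries at the same moment too.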
It’s pretty much a semaphore, except that the author ties it to a Request Volume Threshold that limits the number of attempts per unit of time as well as the total number, which is interesting to me: I have an application that goes from a phone app to a server, and I hadn’t considered the difference.
Ten connections spread over time is different from ten simultaneous connections.
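A sketch of that distinction, assuming a limiter that bounds concurrency (a semaphore) separately from the request rate over a sliding window (class and method names are illustrative, not from the article):

```python
import threading
import time
from collections import deque

class VolumeLimiter:
    """Caps in-flight requests (e.g. 10 at once) separately from
    requests per window (e.g. 10 per second) -- two different limits."""
    def __init__(self, max_concurrent=10, max_per_window=10, window_s=1.0):
        self._sem = threading.Semaphore(max_concurrent)
        self._times = deque()          # timestamps of recent admissions
        self._lock = threading.Lock()
        self._max_per_window = max_per_window
        self._window_s = window_s

    def try_acquire(self):
        now = time.monotonic()
        with self._lock:
            # Drop timestamps that have aged out of the window.
            while self._times and now - self._times[0] > self._window_s:
                self._times.popleft()
            if len(self._times) >= self._max_per_window:
                return False           # too many requests this window
            if not self._sem.acquire(blocking=False):
                return False           # too many in flight right now
            self._times.append(now)
            return True

    def release(self):
        self._sem.release()
```

With `max_concurrent=10` you can still admit hundreds of requests per second as long as each finishes quickly; with `max_per_window=10` you cannot, even if each request is instantaneous.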
Good discussion. My only critique is that any discussion of which kind of failure is best should start from the observation that different situations call for different failure modes.
A good basic intro. The "circuit breaker" is discussed as independent of the load balancer, but often those functions are combined. How does the load balancer detect that one of its resources has gone down, or is overloaded and just slow? And how do you keep the load balancer from being a single point of failure?
Not much mention of whether requests are idempotent. Some things you don't want to retry, like payments.
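One common mitigation is an idempotency key, so a retried payment gets deduplicated server-side. A sketch (in-memory and illustrative; real systems persist the key-to-result mapping):

```python
import uuid

class PaymentServer:
    """Deduplicates charges by idempotency key, so a client that retries
    after a lost response doesn't double-charge the customer."""
    def __init__(self):
        self._seen = {}   # idempotency key -> stored result

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._seen:
            # Replay the original result instead of charging again.
            return self._seen[idempotency_key]
        result = {"charged": amount, "txn_id": str(uuid.uuid4())}
        self._seen[idempotency_key] = result
        return result
```

The client generates the key once per logical payment and reuses it across retries; only a brand-new key produces a new charge.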
Are you retrying with different resources, or in hopes that the same resource will come back up? Those imply different strategies. Older telephony systems tried one retry on a different resource, then gave up: if a call had failed twice, it probably wasn't a transient problem.
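That telephony strategy might be sketched like this (hypothetical names; the point is exactly one retry on a *different* resource, then giving up):

```python
def call_with_failover(resources, attempt_call):
    """Try the primary resource, then exactly one alternate.
    Two failures on distinct resources suggest the problem
    isn't transient, so stop rather than retry further."""
    for resource in resources[:2]:   # primary plus one alternate, no more
        try:
            return attempt_call(resource)
        except ConnectionError:
            continue
    raise ConnectionError("failed on two distinct resources; giving up")
```

Contrast with the backoff approach: here a second failure is treated as evidence about the request itself, not about a momentarily busy resource.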
This matters especially in async message-based systems. ActiveMQ (and the JMS API) has built-in support for setting a message expiration upon sending, but you can also configure all of these policies on the broker itself.
If the system is down for a day, and reports are only valid for that day, there is no sense in "catching up". Throw the old work away and start with the fresh stuff.
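In JMS terms that's the producer's time-to-live; a broker-agnostic sketch of the same idea (function and parameter names are mine), treating each queued message as a (timestamp, payload) pair:

```python
import time

def consume_fresh(queue, handle, max_age_s=86400):
    """Process only messages younger than max_age_s; drop the rest
    instead of 'catching up' on work that is no longer valid."""
    now = time.time()
    for ts, payload in queue:
        if now - ts > max_age_s:
            continue          # expired: yesterday's report is worthless
        handle(payload)
```

Letting the broker expire messages (TTL on send) is usually better than this consumer-side check, since expired work never even reaches the consumer, but the consumer-side guard is a useful backstop.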