By the way, a corollary I encountered, I think with one of the recent AWS meltdo...

By the way, a corollary I encountered, I think with one of the recent AWS meltdowns, is that a paradoxical consequence of designing for "reliability" is that it guarantees that when something does happen, it's going to be bad, because the reliability engineering has done a good job of masking all the smaller faults.

Which means 1. anything that gets through, almost by definition, is going to be bad enough to escape the safeguards, and 2. when things do get bad enough to escape the safeguards, it will likely expose the avalanche of things that were already in a failure state but were being mitigated

The takeaway, which I'm not really sure how to practically make use of, was that if a system isn't observably failing occasionally in small ways, one day it's going to instead fail in a big way

I don't think that's necessarily something rigorously proven but I do think of it sometimes in the face of some mess