Hacker News

As someone who works on a system that spans 10^big nodes, I really like the point in that post. At a certain level of complexity, even your most baseline assumptions will have counterexamples, and 99.9% uptime means roughly 1 in 1000 nodes is in some state of failure at any given time. You have to build fault tolerance into the design at each layer to keep errors at the base layers from snowballing into real outages. And that’s inherently at odds with the “fail fast” principle, so you have to spend a lot of energy figuring out how to fail fast inwardly while staying operational outwardly.
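The pattern described above can be sketched roughly like this (a minimal illustration, not anyone's actual system; all names here are hypothetical): the inner per-node call raises immediately on any problem, while the outer layer absorbs those failures by trying other replicas and finally degrading to a cached answer instead of propagating the error.

```python
# Hypothetical sketch of "fail fast inwardly, stay operational outwardly".

class NodeError(Exception):
    """Raised immediately when a single node can't serve the request."""

def fetch_from_node(node_id, healthy_nodes):
    # Inner layer: fail fast -- raise as soon as this node looks bad,
    # rather than hanging or returning a half-answer.
    if node_id not in healthy_nodes:
        raise NodeError(f"node {node_id} unavailable")
    return f"fresh-data-from-{node_id}"

def serve_request(replicas, healthy_nodes, cached="stale-cached-data"):
    # Outer layer: tolerate individual node failures by trying the next
    # replica, and degrade gracefully if every replica is down.
    for node in replicas:
        try:
            return fetch_from_node(node, healthy_nodes)
        except NodeError:
            continue  # each node failed fast; move on to the next one
    return cached  # stay operational outwardly with a degraded answer

# With ~1 in 1000 nodes unhealthy, most requests hit a healthy replica;
# the rest get the cached fallback instead of becoming an outage.
print(serve_request([1, 2, 3], healthy_nodes={2, 3}))
```

The key design choice is that the outer layer, not the caller, decides what "degraded but operational" means, so per-node failures never surface as user-visible errors.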

