4 Lessons Learned from Failure Fridays

Most organizations would benefit from implementing “Failure Fridays.” What are they, exactly? Tim Armandpour explains what Failure Fridays are and what benefits they bring in an article for InformationWeek.

Throwback Thursday’s Cousin

Failure Fridays are a way to find production issues early. They are a “weekly practice for injecting failure scenarios” into your infrastructure. The purpose is to solve infrastructure problems proactively, so that issues become faster to identify and resolve. Here are four ways to operate with Failure Fridays:

  1. Keep your testing fresh.
  2. It’s not a dress rehearsal.
  3. “Gotchas” make you stronger.
  4. Hold a blame-free post-mortem.

Get together with your team each week and decide on the new thing you will be testing. It can be something small, like taking down a single process, host, or new service, or something big, like taking down an entire data center. You also need to test different classes of failure, ranging from hard shutdowns of services to connection timeouts.
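One way to keep the testing fresh is to keep a small catalog of scenarios and always pick the one that has gone longest without a run. The sketch below is only an illustration, not the method described in the original article: the `Scenario` fields, the `pick_next` helper, and the sample entries and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Scenario:
    """Hypothetical record of one failure scenario the team can inject."""
    name: str
    scope: str            # e.g. "process", "host", "data center"
    failure_class: str    # e.g. "hard shutdown", "connection timeout"
    last_run: Optional[date] = None  # None means it has never been tested


def pick_next(scenarios: List[Scenario]) -> Scenario:
    """Pick the scenario to test next: never-run ones first, then the oldest."""
    return min(
        scenarios,
        key=lambda s: (s.last_run is not None, s.last_run or date.min),
    )


# Illustrative catalog; the names and dates are made up for this example.
catalog = [
    Scenario("kill one worker process", "process", "hard shutdown", date(2024, 1, 5)),
    Scenario("drop traffic to one host", "host", "connection timeout", date(2024, 1, 12)),
    Scenario("take down a data center", "data center", "hard shutdown"),
]

chosen = pick_next(catalog)
print(f"This Friday: {chosen.name} ({chosen.failure_class})")
```

Recording `last_run` after each session means the rotation never settles on the same comfortable test, and both small and large scopes eventually get their turn.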

Next, be intentional about the failure scenario you choose to introduce. It also needs to run in the real production environment, not in a testing or pre-production one, so that the test is as close to a real-life situation as possible. The problems that genuinely surprise you, the “gotchas,” are where you can make the biggest improvements:

When we shut down an entire data center, the nodes that had been shut down wouldn’t come back up. We ended up replacing the nodes on the fly, but couldn’t do a rolling restart anymore due to the loss of quorum. We solved it by live patching some of our code to temporarily bypass Zookeeper as we repaired the cluster.

Though it took many hours to complete, we did the repair without any customers noticing, and learned quickly about the importance of process and best practices. The practice of working through failure scenarios was as important as resolving the issue itself.
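The “loss of quorum” in this story is easier to see with a little arithmetic. Majority-based coordination services such as ZooKeeper need more than half of the ensemble to be healthy before they can make progress. The sketch below only illustrates that rule; the article does not say how large the cluster was, so the five-node ensemble and the function names are assumptions.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size: int, healthy_nodes: int) -> bool:
    """True if enough nodes are healthy for the ensemble to keep working."""
    return healthy_nodes >= quorum_size(ensemble_size)


# Assumed five-node ensemble, chosen only for illustration.
ensemble = 5
for down in range(ensemble + 1):
    healthy = ensemble - down
    status = "quorum kept" if has_quorum(ensemble, healthy) else "quorum lost"
    print(f"{down} node(s) down, {healthy} healthy: {status}")
```

With five nodes, losing two is survivable, but losing a third drops the ensemble below a majority. At that point a rolling restart no longer helps, because the restarted nodes have no working majority to rejoin.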

Lastly, when the test is finished, discuss every detail of what went wrong without pointing fingers or placing blame, and compile a list of steps to prevent it from occurring again. These post-mortems are a good way to reflect and learn together and to focus on the actions that need to take place. Investing in improvements is a great way to preempt risks before they ever blossom.

You can access the original article here:
