Amazon's cloud service recently had a run of bad luck and went down, despite strict precautions and multiple backup plans. According to Amazon, a cable fault in the high-voltage utility power distribution system took down utility power, which set the backup generators in motion. However, one of those generators overheated and shut down, causing the second backup generator to kick in. And then (yes, that's right) the second backup generator failed because of an incorrectly configured circuit. All told, the entire system of generators failed. It took two hours to get the generators back online and the cloud restored, and the story goes to show that even a 1 in 1,000 situation is still a possibility. The article's author, Jon Brodkin, goes on to explain that while customers can save money by outsourcing, there is still a lot of risk:
For many customers, particularly ones without large data center budgets, outsourcing to Amazon or similar vendors makes a lot of sense even when you consider that there are occasional outages. Outages can be embarrassing, like RIM's worldwide outage affecting BlackBerry services last fall. Some can be puzzling, like one in Dublin last year affecting both Amazon and Microsoft that Amazon initially said was caused by a lightning strike hitting a generator, leading to an explosion and fire. It turned out to be a more mundane failure of a transformer operated by the local electricity company.
Mistakes (or just plain bad luck) will happen, and it's naive to expect that all services will be operational at all times. However, staying aware of the setbacks that vendor error and risk can cause is an important step in keeping your own organization on the right track.