Designing and delivering high-scale services requires automation, but you must first establish whether the service you’re providing is a good fit for automation and scaling. This paper (PDF) by James Hamilton discusses lessons learned and concepts used on the Windows Live Services Platform. In particular, Hamilton explains how good design and automation can allow operational systems to scale significantly without relying on humans.

In a list of ten recommendations, Hamilton’s paper suggests, among other things, keeping things simple and robust:

“Complicated algorithms and component interactions multiply the difficulty of debugging, deploying, etc. Simple and nearly stupid is almost always better in a high-scale service—the number of interacting failure modes is already daunting before complex optimizations are delivered. Our general rule is that optimizations that bring an order of magnitude improvement are worth considering, but percentage or even small factor gains aren’t worth it.”

Hamilton goes on to deal with the human side as well: have you organised your customer and press communication plan? This seemingly small detail can make the difference between an uncontrolled media frenzy and a managed account of events and next steps. If users know what has happened and what you plan to do next (with a timeline), they are more likely to remain satisfied with the service despite its unavailability.