On Monday, August 8, 2016, the Atlanta headquarters of Delta Air Lines suffered an “electrical problem” at about 2:30 a.m. Technicians from Georgia Power quickly determined that the cause was a failed switchgear, a high-capacity circuit-breaker assembly that routes power from two or more sources to the various systems that use it. It allows the orderly disconnection of power for maintenance. Until it fails, of course. This particular failure mattered because the switchgear fed Delta’s data center, which hosts all of Delta’s enterprise systems, from reservations to boarding to… well, everything. Delta had backup systems, but not all of them switched over. The airline, until that point among the most reliable carriers, had to cancel hundreds of flights until it could bring its systems back online. This came just days after a 12-hour outage at Southwest Airlines, triggered when a network router failed and forced a reboot of their entire system. Both airlines lost millions in revenue and even more in prestige.
Redundant Arrays of Inexpensive Everything
In a modern data center, everything is designed to survive failure. Servers are powered by two or more supplies, which are fed from uninterruptible battery-backed systems via two or more sets of power cables, which are in turn connected to a device that smoothly and automatically switches between commercial power and on-site generators. Multiple air conditioning and liquid cooling systems regulate the temperature of the servers, often managed at rack level. Network connections to several ISPs come in via different cables, feeding an array of redundant routers and switches. Like the power connections, servers have two or more network connections. Monitors and sensors report on every aspect of the infrastructure, and physical security controls prevent intrusion. Critical applications run on servers mirrored to servers in data centers at other locations, so a regional event won’t prevent continuity of operations.
Of course, this massive redundancy adds cost, but it also increases system availability. When a failure occurs, service continues: business-critical applications don’t miss a beat, and a service technician is quickly dispatched to replace the failed component. The redundant infrastructure costs more to acquire and maintain (after all, more components mean more failures to repair), but the probability of an unplanned interruption drops to near zero.
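The arithmetic behind that claim is worth seeing. As a back-of-the-envelope sketch (it assumes component failures are independent, which shared failure points like Delta’s switchgear violate), here is the standard parallel-availability calculation:

```python
def parallel_availability(a: float, n: int) -> float:
    """Availability of n redundant components, each with availability a.

    The system is down only when all n components are down at once,
    so unavailability is (1 - a) ** n. Assumes independent failures,
    a simplification that real, correlated failures can break.
    """
    return 1 - (1 - a) ** n

single = 0.99  # one power supply that is up 99% of the time
print(f"1 supply:   {parallel_availability(single, 1):.6f}")  # 0.990000
print(f"2 supplies: {parallel_availability(single, 2):.6f}")  # 0.999900
print(f"3 supplies: {parallel_availability(single, 3):.6f}")  # 0.999999
```

Two 99%-available supplies already give “four nines.” That is also why a single shared component such as a switchgear is so dangerous: once failures are correlated, the independence assumption, and the math, stop applying.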
Testing Failover, Survivability, and Continuity of Operations
It isn’t enough to buy all of this equipment and install it. Just as smart organizations hire ethical hackers to test their network security, good data centers test their survivability. Frequent testing is required to ensure that the power, temperature-control, and network infrastructure fail over when needed. Mirrored servers should be cycled on a regular schedule. Backups should be restored and used for development and test systems. When replacement components are stored on site, they should be periodically swapped into production to confirm they work, with the replaced item going back on the shelf. Fuel for generators should be tested periodically for contaminants. Locks should be exercised to ensure they won’t inhibit access in an emergency. All the little things matter: remember, that Southwest Airlines outage was caused by a single router!
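Some of these checks can be automated. A minimal sketch of one of them, probing both redundant network paths to confirm each can carry traffic on its own (the gateway addresses are hypothetical placeholders from the TEST-NET ranges, not real infrastructure):

```python
import socket

def path_is_up(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection succeeds over this path."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical addresses for the two redundant uplinks.
paths = {
    "isp-a-gateway": ("192.0.2.1", 443),
    "isp-b-gateway": ("198.51.100.1", 443),
}
for name, (host, port) in paths.items():
    status = "UP" if path_is_up(host, port) else "DOWN"
    print(f"{name}: {status}")
```

A check like this only proves a path answers today; it is no substitute for actually failing one path over and watching traffic move to the other, which is the point of the drills above.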
There is an alternative to the private data center: move to the cloud. Infrastructure-as-a-service and platform-as-a-service are viable options, and so is renting rack space in a public data center. Not all applications need the same level of survivability; the airlines don’t need the same availability for their business expense claim system as they do for their reservations system. The market now offers the ability to pay for exactly the level of availability and survivability a specific application requires. That is a level of granularity we just can’t get in a corporate data center, where everything from production enterprise applications to software development and test servers is housed under the same roof, with the same cost structure.
Not all software system requirements are about functionality. Ensure that your earliest requirements conversations establish the level of availability required and the expected volume of network traffic. Then arm your software architects and procurement team with the information they need to make cost-efficient decisions.
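Those availability conversations go better when the targets are concrete. A quick sketch of translating an availability figure into an annual downtime budget (using a 365-day year; the thresholds are illustrative, not a standard):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_budget_minutes(availability: float) -> float:
    """Minutes of permitted downtime per year for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability)

# "Two nines" through "five nines"
for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%} availability -> "
          f"{downtime_budget_minutes(target):8.1f} min/year")
```

Asking a business owner whether the expense claim system really needs less than an hour of downtime per year, at the price that implies, is a far sharper question than asking whether it should be “highly available.”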
For more brilliant insights, check out Dave’s blog: The Practicing IT Project Manager