I was recently at a vendor event where several infrastructure and operations managers had gathered, and the topic of operational availability came up. As you would imagine, everything seemed to be quantified in the number of nines. I had of course heard uptime described this way but thought, “What is the difference between five nines and six?”
If you were to tell someone in the business that their application uptime was 99.987% last month and 99.894% the month before, they would likely give you the thousand-yard stare, because the percentages tell them nothing about what happened to their service. Availability needs to be communicated in a way that has meaning. For example: your service was not available for 34 minutes last month.
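To turn percentages like these into something the business can picture, the arithmetic can be sketched in a few lines. This is only an illustration: the function name `downtime_minutes` and the 30-day month are my own assumptions, not anything from a reporting tool.

```python
def downtime_minutes(uptime_percent: float, period_days: int = 30) -> float:
    """Minutes of unavailability implied by an uptime percentage.

    Assumes a 24 x 7 service and a period of `period_days` days
    (30 by default, which is an assumption, not a calendar month).
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


# The two monthly figures quoted above, expressed as time lost.
for pct in (99.987, 99.894):
    print(f"{pct}% uptime -> about {downtime_minutes(pct):.1f} minutes unavailable")
```

Under those assumptions, 99.987% works out to roughly 5.6 minutes and 99.894% to roughly 45.8 minutes, which is a far easier conversation to have than comparing decimals.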
Think about it in day-to-day terms…
If I were to tell you that the lights in your house had an uptime of 99%, what would that mean to you? A little arithmetic shows that for roughly 88 hours of the year (1% of 8,760 hours) the lights would not ‘be available.’ Pondering this for a moment you might decide, “Well, I don’t need them all the time; I am asleep eight hours a day, so that shouldn’t be an issue.” If that downtime fell within your business hours (when you actually need your lights available), the potential impact would be greater.
So what do these nines of service really amount to from a time perspective? Below is a diagram to illustrate what this works out to for a 24-hour business need.
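The same figures can be reproduced with a short calculation. This is a rough sketch that assumes a 365-day year and a service needed 24 hours a day; the helper names `yearly_downtime_hours` and `humanize` are made up for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year (assumption)


def yearly_downtime_hours(nines: int) -> float:
    """Hours per year a 24 x 7 service may be down at a given number of nines."""
    unavailability = 10.0 ** (-nines)  # 2 nines -> 0.01, 5 nines -> 0.00001
    return HOURS_PER_YEAR * unavailability


def humanize(hours: float) -> str:
    """Pick a readable unit for a duration given in hours."""
    if hours >= 1:
        return f"{hours:.1f} hours"
    minutes = hours * 60
    if minutes >= 1:
        return f"{minutes:.1f} minutes"
    return f"{minutes * 60:.1f} seconds"


for nines in range(1, 7):
    availability = 100.0 * (1 - 10.0 ** (-nines))
    allowed = humanize(yearly_downtime_hours(nines))
    print(f"{nines} nine(s) ({availability:.4f}%): {allowed} per year")
```

On these assumptions, five nines allows a little over five minutes of downtime a year and six nines about half a minute, which answers the question I started with.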
Fundamentally, we need to understand what makes our services tick. Which hours make up the service’s availability window? Do our customers need 24 x 7 availability, or are there specific business hours in which we must ensure operation? Much like the household example above, we might not be concerned (from an availability perspective) if the lights aren’t on outside business hours.
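One way to make this concrete is to count only the part of an outage that falls inside the hours the business actually cares about. The sketch below assumes a Monday to Friday, 09:00 to 17:00 window and naive local timestamps; both are placeholders to be replaced with whatever your service agreement says.

```python
from datetime import datetime, time, timedelta

BUSINESS_START = time(9, 0)   # assumed start of the business day
BUSINESS_END = time(17, 0)    # assumed end of the business day


def business_minutes_lost(start: datetime, end: datetime) -> float:
    """Minutes of an outage that overlap the assumed Mon-Fri business window."""
    lost = timedelta()
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # Monday=0 .. Friday=4
            window_start = datetime.combine(day, BUSINESS_START)
            window_end = datetime.combine(day, BUSINESS_END)
            # Clip the outage to this day's business window.
            overlap_start = max(start, window_start)
            overlap_end = min(end, window_end)
            if overlap_end > overlap_start:
                lost += overlap_end - overlap_start
        day += timedelta(days=1)
    return lost.total_seconds() / 60


# A 90-minute outage late on a Tuesday: only the part before 17:00 counts.
print(business_minutes_lost(datetime(2024, 3, 5, 16, 30), datetime(2024, 3, 5, 18, 0)))
# A four-hour overnight outage: no business impact under these assumptions.
print(business_minutes_lost(datetime(2024, 3, 5, 22, 0), datetime(2024, 3, 6, 2, 0)))
```

The dates are arbitrary examples. The point is that the same outage log can produce very different availability figures depending on which hours you agree to measure.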
Another consideration might be how your infrastructure is built to support the service in question. Let’s assume your service comprises two web servers, as shown in the diagram. Gathering metrics from both, you find that component ‘x’ has an availability of 96% while component ‘y’ sits at 99%. How can this be, you might ask? There are many possible reasons. At the end of the day, the service availability is 99% as far as the business is concerned. However, the gap between the two components is a risk in itself: an issue with either one could affect your ability to provide the service over the long term.
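It can help to see how component figures combine in theory. The sketch below uses the standard reliability formulas and assumes the components fail independently, which real servers often do not. Whether the two web servers are redundant (either one can serve traffic) or both are required is my assumption to vary, so both cases are shown; the 99% above remains what the business actually observed.

```python
import math


def series_availability(*components: float) -> float:
    """Availability when every component must be up (independent failures assumed)."""
    return math.prod(components)


def parallel_availability(*components: float) -> float:
    """Availability when any one component is enough (independent failures assumed)."""
    return 1.0 - math.prod(1.0 - a for a in components)


x, y = 0.96, 0.99  # component availabilities from the example above

print(f"both required: {series_availability(x, y):.2%}")
print(f"either one suffices: {parallel_availability(x, y):.2%}")
```

If both servers are needed, the theoretical figure is lower than either component on its own; if either can carry the load, it is higher than both. Comparing the measured service figure against these bounds is one way to spot that a weaker component is being masked.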
This is why we need to know all the pieces involved and determine a way to manage them effectively. We have identified when the service needs to be available and what is operating in the back end to ensure it is delivered, but we also need to think about the processes that can assist the operations team in delivering that service. By understanding what drives the service operationally, we can regularly perform assessments to target areas for continual service improvement. In this case, a problem record could be raised to investigate why component x is underperforming.
In the end, it can be tricky to report on uptime for the service because the final piece (the customer experience) may not be accounted for accurately. Telling the business that they had two hours of outage time during business hours last month will only be as accurate as the input data from the IT systems. The trouble is that customers may say there were several more hours of outage that were never accounted for because they were never escalated. Despite our infrastructure metrics, we may have missed an application issue that impacted the service.
Documenting these discrepancies as the business reports them, and cross-referencing them with the IT statistics, will allow you to identify where the gaps in lost time are coming from so that you can keep improving the customer experience. We might not get it right on the first try, which is why continual service improvement is a cyclical process.
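The cross-referencing itself can start out very simply. Here is a minimal sketch that flags business-reported outage windows with no overlapping outage in the IT records. The data, the tuple layout, and the function names are all hypothetical; in practice the inputs would come from your service desk and monitoring tools.

```python
from datetime import datetime


def overlaps(a_start, a_end, b_start, b_end) -> bool:
    """True when two time windows share any time at all."""
    return a_start < b_end and b_start < a_end


def unaccounted_reports(business_reports, it_outages):
    """Business-reported (start, end) windows that match no IT-recorded outage."""
    return [
        report
        for report in business_reports
        if not any(overlaps(report[0], report[1], o[0], o[1]) for o in it_outages)
    ]


# Hypothetical sample data: (start, end) pairs.
it_outages = [
    (datetime(2024, 3, 5, 10, 0), datetime(2024, 3, 5, 10, 45)),
]
business_reports = [
    (datetime(2024, 3, 5, 10, 15), datetime(2024, 3, 5, 10, 40)),  # matches IT record
    (datetime(2024, 3, 12, 14, 0), datetime(2024, 3, 12, 15, 30)),  # not in IT records
]

for start, end in unaccounted_reports(business_reports, it_outages):
    print(f"No IT record for reported outage {start} to {end}")
```

Anything this turns up is a candidate for investigation: either the monitoring missed something (an application issue, say) or the report needs clarifying with the customer.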
For more brilliant insights, check out Ryan’s blog: Service Management Journey