I find that I quite often write about organizations that may be on the lower end of the maturity scale, and this article is no different. Most infrastructure and applications are monitored in some capacity through various tools or services. So why does something as important as the continued health of the services your business relies on seem to be an afterthought?
Since event management is part of Service Operations, one would think that in a journey to provide exceptional service this process would have the same level of review that, let’s say, Incident management does. After all, when we are talking about the lifecycle of a service, it should encompass an “end to end” view of what is going on.
Event management, by definition, is monitoring your infrastructure or application layer to notify you if something “bad” is happening or about to happen. It also lets you identify what the normal parameters of your service look like, which ultimately gives you a baseline to measure against. In my experience, one of the fundamental challenges with event management is that it lacks the visibility that something like incident management has. That is ironic, given that the two operate in the same space from a service availability perspective. Another challenge is that event management often operates in silos, with infrastructure teams monitoring only “what matters to them.” The risk in not looking at the big picture is that we may never tie the pieces together to get a better sense of the service we are delivering and the constraints that exist in providing it.
Here is an example:
Consider Application X in the diagram, with the infrastructure and customer usage described below. John E. User accesses the application Monday to Friday from 10 a.m. to 4 p.m.
- There are two application servers with a load balancer.
- Infrastructure teams monitor this 24/7, and the only threshold in place reports whether each application server is up or down.
- Server OGILP02 has had some issues lately and has only had an uptime of 75% on average.
- The plan is to replace it, but since there is another server taking the load, the rush to beg for money isn’t quite there.
- Server OGILP01 has shown some issues where, when the load exceeds 80%, some users experience performance slowness.
- None of these issues have been tracked in the form of an Incident or a Problem.
- The application support team also monitors the devices from an application level.
- Very few errors are seen at the application layer, even though the team receives escalations from the service desk about issues that they are not able to reproduce.
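The setup in the list above only knows "up" or "down," so the OGILP01 slowdown above 80% load never shows up as an event. Here is a minimal sketch of a graded check, assuming a monitoring tool can hand us a load percentage per server. The `Sample` type, the `classify` function and the sample readings are hypothetical; only the 80% figure comes from the scenario. The three levels (informational, warning, exception) follow the usual event management categories.

```python
from dataclasses import dataclass

# The 80% figure comes from the OGILP01 observation in the scenario;
# everything else here is a hypothetical illustration.
WARNING_LOAD_PCT = 80.0


@dataclass
class Sample:
    server: str
    is_up: bool
    load_pct: float


def classify(sample: Sample) -> str:
    """Map one monitoring sample to an event level."""
    if not sample.is_up:
        return "exception"      # server is down: act now
    if sample.load_pct > WARNING_LOAD_PCT:
        return "warning"        # still up, but users may see slowness
    return "informational"      # normal operation, useful for the baseline


if __name__ == "__main__":
    # Made-up readings, for illustration only.
    readings = [
        Sample("OGILP01", True, 85.0),
        Sample("OGILP02", False, 0.0),
        Sample("OGILP01", True, 40.0),
    ]
    for r in readings:
        print(r.server, classify(r))
```

With a strictly up/down check, the first reading would have looked perfectly healthy. The warning level is what gives the service desk escalations something to line up against.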
The view from the IT department is that the service is up and running and, by all accounts, is pretty solid. From the customers’ perspective, aside from some intermittent slowness, the service is pretty stable. For the most part life is good for John and the people at AnyCorp, but what they don’t realize is that something bad is about to happen.
Monday morning, John rolls in to find that he cannot launch Application X. He calls his Service Desk, who assure him that they will take care of it. They escalate to the infrastructure team, who report an alert from last night showing that OGILP02 had once again fallen over and required a manual restart. The service analyst is sharp as a tack and also calls the application manager, who explains that with only one server available, calls to the application were timing out. The application manager extends the timeout from 60 seconds to 180 seconds to account for this in the future. Both the infrastructure and application teams fix everything and all is well... or is it?
What did we learn?
While the monitoring did tell us everything we needed to know, we did not track it anywhere we could put that information to use. By integrating these events with the Service Operation processes we could:
- Address availability concerns by quantifying any known issues.
- Establish any capacity targets.
- Investigate the root cause of the OGILP01 slowness.
- Gather solid statistics to make the case for funding a more robust environment. Three servers, for example, would allow us to provide high availability.
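To show what "solid statistics" might look like, here is a rough availability calculation. It is only a sketch under stated assumptions: the 75% figure for OGILP02 comes from the scenario, while the 99% figures for OGILP01 and for a third server are made up, and the formula assumes the servers fail independently of each other, which real environments often do not.

```python
from math import prod


def parallel_availability(availabilities):
    """Chance that at least one server is up, assuming independent failures."""
    return 1 - prod(1 - a for a in availabilities)


OGILP02 = 0.75       # average uptime from the scenario
OGILP01 = 0.99       # assumption: not stated in the scenario
THIRD_SERVER = 0.99  # assumption: a hypothetical replacement or addition

if __name__ == "__main__":
    two = parallel_availability([OGILP01, OGILP02])
    three = parallel_availability([OGILP01, OGILP02, THIRD_SERVER])
    print(f"two servers:   {two:.4%}")
    print(f"three servers: {three:.4%}")
```

Keep in mind that "at least one server is up" flatters the service. In the story, one server on its own led to timeouts, so a fairer target would count the service as available only when enough servers are up to carry the load. That is exactly the kind of capacity question that recorded events let you answer.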
The challenge, as always, is to market to our teams why these should be integrated. If there is a way to do this with as little manual effort as possible, all the better.
Keep this in mind: while there may be a little work in the beginning, you will save that effort later.
It boils down to proactively solving issues before they are apparent to the customer (which is why you have the monitoring in the first place oddly enough). Implementing this may require small steps, but their successes will enable you to show other teams event management’s benefits.
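One small step could be as simple as counting repeated exception events per server and flagging the repeat offenders for a Problem record. The sketch below assumes events can be exported as simple records with a `server` and a `level` field. The record layout, the `problem_candidates` function and the threshold of three are all hypothetical, not the format of any particular monitoring tool.

```python
from collections import Counter


def problem_candidates(events, min_count=3):
    """Return servers with at least min_count exception events, sorted by name."""
    counts = Counter(e["server"] for e in events if e["level"] == "exception")
    return sorted(server for server, n in counts.items() if n >= min_count)


if __name__ == "__main__":
    # Made-up event export, for illustration only.
    events = [
        {"server": "OGILP02", "level": "exception"},
        {"server": "OGILP01", "level": "warning"},
        {"server": "OGILP02", "level": "exception"},
        {"server": "OGILP02", "level": "exception"},
        {"server": "OGILP01", "level": "exception"},
    ]
    for server in problem_candidates(events):
        print(f"Raise a Problem record for {server}")
```

Even a report like this, run once a week, would have put OGILP02 in front of the right people before Monday morning did.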
For more brilliant insights, check out Ryan’s blog: Service Management Journey