Fact: Your IT operations are monitoring your infrastructure in some capacity. Whether it is network traffic, database activity, application health, or a combination of these, your goal is to ensure stability.
How well is this working out?
Depending on your IT organizational structure, each “silo” may answer that everything is working rather well, so you might want to reframe the question to determine what value your monitoring is adding.
Viewed solely from a monitoring perspective, the tooling does exactly what it was designed to do: it reports that the server is up, that memory usage is high, and so on. But without some mechanism to manage the baselines and someone (or something) accountable for the output, you may never fully realize the potential of the tools you have at your disposal.
One of the first challenges is identifying all the tools we are using to do the job. Are there any overlaps in what these tools are doing? What costs are associated with monitoring with disparate systems? What are the Operations teams doing with the alerts that they receive today? This last question takes a bit of professional honesty; if we are just opening the alert emails in bulk and acknowledging them in the monitoring tool, we need to know that. We all know it happens. Sometimes alert thresholds are set up out of the box and we just never seem to have time to go back and configure them.
Next, we should look at who currently has visibility into what is monitored and whether anyone else should. Many tools have slick dashboards where execs can see what is going on, while others drive workflows that create incidents and escalate them to the appropriate teams. We need to outline the process that manages this and the outputs we need in order to succeed.
Ultimately, we should be able to match any incident with an alert, if one exists in our monitoring environment. If we are unable to line them up, not to worry: the mismatch is itself an important indicator, and may suggest one of the following issues:
- If we see incidents and no events, we may need to address thresholds that are too high.
- If we see events and no incidents, our thresholds may be too low.
- It is possible that those managing incidents and the event monitoring simply need to communicate.
Either way, this comparison lets us manage the baselines of our services.
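The comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: incidents and alerts are taken to be simple `(service, timestamp)` records exported from your own tools, the function name `reconcile` is hypothetical, and the 30-minute matching window is an arbitrary choice, not a recommendation.

```python
from datetime import datetime, timedelta


def reconcile(incidents, alerts, window=timedelta(minutes=30)):
    """Pair incidents with alerts for the same service within `window`.

    Both inputs are lists of (service_name, timestamp) tuples (an assumed,
    simplified record shape). Returns a tuple:
    (incidents_without_alerts, alerts_without_incidents).
    """
    matched = set()  # indexes of alerts that line up with some incident
    orphan_incidents = []
    for service, opened_at in incidents:
        hits = [
            i for i, (alert_service, fired_at) in enumerate(alerts)
            if alert_service == service and abs(fired_at - opened_at) <= window
        ]
        if hits:
            matched.update(hits)
        else:
            orphan_incidents.append((service, opened_at))
    orphan_alerts = [a for i, a in enumerate(alerts) if i not in matched]
    return orphan_incidents, orphan_alerts


if __name__ == "__main__":
    t = datetime(2024, 1, 2, 10, 0)
    incidents = [("email", t), ("payroll", t)]
    alerts = [("email", t - timedelta(minutes=5)), ("intranet", t)]
    no_alert, no_incident = reconcile(incidents, alerts)
    print("Incidents with no alert (thresholds too high?):", no_alert)
    print("Alerts with no incident (thresholds too low?):", no_incident)
```

A long list in either output is a prompt for a conversation between the incident and monitoring teams, not an automatic verdict on the thresholds.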
Which brings us to what we are monitoring… and why. Just because we can monitor everything doesn’t necessarily mean we need to. We need to fully understand what services we are providing and which events indicate that a service cannot be used.
For example, if we are monitoring whether a server is up or down, there are several considerations from the business perspective:
- Does the time of day make a difference regarding the use of the service? Is it 24/7 or not? Does an outage on Saturday make this less critical?
- What is the impact of the service outage based on duration—does the impact increase the longer the service is unavailable? Are there specific actions that need to be taken at intervals in time? If so, how often does the monitoring alert us to ensure we are correcting potential issues in a timely way?
- Understanding the architecture of the service is equally important. In a clustered environment, for example, one unavailable server may not cause an outage. Although the alert says the server is down, what we actually see may be degraded performance of the service.
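To make these considerations concrete, here is a minimal Python sketch that turns a "server down" alert into a business-facing state and priority. Everything in it is an assumption for illustration: the `ServiceProfile` fields, the `assess` function, the four-level priority scale, and the 60-minute escalation point are hypothetical and would need to be agreed with the business.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class ServiceProfile:
    name: str
    business_days: frozenset      # weekday numbers, Monday = 0
    opens: time                   # start of business hours
    closes: time                  # end of business hours
    nodes_total: int              # servers in the cluster
    escalate_after_min: int = 60  # assumed escalation point


def assess(profile, nodes_down, now, minutes_down):
    """Return (state, priority); priority 1 is most urgent, 4 is least."""
    if nodes_down <= 0:
        return ("ok", 4)
    # Architecture: losing part of a cluster degrades rather than stops service.
    if nodes_down < profile.nodes_total:
        state, priority = "degraded", 3
    else:
        state, priority = "outage", 2
    # Time of day: the same event matters more during business hours.
    in_hours = (now.weekday() in profile.business_days
                and profile.opens <= now.time() < profile.closes)
    priority += -1 if in_hours else 1
    # Duration: impact grows the longer the service is unavailable.
    if minutes_down >= profile.escalate_after_min:
        priority -= 1
    return (state, max(1, min(4, priority)))


if __name__ == "__main__":
    payroll = ServiceProfile("payroll", frozenset(range(5)),
                             time(8, 0), time(18, 0), nodes_total=2)
    saturday = datetime(2024, 1, 6, 10, 0)
    tuesday = datetime(2024, 1, 2, 10, 0)
    print(assess(payroll, 1, saturday, 10))  # one node down at the weekend
    print(assess(payroll, 2, tuesday, 90))   # full outage in business hours
```

The point of the sketch is that the same raw alert yields different priorities depending on when it fires, how long it lasts, and how the service is built.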
Some of these considerations play into availability a bit, so we want to ensure we have enough “service” for the business need. Remember that overdoing it won’t necessarily be cost-effective. (See “Continuous Availability: Is It Really Necessary?”)
Remember to work with your business to prioritize your services based on business need; don’t make assumptions about what you think they need. This is one more opportunity to talk with the business about what matters to them. Keep the dialog open.
For more brilliant insights, check out Ryan’s blog: Service Management Journey