The key goal of monitoring is to increase situational awareness. There is no automation without observability, and there is very little situational awareness without monitoring. The role of monitoring is to deliver observable data in a human-readable format. We can, however, extract additional value by taking a few extra steps.
We aim to provide this kind of information to application management personnel, enabling them to take corrective measures in a timely fashion, even before end users notice there was ever a problem.
For any of these areas, there are many monitoring solutions available. Some of them aim to cover more than one area at once and offer APIs or plugin integration. These extension points allow us to build a customized monitoring solution that fills the missing links in the toolchain the customer uses.
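To make the plugin-style extension point concrete, here is a minimal sketch of a custom check written against the widely used Nagios plugin convention (one status line on stdout, exit codes 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The thresholds and the monitored mount point are illustrative assumptions, not values taken from any particular tool.

```python
#!/usr/bin/env python3
"""Sketch of a custom check following the Nagios plugin convention.
Thresholds and the monitored path are illustrative assumptions."""

import shutil
import sys

WARN_PCT = 80      # hypothetical warning threshold (% used)
CRIT_PCT = 90      # hypothetical critical threshold (% used)
MOUNT_POINT = "/"  # hypothetical filesystem to watch


def main() -> int:
    try:
        usage = shutil.disk_usage(MOUNT_POINT)
        used_pct = usage.used / usage.total * 100
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return 3

    # Performance data after the pipe lets the monitoring tool graph the value.
    perf = f"used={used_pct:.1f}%;{WARN_PCT};{CRIT_PCT}"
    if used_pct >= CRIT_PCT:
        print(f"DISK CRITICAL - {used_pct:.1f}% used on {MOUNT_POINT} | {perf}")
        return 2
    if used_pct >= WARN_PCT:
        print(f"DISK WARNING - {used_pct:.1f}% used on {MOUNT_POINT} | {perf}")
        return 1
    print(f"DISK OK - {used_pct:.1f}% used on {MOUNT_POINT} | {perf}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

A check like this can be scheduled by the monitoring tool itself, so the custom metric lands on the same dashboards and alert channels as the built-in ones.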
A good approach for establishing monitoring for complex HA solutions is to review past incidents, analyze the “lessons learned”, and identify critical system segments and settings.
It is useful to create a specification that our ideal monitoring tool should fulfill. However, it is unrealistic to expect a single monitoring system to meet all of our expectations. We should therefore consider one or several tools that cover as many needs as possible and that offer a useful API or plugin-based third-party integration. This is by no means an exclusion criterion, but even if we do not need such an extension at the moment, it is always nice to have the option in the future.
Once we have selected our monitoring solution and have it up and running, it efficiently observes the system, gathers data, and presents it to personnel in a human-readable format. An email, SMS, or monitoring screen with condensed information is a common approach, but…
We aim to enable a response that is both quick and appropriate. For instance, let’s say an error occurs and the system starts to misbehave. Monitoring picks this up and triggers an event or alarm, sending an email to the staff or blinking red until someone notices it. Sound alarms are not off the table either, as we can hear an alarm even when we are not looking at the display; they can even be enhanced into voice alarms with short descriptions of the different kinds of errors. At this point, there are a number of common actions that alerted personnel can take (depending on the nature of the problem): check the network, check the application on the server, isolate and analyze parts of the application log, check the OS log, system resources, and DB availability, review recent OS patches, and so on. Every single time.
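Much of that routine first-response checklist can be condensed into a single smoke test. The sketch below assumes a systemd host and uses hypothetical hostnames, ports, and paths; the database check is a plain TCP reachability test, standing in for whatever availability probe the real environment uses.

```python
#!/usr/bin/env python3
"""Sketch of a smoke test automating the routine first-response checks.
All hostnames, ports, and paths are hypothetical placeholders."""

import shutil
import socket
import subprocess

APP_HOST, APP_PORT = "app01.example.com", 8443  # hypothetical application endpoint
DB_HOST, DB_PORT = "db01.example.com", 5432     # hypothetical database endpoint


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Basic network / service availability check."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def disk_headroom(path: str = "/") -> str:
    """System resource check: free space on a given filesystem."""
    usage = shutil.disk_usage(path)
    return f"{usage.free / usage.total:.0%} free on {path}"


def recent_os_errors(lines: int = 20) -> str:
    """Tail the system journal for recent errors (assumes a systemd host)."""
    out = subprocess.run(
        ["journalctl", "-p", "err", "-n", str(lines), "--no-pager"],
        capture_output=True, text=True, check=False,
    )
    return out.stdout.strip() or "no recent errors"


def smoke_test() -> str:
    """Condense the usual manual checklist into one report."""
    findings = [
        f"app  {APP_HOST}:{APP_PORT} reachable: {tcp_reachable(APP_HOST, APP_PORT)}",
        f"db   {DB_HOST}:{DB_PORT} reachable: {tcp_reachable(DB_HOST, DB_PORT)}",
        f"disk {disk_headroom()}",
        f"os log (recent errors):\n{recent_os_errors()}",
    ]
    return "\n".join(findings)


if __name__ == "__main__":
    print(smoke_test())
```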
But why would you check the network if it's fine, or check for recent OS patches if none were applied? Because you don’t know it's fine, right? You are in the dark.
If we create a system that triggers a reporting script or a smoke test every time an error occurs, we can acquire this or any other custom information in a matter of seconds, thus reducing the MTTR (Mean Time To Recovery). This approach lets us isolate the nature and area of a technical emergency much faster and may also suggest corrective measures. If we allow it, these corrective measures can even be automated, boosting system resilience. The staff still needs to be made aware of any such self-repairs, for example through a dedicated report of the automated actions that were taken.
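As a minimal sketch of that idea, the handler below could be wired to a monitoring tool's event or alarm hook: it diagnoses, attempts one guarded corrective action, and records every automated action so the staff stays informed. The unit name and log path are hypothetical, and a systemd host is assumed.

```python
#!/usr/bin/env python3
"""Sketch of an alert-triggered handler: diagnose, optionally self-repair,
and report every automated action taken. Service name and log path are
illustrative assumptions; a systemd host is assumed."""

import datetime
import subprocess

SERVICE = "myapp.service"                       # hypothetical systemd unit to watch
REPORT_PATH = "/var/log/auto_remediation.log"   # hypothetical audit log


def service_active(unit: str) -> bool:
    """`systemctl is-active` exits 0 only when the unit is running."""
    return subprocess.run(["systemctl", "is-active", "--quiet", unit],
                          check=False).returncode == 0


def restart_service(unit: str) -> bool:
    """Guarded corrective measure; returns True if the restart succeeded."""
    return subprocess.run(["systemctl", "restart", unit],
                          check=False).returncode == 0


def handle_alert() -> str:
    """Entry point invoked by the monitoring tool's event/alarm hook."""
    actions = []
    if not service_active(SERVICE):
        ok = restart_service(SERVICE)
        actions.append(f"restarted {SERVICE}: {'success' if ok else 'FAILED'}")
    else:
        actions.append(f"{SERVICE} already running, no action taken")

    # The staff must stay aware of self-repairs: persist and return the report.
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    report = f"[{stamp}] automated actions: " + "; ".join(actions)
    with open(REPORT_PATH, "a", encoding="utf-8") as log:
        log.write(report + "\n")
    return report


if __name__ == "__main__":
    print(handle_alert())
```

The returned report can be attached to the alert email or pushed to the monitoring screen, so the automated repair never happens silently.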
Splunk, Elasticsearch, Nagios, AppDynamics, AVI Networks, CA Technologies, Amazon CloudWatch, CloudMonix, New Relic, SolarWinds, Microsoft OMS, Oracle, Dynatrace, Datadog, and Stackify Retrace, to name a few, are just some of the companies and tools innovating in the field of monitoring today. Each of them aims at a particular segment, technology, or OS ecosystem and offers its own set of features.