We usually think of logs and traces as continuously appended, chronologically ordered records kept in plain text files with a relatively simple purpose: to keep a record of operating system, network, application, or service events and activities. In this general view, logs provide an audit trail of what happened behind the scenes, so that we can find out what went wrong and when, what could be improved, and sometimes even that a component is not working at all. Traditionally, log data is stored in log files inside a specific operating system or application folder. The difference between logging and tracing lies mostly in the density of information: whereas logs record discrete events at a higher level, tracing provides a very detailed, continuous view of an application’s or service’s activity.
Maintaining logs and trace files can be expensive. Software components have to be instrumented or configured to generate log or trace records. Logs occupy storage. They use processor time and memory and, in many cases, network resources as well. They also require manual maintenance from time to time. It’s wise to think twice before enabling detailed tracing for an application, application component, or service. For example, database systems keep their redo logs as a history of all changes made over a relatively short time period.
Metrics should primarily be viewed in the context of monitoring, whether of an information system or of some other technical system, e.g. in the manufacturing or heavy machinery domain.
The main question is where these metrics originate. In fact, metrics are collected and aggregated from log or trace records, or from something conceptually very similar, to the point that some well-known data streaming platforms store data records or messages in the form of logs.
Each record consists of an event source identification key, a value, and a timestamp precise to the millisecond. When the record timestamp is known, it’s possible to calculate the time span between two consecutive occurrences of the same event type. It’s also possible to aggregate these time spans along different time dimensions such as hour, day, month, or quarter. Moreover, the record value can contain attributes to be used as additional dimensions and measures. Hence, once we have data for aggregated measures and dimensions, including the time dimension, we have everything we need to create monitoring visualizations and reports in the form of charts, tables, and notifications. This applies to information systems as well as to other technical systems in the manufacturing, construction, or logistics domain, and maybe even to weather tracking. Two elements are always present: events and data. An event can, for instance, be triggered by a repetitive sensor reading, by a device status change, when a measured value reaches a specific threshold, when a service sends specific data, or when a component fails.
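The record structure and the two calculations described above can be illustrated with a minimal Python sketch. The class name `EventRecord`, the helper functions, and the `pump-1` sample data are hypothetical and chosen only for illustration; a real system would typically read such records from a log file or a streaming platform.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class EventRecord:
    source_key: str      # identifies the event source (hypothetical naming)
    value: dict          # payload; its attributes can act as extra dimensions
    timestamp: datetime  # carries sub-second precision


def intervals_by_source(records):
    """Return {source_key: [timedelta, ...]} between consecutive occurrences."""
    last_seen = {}
    gaps = defaultdict(list)
    for rec in sorted(records, key=lambda r: r.timestamp):
        if rec.source_key in last_seen:
            gaps[rec.source_key].append(rec.timestamp - last_seen[rec.source_key])
        last_seen[rec.source_key] = rec.timestamp
    return dict(gaps)


def count_per_hour(records):
    """Aggregate the number of events per (source_key, hour) bucket."""
    counts = defaultdict(int)
    for rec in records:
        hour = rec.timestamp.replace(minute=0, second=0, microsecond=0)
        counts[(rec.source_key, hour)] += 1
    return dict(counts)


# Made-up sample data: three status changes of one device.
records = [
    EventRecord("pump-1", {"status": "on"}, datetime(2024, 1, 1, 10, 0, 0, 250000)),
    EventRecord("pump-1", {"status": "off"}, datetime(2024, 1, 1, 10, 0, 2, 750000)),
    EventRecord("pump-1", {"status": "on"}, datetime(2024, 1, 1, 11, 30, 0)),
]
gaps = intervals_by_source(records)
hourly = count_per_hour(records)
```

Here `gaps` holds the time spans between consecutive `pump-1` events, and `hourly` holds the event counts per hour. Truncating the timestamp differently (to the day or month, for example) gives the other time dimensions mentioned above.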
Considering the amount of information generated by events, automatic data analysis, routing, transformation, and distribution should be possible.
With numerous applications and services deployed on-premises, in the cloud, or in a combination of both, it’s very hard to detect potential problems, threats, and behaviors at the right time by manual analysis. Therefore, it’s important to establish procedures for automatic log analysis and notification.
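As a rough illustration of automatic log analysis and notification, the following Python sketch counts error records per service and reports the services that reach a threshold. The line layout, the service names, and the threshold are assumptions made for this example, not a standard log format.

```python
import re
from collections import Counter

# Assumed line layout: "<ISO timestamp> <LEVEL> <service> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<msg>.*)$"
)


def analyze(lines, threshold=2):
    """Count ERROR records per service; keep services at or above threshold."""
    errors = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level") == "ERROR":
            errors[match.group("service")] += 1
    return {svc: n for svc, n in errors.items() if n >= threshold}


def notify(findings, send=print):
    """Hand each finding to a delivery callback (e-mail, chat, pager, ...)."""
    for service, count in sorted(findings.items()):
        send(f"ALERT: {service} logged {count} errors")


# Made-up sample lines; the last one does not match the layout and is skipped.
sample = [
    "2024-01-01T10:00:00.120 INFO  billing request served",
    "2024-01-01T10:00:01.340 ERROR billing database timeout",
    "2024-01-01T10:00:02.010 ERROR billing database timeout",
    "2024-01-01T10:00:02.500 ERROR auth token expired",
    "not a structured line",
]
findings = analyze(sample)
notify(findings)
```

The `send` parameter keeps the analysis independent of the delivery channel: the same findings could go to e-mail, a chat webhook, or a paging service by passing a different callback.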
Log analysis serves the following purposes:
Large-scale log analysis for IT infrastructure management requires reliable tools and thoughtfully designed implementations to get business value from the log data. The acquired information about trends and system behavior can cover network traffic, errors, uptime, user access, the amount of transferred data, etc. In fact, it includes every type of information that IT managers, system administrators, software developers, or security officers can use to make business or technical decisions.
In order for log data to be useful for analysis, it needs to follow several rules and practices.
Obviously, generating and writing log records is only half of the effort needed to gain real benefits. With automatic log analysis, it’s possible to mitigate risks quickly, identify threats, find the root cause of an incident, and respond promptly and adequately.
Once the log data is centralized and structured, the final step is monitoring and alerting.
Sometimes the whole process of data collection, aggregation, and analysis is called the monitoring system. However, since data sources can be distributed among many nodes, transported over different communication channels, and processed by a large number of Big Data or AI systems, it’s hard to define all of them as a single monitoring system. Hence, it may be better to consider monitoring and alerting as the presentation and reporting of the final result based on the generated metrics.
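Seen this way, alerting reduces to evaluating rules against already aggregated metrics. The sketch below shows one possible shape for such rules in Python. The `AlertRule` class, the metric names, and the thresholds are hypothetical and not taken from any particular monitoring product.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlertRule:
    metric: str                              # name of an aggregated metric
    compare: Callable[[float, float], bool]  # e.g. operator.gt or operator.lt
    threshold: float
    message: str


def evaluate(rules, metrics):
    """Return the messages of all rules whose condition holds for the snapshot."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        # A metric missing from the snapshot never fires a rule.
        if value is not None and rule.compare(value, rule.threshold):
            fired.append(f"{rule.message} ({rule.metric}={value})")
    return fired


# Hypothetical rules and a made-up metrics snapshot.
rules = [
    AlertRule("error_rate_pct", operator.gt, 5.0, "Error rate too high"),
    AlertRule("uptime_pct", operator.lt, 99.0, "Uptime below target"),
]
snapshot = {"error_rate_pct": 7.5, "uptime_pct": 99.9}
alerts = evaluate(rules, snapshot)
```

Keeping the rules as plain data makes it possible to change thresholds without touching the collection and aggregation stages that produce the metrics.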
The data generated by events or measured by sensors can be stored in log files and analyzed manually after something has already gone wrong. Such an approach is a waste of time and money, especially when we have advanced technologies for data streaming, analysis, routing, and transformation at our disposal. One log record can take part in many correlated analyses. Whether raw or aggregated, it can be transported through EAI/ESB, a message queue, microservices, or a data streaming system into an SQL or NoSQL database or BI storage, processed by Big Data or AI software, and finally visualized on a dashboard as charts and gauges, or it can generate alerts by sending notifications to mobile devices, via e-mail, or to specialized alerting devices. The possibilities and applications are almost endless, ranging from a single desktop PC, smart home appliances, and control systems to industrial plants and agricultural fields. A correctly implemented event and metric management system results in better system performance, higher uptime, fewer wasted resources, improved quality of service, and a higher level of security.