Rethinking Reliability: What you can (and can't) learn from incidents

What exactly is an incident? Different industries define incidents in different ways. Reduced to a common denominator, most definitions focus on unintended and often unforeseen events that cause unwanted consequences. There are also intentional events whose sole purpose is to damage a system. The author of this article has 11 years of experience in such an organization (Fintech industry) and has successfully resolved several thousand incidents of various priorities.

In this blog, I will try to introduce the topic of incidents using the example of an organization with a large IT department that supports the organization's primary business. The IT department consists of various departments, among them Application Development and Application Support, Database Administration (DBA), Telecommunications and Networks, Production, Helpdesk, Service Desk, etc.

In the IT world, an incident is defined as an unplanned interruption of an entire service or part of a service, or a reduction in service quality. Incidents can be of low priority (something that does not affect the overall operation of the system) or of high priority (something that affects the security and stability of the entire system). A problem, by contrast, is the unknown cause of one or more incidents. The priority matrix, or incident matrix, follows the concept of ITIL (Information Technology Infrastructure Library), in which impact and urgency are the primary factors that determine the relative priority of resolving an incident. Each organization defines its own incident matrix, so in addition to impact and urgency, the affected business service can also be one of the parameters. Once impact and urgency are combined with the other parameters of the matrix, most incidents fall into at least three basic categories:

Low – low-priority incidents
Medium – medium-priority incidents
High – high-priority incidents
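As an illustration of how impact and urgency combine into one of these three categories, here is a minimal sketch of a priority matrix. The concrete cell values, the `Level` enum and the `incident_priority` function are assumptions made for this example; every organization fills in its own matrix and may add further parameters such as the business service.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical 3x3 matrix: (impact, urgency) -> priority.
# The values are illustrative only, not taken from ITIL or from any real organization.
PRIORITY_MATRIX = {
    (Level.LOW, Level.LOW): Level.LOW,
    (Level.LOW, Level.MEDIUM): Level.LOW,
    (Level.LOW, Level.HIGH): Level.MEDIUM,
    (Level.MEDIUM, Level.LOW): Level.LOW,
    (Level.MEDIUM, Level.MEDIUM): Level.MEDIUM,
    (Level.MEDIUM, Level.HIGH): Level.HIGH,
    (Level.HIGH, Level.LOW): Level.MEDIUM,
    (Level.HIGH, Level.MEDIUM): Level.HIGH,
    (Level.HIGH, Level.HIGH): Level.HIGH,
}


def incident_priority(impact: Level, urgency: Level) -> Level:
    """Look up the priority for a given impact and urgency."""
    return PRIORITY_MATRIX[(impact, urgency)]


if __name__ == "__main__":
    # A wide-impact incident that must be fixed immediately.
    print(incident_priority(Level.HIGH, Level.HIGH).name)
    # A narrow-impact incident that can wait.
    print(incident_priority(Level.LOW, Level.MEDIUM).name)
```

Keeping the matrix as plain data rather than nested conditions makes it easy to review and adjust when the organization changes its rules.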

There are two basic levels of support:

HelpDesk is the first level. It receives complaints and reported incidents through the tool for logging incidents and problems, or through a secondary channel such as email or a phone call, in which case the incident is then logged in the same tool. The HelpDesk tries to resolve the problem if it falls within the range of actions its staff are trained for. If it does not, the incident is forwarded to the second level of support.
Service Desk is the second level of support that receives incidents reported to the first level and tries to resolve the incident if the resolution procedure is already known and predefined. If not, other support services get involved, including the application service, the DBA service or, for example, an external vendor for a specific service.
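The escalation path between the two levels can be sketched roughly as follows. The category names, the `HELPDESK_SKILLS` and `SERVICE_DESK_RUNBOOKS` sets, and the `route` function are hypothetical and exist only to illustrate the flow described above; a real ticketing tool would hold this logic in its own configuration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Incident:
    summary: str
    category: str
    # Every support level that touched the incident, in order.
    handled_by: List[str] = field(default_factory=list)


# Hypothetical examples of what each level is able to resolve on its own.
HELPDESK_SKILLS = {"password_reset", "printer"}
SERVICE_DESK_RUNBOOKS = {"restart_batch_job", "disk_cleanup"}


def route(incident: Incident) -> str:
    """Return the level that ends up resolving the incident."""
    # First level: HelpDesk resolves what its staff are trained for.
    incident.handled_by.append("HelpDesk")
    if incident.category in HELPDESK_SKILLS:
        return "HelpDesk"

    # Second level: Service Desk resolves incidents with a known, predefined procedure.
    incident.handled_by.append("Service Desk")
    if incident.category in SERVICE_DESK_RUNBOOKS:
        return "Service Desk"

    # Otherwise other support services get involved (application, DBA, external vendor).
    incident.handled_by.append("Specialist team")
    return "Specialist team"


if __name__ == "__main__":
    ticket = Incident("Nightly batch did not start", "restart_batch_job")
    print(route(ticket), ticket.handled_by)
```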

Incident resolution

Depending on the category of the incident, one person, one team or several teams from different parts of the organization can participate in its investigation or resolution. An incident may initially be classified as low priority and then, during the investigation, turn out to be high priority, in which case additional teams must be involved to analyze and resolve it. If one person worked on the incident analysis, that person must write a report after completing the investigation, stating the cause of the incident, how it was resolved, and what steps are proposed to prevent it from happening again. If the cause of the incident is application-related, a proposal to refine the program could be enough to prevent the same or a similar incident from happening again.

Such a proposal is entered into the incident database. After being analyzed, the priority of its implementation is defined. If several different teams participated in the incident resolution, it is necessary for each of the teams to make an analysis stating what caused the incident, how they participated in resolving it, and proposing how to prevent the same or similar incident from happening again. The ideal situation is to have a department that analyzes incident reports submitted by various departments within the organization. The role of such a department would be to monitor how and in what time frame the proposed solutions will be implemented in order to prevent the recurrence of incidents.

Learning from incidents

Learning from incidents is a key part of organizational development, and can be defined as a process through which employees and the organization as a whole strive to understand all the events that occurred and caused the incident in order to prevent the same or similar incidents from happening again in the future. The focus should be on ensuring that the lessons learned from investigating what led to the incident are implemented in the near future. According to the ISO 45001 standard, investigating the root causes of incidents, implementing appropriate measures and communicating them throughout the organization is considered key to the process of continuous improvement.

Learning from incidents requires modifying existing knowledge about a part of the system or acquiring completely new knowledge necessary for the operation of the system as a whole.

The process of detecting and resolving incidents can be divided into several steps:

Some incidents can be resolved with a workaround to keep the system running without major downtime. A workaround is used most often when an incident must be eliminated quickly so that the system can keep operating while a permanent solution is prepared for later implementation. The professional literature often stresses that certain factors must be present for an organization to learn from incidents. One of these factors is an organizational safety culture, which ensures transparency, a no-blame culture, and feedback on unsafe behavior.

Incident response

Another important factor for an organization is having a clearly defined incident response process when an incident occurs. The procedure will not be the same for all types of incidents. A low-priority incident usually involves a single team. In that team, the person most competent to solve the incident will take over the analysis of the incident and, if necessary, involve other people from the team or ask another team for help if an analysis from another part of the organization is required.

Medium-priority incidents usually require analysis by two or more teams. Such incidents are identified by the fact that, in the execution chain of a process, a problem can be found on at least two directly connected system components. In the IT world, this means involving a development team for analysis at the application level and a DBA team to analyze the behavior of the database and the problems observed in that segment. This does not mean that a medium-priority incident necessarily requires the involvement of several people. A single person can be in charge of an incident in the production process.

High-priority incidents are those that require the fastest response because the operation of the entire system is at risk. In these cases, the procedure should mandate a teleconference of all teams relevant to the analysis and resolution of the problem. The organization raises an incident of this priority when a problem has been identified at multiple levels and all of those levels are connected. This means that the teleconference usually includes the development team in charge of the application, the DBA team in charge of the database, the network specialist team in charge of monitoring network operation, and various other teams, depending on the scope of the problem caused by the incident and the expertise required for analysis. After such an incident, it is extremely important to take all steps as soon as possible so that it doesn’t happen again.
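The three response patterns described above can be summarized in a small lookup table. This is only a sketch: the `RESPONSE_PLANS` table, the team labels and the `response_plan` function are assumptions for illustration, and the actual teams depend on the scope of each incident.

```python
# Hypothetical default response per priority, following the description above.
RESPONSE_PLANS = {
    "low": {"teams": ["owning team"], "teleconference": False},
    "medium": {"teams": ["application development", "DBA"], "teleconference": False},
    "high": {
        "teams": ["application development", "DBA", "network"],
        "teleconference": True,
    },
}


def response_plan(priority, extra_teams=()):
    """Return the teams to involve and whether a teleconference is mandated."""
    plan = RESPONSE_PLANS[priority.lower()]
    teams = list(plan["teams"])
    # Additional teams can be pulled in depending on the scope of the incident.
    for team in extra_teams:
        if team not in teams:
            teams.append(team)
    return {"teams": teams, "teleconference": plan["teleconference"]}


if __name__ == "__main__":
    print(response_plan("high", extra_teams=["external vendor"]))
```

Because an incident can be reclassified during the investigation, calling `response_plan` again with the new priority gives the updated list of teams.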

Teams can later create an isolated environment in which they try to reproduce the steps that originally caused the incident and develop a solution that will prevent the same or a similar situation from happening again. Such environments, and the time spent on them, are crucial for verifying the scenarios designed to prevent new incidents, because they let us test the vulnerability of a process in the organization or the safety of the entire organization.

Unfortunately, there are still organizations that analyze incidents and spend countless hours resolving them, but do not perform a supporting triage and do not ensure that such an incident doesn’t happen again.

The factor that can be influenced the least when it comes to the repetition of an incident is the human one. We could have a clearly defined procedure of 20 steps for carrying out one process. But if one or more people are in charge of carrying out some of those steps, the human factor cannot be excluded as a potential source of error.
