To ensure stability of IT services, organizations need to become best-in-class at incident management. Even at leading organizations, incidents can and do occur with alarming regularity. User mistakes, IT system failures, end-of-life issues, changes and configurations, and other issues can cause incidents that often cascade down, impacting more systems and business processes.
Most organizations have a dedicated IT service management function, rely on integrated tools, and measure how swiftly and effectively teams can identify and resolve issues. Some also use methodologies such as ITIL 4 to implement standardized processes and improve service quality.
A survey by Freshservice, an IT service management vendor, found that teams resolve 94 percent of issues on average, but only 70 percent during the first interaction. (See definition below.) For example, many common desktop issues, such as resetting passwords or reestablishing network connections, can easily be resolved by an ITSM team member working with an end-user over the phone or via email or chat.
Other issues, such as one or more application failures, may not be solvable on the first contact. Teams may need to perform investigatory work to determine root causes of issues. By doing so, they can improve processes and prevent issues from recurring.
When incident management is done right, it minimizes the impact of IT issues on employees, customers, and business operations. Fast issue resolution protects critical business processes, minimizes data loss, and helps prevent security breaches.
Common KPIs to Use for Incident Management
Organizations use key performance indicators (KPIs) to evaluate the success of IT incident management processes. These KPIs include:
- Mean time to resolution (MTTR) – The average time it takes to detect, diagnose, and recover from an issue.
- Mean time to acknowledge (MTTA) – The average time from when an issue is detected (such as via an alert or end-user) to when an ITSM team member acknowledges receipt of the issue and begins work on it.
- First contact resolution rate – The percentage of issues that teams resolve during the initial interaction with an end-user, eliminating the need for further follow-up.
- Escalation rates – The percentage of support tickets that have to be escalated to a new support tier, such as senior team members or specialists.
- Mean time between failures (MTBF) – The average time that elapses between one repairable failure of a system and the next.
- Number of repeat incidents – The number of issues that recur. This typically indicates that ITSM teams may have developed workarounds, but haven’t solved root causes of these issues.
- Incident reopen rates – The percentage of tickets that have been solved and closed out that are reopened by users. This can be due to unresolved issues or additional customer comments on a closed issue.
- Incident backlog – A list of issues that haven’t yet been addressed. Incidents are addressed by priority, meaning that low-impact issues may linger.
- Percentage of major incidents – The percentage of issues that are rated either Sev 1 (critical incidents with a high impact) or Sev 2 (major incidents with a significant impact). Sev is short for severity.
- Cost per ticket – The average cost to resolve incidents. This figure is calculated by taking total costs, such as labor, technology, training, and other related costs, and dividing them by the total number of incidents.
- System uptime – The amount of time a computing resource or application is fully operational.
- End-user satisfaction rates – The level of satisfaction end-users have with ITSM support.
- SLA compliance – How closely key metrics hew to predetermined service-level agreements (SLAs).
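Several of these KPIs are simple averages or ratios over ticket records. The sketch below is a minimal illustration, not the export format of any particular ITSM tool: the `Ticket` fields, the sample data, and the cost figure are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Ticket:
    # Hypothetical fields; real ITSM tools name and store these differently.
    detected_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime
    contacts: int = 1        # interactions with the end-user before resolution
    escalated: bool = False
    reopened: bool = False


def mean_minutes(deltas):
    """Average a collection of timedeltas and return the result in minutes."""
    deltas = list(deltas)
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


def incident_kpis(tickets, total_cost):
    """Compute a handful of the KPIs listed above from resolved tickets."""
    n = len(tickets)
    return {
        "mtta_min": mean_minutes(t.acknowledged_at - t.detected_at for t in tickets),
        "mttr_min": mean_minutes(t.resolved_at - t.detected_at for t in tickets),
        "first_contact_rate": sum(t.contacts == 1 for t in tickets) / n,
        "escalation_rate": sum(t.escalated for t in tickets) / n,
        "reopen_rate": sum(t.reopened for t in tickets) / n,
        "cost_per_ticket": total_cost / n,  # total costs divided by ticket count
    }


# Made-up sample data, for illustration only.
t0 = datetime(2024, 1, 1, 9, 0)
m = lambda minutes: t0 + timedelta(minutes=minutes)
tickets = [
    Ticket(t0, m(5), m(30)),
    Ticket(t0, m(15), m(90), contacts=3, escalated=True),
    Ticket(t0, m(10), m(60), reopened=True),
    Ticket(t0, m(10), m(60), contacts=2),
]
kpis = incident_kpis(tickets, total_cost=200.0)
print(kpis)
```

With this sample data, MTTA works out to 10 minutes, MTTR to 60 minutes, the first contact resolution rate to 50 percent, and cost per ticket to 50. In practice, teams would typically segment these figures by severity and time period rather than report a single overall average.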
Common Tools Used to Handle Incidents
IT service teams use a variety of IT incident management tools to streamline their work, since every minute counts. Here are a few of the products teams use to receive notice of issues, identify problems, understand their priority, and resolve them:
- Performance management tools monitor the health and availability of systems. They also send alarms when a system is degraded or down. There are performance management tools that monitor and manage the network, others that focus on servers and hypervisors, and some that monitor applications and performance.
- Configuration management databases (CMDBs) and IT asset management (ITAM) solutions provide up-to-date, detailed information on software, hardware, and virtualized assets, whether they are on-premises or in the cloud. A CMDB also provides dependency mapping tools, showing the links between computing resources and the applications they support. As a result, ITSM teams can more easily determine the impact of a problem on related systems and business processes.
- IT service management (ITSM) tools collect, sort, and assign incident alarms to IT teams. They also centralize all incident data and provide diagnostic, troubleshooting, and issue resolution tools. Platforms like Jira, Zendesk, Freshworks, and ServiceNow are examples of these ITSM tools.
- Root cause analysis (RCA) tools have one job: to quickly and automatically suppress downstream alarms so that teams can identify the root cause of a problem. In most cases, this is a mathematically challenging problem to solve, which is why no product has perfected it. Typically, ITSM teams need to perform additional diagnostics to identify the root issue.
- Other tools such as chatbots and virtual agents help teams gather information, with prompts guiding affected users through the process of providing all relevant information. Chatbots can also offer links to knowledge bases, enabling end-users to try and resolve their own issues, allowing teams to focus on higher-level problems.
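To make the idea of downstream alarm suppression concrete, here is a deliberately simple heuristic, assuming a dependency map like the one a CMDB provides: an alarm is suppressed if anything the alarming resource depends on is also alarming. The resource names and the `DEPENDS_ON` map are hypothetical, and real RCA products use far more sophisticated methods.

```python
# Hypothetical dependency map: each resource lists what it directly depends on.
DEPENDS_ON = {
    "web-app": ["app-server"],
    "app-server": ["database", "core-switch"],
    "database": ["core-switch"],
    "core-switch": [],
}


def upstream_of(resource, depends_on):
    """Return every resource that `resource` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(resource, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen


def probable_root_causes(alarming, depends_on):
    """Keep alarms with no alarming upstream dependency; suppress the rest."""
    alarming = set(alarming)
    return sorted(
        r for r in alarming if not (upstream_of(r, depends_on) & alarming)
    )


# Four alarms fire at once, but only the switch has no alarming dependency.
print(probable_root_causes(
    ["web-app", "app-server", "database", "core-switch"], DEPENDS_ON
))  # ['core-switch']
```

This heuristic breaks down when the dependency map is incomplete or when two unrelated faults occur at the same time, which is part of why teams usually still need follow-up diagnostics.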
Steps to Resolving IT Incidents
IT service teams typically follow these recommended steps to address and resolve issues, which help them improve their MTTR metrics:
- Identifying the issue: The first step in managing an IT incident is to identify that it has occurred. IT service teams may detect issues proactively, by monitoring systems and spotting anomalies, or they may receive automated alerts. Alternatively, end-users may report incidents using standard channels, such as phone, email, or chat. Teams should look for performance management tools with built-in capabilities for detecting anomalies and outages.
- Responding to the issue: Once an incident has been identified, the IT service desk is on the clock to respond promptly and appropriately. Teams should build and follow standard processes that cover who gathers relevant information, when, and how; how to activate an incident response plan; which teams and stakeholders to coordinate with; and what steps to take to mitigate the impact of the incident.
IT service team members typically take turns responding to routine incidents. However, certain types of incidents may be automatically routed to experts who can resolve them faster. And incidents that are deemed to be Sev 1 or Sev 2 may have large teams working on them concurrently.
- Resolving the issue: Next, the assigned IT service team member needs to resolve the incident as quickly and efficiently as possible. This may involve troubleshooting and repairing the issue, implementing a workaround, or escalating the incident to higher levels of support as needed.
- Reviewing the issue after it’s resolved: After an incident has been resolved, it is important to conduct a thorough review to understand what happened, identify any underlying issues that may have contributed to the incident, and implement measures to prevent similar incidents from occurring in the future.
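The routing behavior described in the response step (routine incidents handled in rotation, certain categories sent to experts, and Sev 1 or Sev 2 incidents handed to a larger team) can be sketched as a small triage function. The team names, categories, and incident fields below are hypothetical, and real service desks encode these rules in their ITSM tool rather than in a script.

```python
from itertools import cycle


def triage(incidents, experts, rotation):
    """Order incidents by severity and pick an assignee for each one.

    incidents: dicts with hypothetical keys "id", "severity" (1 is most
               severe), and "category".
    experts:   maps a category to a specialist team.
    rotation:  team members who take turns on routine incidents.
    """
    turns = cycle(rotation)
    assignments = []
    # sorted() is stable, so incidents of equal severity keep arrival order.
    for incident in sorted(incidents, key=lambda i: i["severity"]):
        if incident["severity"] <= 2:
            assignee = "major-incident-team"  # Sev 1 and Sev 2 get a full team
        elif incident["category"] in experts:
            assignee = experts[incident["category"]]
        else:
            assignee = next(turns)
        assignments.append((incident["id"], assignee))
    return assignments


# Made-up sample data, for illustration only.
incidents = [
    {"id": 1, "severity": 4, "category": "desktop"},
    {"id": 2, "severity": 1, "category": "application"},
    {"id": 3, "severity": 3, "category": "network"},
    {"id": 4, "severity": 4, "category": "desktop"},
]
experts = {"network": "network-team", "database": "dba-team"}
assignments = triage(incidents, experts, ["alice", "bob", "carol"])
print(assignments)
# [(2, 'major-incident-team'), (3, 'network-team'), (1, 'alice'), (4, 'bob')]
```

Sorting by severity first means the Sev 1 incident is handled before the routine desktop tickets, even though it arrived second.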
Invest in the Right Tools and Processes to Empower Teams
IT service management teams are always seeking to improve processes. In addition to applying the ITIL 4 methodology, IT service team members should uplevel their skills with ongoing training, work to improve knowledge bases, and communicate and collaborate more effectively.
As organizations continue to digitize business models, processes, and services, IT systems are becoming ever more critical. Tools like Device42 help shave time off issue diagnosis, shortening the time to resolution.
By investing in continuous improvement, IT organizations can ensure that teams have the right IT incident management data, tools, best practices, and knowledge they need to resolve issues swiftly, reducing operational disruptions and protecting the business, customers, and revenues from harm.
Next-generation ITAM (IT Asset Management) and dependency mapping products like Device42 auto-discover all assets from the ground up. Device42 provides a living record of critical device information, such as names, locations, utilization, patches, updates, and links to financial records. Device42 also maps dependencies between devices and applications, enabling an IT service team to understand these critical relationships. This information proves invaluable when incidents strike. Team members are able to prioritize issues by impact, contact affected stakeholders, and more quickly zero in on the sources of problems.
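As a rough illustration of how dependency data supports impact assessment, the sketch below walks a small dependency map outward from a failed device to find the applications it supports and the stakeholders to contact. This is not Device42's API or data model: the `SUPPORTS` and `OWNERS` maps and every name in them are assumptions made for the example.

```python
from collections import deque

# Hypothetical records: which resources directly support which others.
SUPPORTS = {
    "storage-array-1": ["db-server-1"],
    "db-server-1": ["billing-app", "crm-app"],
    "web-server-1": ["crm-app"],
}
# Hypothetical application owners to contact when an application is affected.
OWNERS = {"billing-app": "finance-ops", "crm-app": "sales-ops"}


def impacted(failed, supports):
    """Breadth-first walk from the failed resource to everything it supports."""
    seen, queue = set(), deque([failed])
    while queue:
        for dependent in supports.get(queue.popleft(), []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def stakeholders_to_notify(failed, supports, owners):
    """Return the owners of every impacted resource that has a listed owner."""
    return sorted({owners[r] for r in impacted(failed, supports) if r in owners})


print(sorted(impacted("storage-array-1", SUPPORTS)))
# ['billing-app', 'crm-app', 'db-server-1']
print(stakeholders_to_notify("storage-array-1", SUPPORTS, OWNERS))
# ['finance-ops', 'sales-ops']
```

A storage failure that reaches two business applications would reasonably be prioritized above a web server failure that reaches only one, which is the kind of impact-based ranking the dependency data makes possible.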