
How to Manage Your IT Problems

October 5th, 2016



IT organizations often spend huge amounts of time, money, and other resources on managing incidents, but they spend surprisingly little on problem management work that might reduce the number of incidents in the first place. This is often due to poor understanding of the difference between incidents and problems, and insufficient knowledge or understanding of how to manage problems.

Why Do We Distinguish Between Incidents and Problems?

Many people confuse incidents and problems, so let’s start by making the distinction clear:

An incident is an unplanned interruption to an IT service, or a reduction in quality of a service. Incidents have an impact on users, and on the business, and the purpose of incident management is to restore normal service.
A problem is the cause of one or more incidents. The purpose of problem management is to manage the problem, eliminating future incidents where possible, and reducing the impact of incidents that can’t be prevented.

So, incident management helps you get the business working again, while problem management helps you prevent future incidents, or at least make them less painful when they do happen.

In the bad old days, when IT was a very technically focused function, most IT teams didn’t distinguish incidents from problems. If something broke, somebody would work on it until it was mended. IT technicians paid little attention to the business impact of whatever they were working on, and the customer just had to wait until it was fixed. But when we learned to distinguish between incidents and problems, this changed. Organizations could set up incident management to focus on doing whatever is needed to get the business working again, leaving problem management to deal with any underlying technical issues.

Take, for example, a printer that isn’t working. There is no need for the customer to wait until the printer is mended. When we practice incident management we can simply help the customer route their printout to a different printer. Obviously, this doesn’t get the printer fixed, but that probably doesn’t concern the customer. We can of course go on to repair the faulty printer, but this is now a problem management activity, with lower priority because the outcome has very little direct impact on the customer.

How Do You Identify Problems?

Before you can start managing your problems, you need to identify them. Here are some common ways that organizations identify problems:

Trend analysis of incident records. This activity is sometimes known as “proactive problem management” because its purpose is to help you identify problems that you don’t already know about.
Major incidents. Every major incident should result in a problem being logged, to ensure you take steps to prevent the same thing from happening again.
Multiple incidents at the service desk. Service desk agents should notice if there are multiple similar incidents being logged and log a single problem to ensure that these are investigated and the cause addressed.

These are some of the ways that problems can be identified, and you should make sure that you take full advantage of these, but also think about all the different ways that a problem could be identified in your organization. For example, think about your suppliers, software developers, or infrastructure teams, and make sure that you actively capture all of their input and use it to log problems.
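Here is a minimal sketch of what trend analysis of incident records might look like in practice. It counts incidents that share the same service and symptom and flags any group that reaches a threshold as a candidate problem. The record fields (`id`, `service`, `symptom`), the sample data, and the threshold of 3 are all assumptions made for illustration; a real service desk tool will have its own export format and categories.

```python
from collections import Counter

# Hypothetical incident records; field names and values are illustrative only.
incidents = [
    {"id": "INC-1", "service": "printing", "symptom": "job stuck in queue"},
    {"id": "INC-2", "service": "email", "symptom": "cannot send attachment"},
    {"id": "INC-3", "service": "printing", "symptom": "job stuck in queue"},
    {"id": "INC-4", "service": "vpn", "symptom": "connection drops"},
    {"id": "INC-5", "service": "printing", "symptom": "job stuck in queue"},
    {"id": "INC-6", "service": "vpn", "symptom": "connection drops"},
]


def find_problem_candidates(records, threshold=3):
    """Return ((service, symptom), count) pairs seen at least `threshold` times."""
    counts = Counter((r["service"], r["symptom"]) for r in records)
    # most_common() lists the most frequent groups first.
    return [(key, n) for key, n in counts.most_common() if n >= threshold]


for (service, symptom), n in find_problem_candidates(incidents):
    print(f"Consider logging a problem: {n} incidents for {service} ({symptom})")
```

A simple count like this will not replace the judgment of an experienced service desk agent, but it can help surface patterns that nobody has noticed yet.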

How Should You Manage Problems?

Problem management has two objectives:

To prevent incidents from happening
To reduce the impact of incidents that cannot be prevented

Many organizations only think about the first of these objectives. They do root cause analysis to understand the problem, and then take steps to rectify whatever was causing it. This may take some time, and while a complete technical solution may prove very effective once it is in place, the business is likely to continue suffering in the meantime.

The best organizations that I work with start by thinking about how to reduce the impact of incidents, the second of our two objectives. They ask themselves, “What should we do if this happens again right now?” This might be a difficult question for technical support staff to answer if they don’t fully understand the problem, but it’s much better than just leaving service desk agents to flounder when the same thing does happen again. Organizations with really effective problem management put a well-documented workaround in place as quickly as they can. They review the workaround every time it’s used to identify possible improvements, and they go on to improve it as they learn more about the cause.

So one benefit of thinking about reducing the impact of incidents, before you start analyzing their root causes, is that you reduce any business impact much faster.

But there’s a second benefit – sometimes it turns out that a workaround is so effective that there is actually no need to understand the root cause or fix it. Here’s a perfect example:

One of my friends had a gas leak under the concrete floor in his kitchen. He turned off the gas and called out the emergency gas fitter. The gas fitter ran a pipe around the outside of the kitchen so that the gas cooker could be reconnected, and arranged for a structural engineer to diagnose the leak properly, so they would know where to dig up the kitchen floor. The structural engineer called a week later and said it would cost a huge amount of money to dig up the concrete and fix the pipe, but fortunately this was unnecessary because the new gas pipe was perfectly safe and did the job.

This example doesn’t involve IT, but I’ve used it anyway because it is a particularly good illustration of the fact that a safe and effective workaround is often enough. Why waste good money fixing an application when you have a workaround to the problem and it no longer has an impact on the business?

Of course, if your workaround is not sufficient then you do need to investigate the root cause of the problem. There are many different techniques you can use for this. My favourite approaches are timeline analysis and Kepner-Tregoe problem solving.

Timeline Analysis

This is such an easy way to investigate a problem that it barely deserves a name. You simply list everything that happened in time order and then look for patterns. What is important is that you get all the data from multiple sources and then sort it by date and time, regardless of where it came from. So your timeline may have entries from system logs, emails, service desk records, and many other sources. This simple approach is surprisingly effective at building a complete picture of what’s been going on.
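As a rough sketch of the idea, the snippet below merges entries from three sources into a single list and sorts them by date and time, regardless of where they came from. The sources, timestamps, and event descriptions are invented for illustration, and the sketch assumes every source has already been converted to the same timestamp format and time zone, which in practice is often the hardest part.

```python
from datetime import datetime

# Hypothetical entries as (timestamp, source, description); all values are made up.
system_log = [
    ("2016-10-03 09:14:05", "system log", "Disk usage on db01 reaches 95%"),
    ("2016-10-03 09:31:40", "system log", "Database service restarted"),
]
service_desk = [
    ("2016-10-03 09:20:12", "service desk", "User reports slow order entry"),
    ("2016-10-03 09:27:48", "service desk", "Second user reports timeouts"),
]
emails = [
    ("2016-10-03 09:02:30", "email", "Backup team announces extra backup run"),
]


def build_timeline(*sources):
    """Merge entries from every source and sort them by date and time."""
    events = [
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), source, text)
        for entries in sources
        for ts, source, text in entries
    ]
    return sorted(events, key=lambda event: event[0])


timeline = build_timeline(system_log, service_desk, emails)
for when, source, text in timeline:
    print(f"{when:%Y-%m-%d %H:%M:%S}  [{source}]  {text}")
```

Reading the merged list from top to bottom puts the email about the extra backup run ahead of the disk usage warning and the user reports, which is the kind of pattern that is easy to miss when each source is reviewed on its own.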

Kepner-Tregoe Problem Solving

I have to declare an interest here: I used to teach this proprietary approach to problem solving, but I do think that it is incredibly effective. This is a very structured approach to problem solving, where you define the problem across a number of different dimensions (what, where, when, extent) and you also bound the problem by identifying what is NOT failing. You can then review the distinctions between these to identify possible causes.
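The snippet below is only a loose illustration of the idea of recording what is failing alongside what is NOT failing for each dimension; it is not the official Kepner-Tregoe method or worksheet. The example problem, the `is` / `is_not` field names, and the wording of the generated questions are all assumptions made for this sketch.

```python
# Hypothetical problem specification; every value here is invented for illustration.
specification = {
    "what": {
        "is": "Order entry application times out",
        "is_not": "Reporting application, which uses the same database",
    },
    "where": {
        "is": "London office",
        "is_not": "Manchester office",
    },
    "when": {
        "is": "Weekday mornings",
        "is_not": "Afternoons or weekends",
    },
    "extent": {
        "is": "About 40 users",
        "is_not": "All 200 users of the application",
    },
}


def list_distinctions(spec):
    """Turn each IS / IS NOT pair into a question about what differs."""
    questions = []
    for dimension, answers in spec.items():
        questions.append(
            f"{dimension.upper()}: what is different about "
            f"'{answers['is']}' compared with '{answers['is_not']}'?"
        )
    return questions


for question in list_distinctions(specification):
    print(question)
```

Writing the two columns down side by side makes it harder to skip the "is not" half, which is where the useful distinctions tend to come from.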

Summary

Problem management presents a great opportunity to reduce both the number and the impact of IT incidents on your customers. If you are not already doing problem management you should certainly be thinking about introducing it. And once you do, I think you may find my guide to problem management metrics worth a read.

If you are already doing problem management, then make sure that you have the balance between devising workarounds and investigating root causes right. You don’t always have to understand the cause to resolve the problem, but you do always need to put a workaround in place as quickly as you can. You may even want to think about how to integrate incident and problem management, but whatever approach you take you should make sure that you focus on value. Think first and last about what will be best for your customers and users.

About the Author

Stuart Rance

Stuart is an ITSM and security consultant, trainer, and author who has worked with clients in many countries, helping them create business value for themselves and their customers. He was the author of the 2011 edition of ITIL® Service Transition and lead author of RESILIA™ Cyber Resilience best practice published in June 2015. Now that his children have all left home, he has plenty of time on his hands for contributing to our blog – lucky us!
