Problem Management: Most Problems Don’t Have a Single Root Cause

July 30th, 2019

If your problem management just identifies one root cause for each problem, you’re missing out on lots of opportunities to improve.

One of my clients had a problem that caused lots of incidents. They investigated the problem and identified the root cause. It was, they agreed, a bug in a software application that had been developed in-house. They fixed the bug and closed the problem.

This is typical of what I see happening in problem management, and superficially it looks OK. But actually, it isn’t good enough, because you need to ask a lot of awkward questions if you want to understand what really happened.

Thorough investigations demand answers to awkward questions

A thorough investigation of this problem would have asked lots of awkward questions like:

Why was the software incorrect in the first place?
- Was the specification correct?
- Was this a simple coding error?
- Was there a failure to understand the business process?
- Did the developer correctly understand how the user would interact with the software?
- Did the software developer have all the information they needed?
- Did the software developer have the correct skills and experience?
Why was the software bug not detected before deployment?
- Did a peer review take place? Was the peer reviewer competent and experienced enough? Did they have the information they needed?
- Was testing performed? Was the test environment fit for purpose? Was the testing scope sufficient?
Why did it take so long to identify that there was a problem after the first incident(s)?
- Did the service desk agents have the skills and knowledge they needed to identify problems?
- Did the service desk tool have the capability needed to assist in identifying multiple related incidents?
- Was there a process for reviewing incidents to ensure that problems were identified? Was this process effective?
Why did it take so long to diagnose the problem after it had been identified?
- Were people with the right skills and knowledge available to diagnose the problem?
- Was all the information needed to diagnose the problem available when and where it was needed?
- Was the problem prioritized appropriately relative to other work?
Why did the problem cause so much business pain after it had been identified?
- Was a satisfactory workaround documented?
- Were future incidents needing the same workaround quickly and consistently identified?
- Was the workaround implemented quickly and effectively after each incident?
- Did someone review the workaround after it had been used to identify how it could be improved?
Why did it take so long to deliver a software fix into production?
- Were appropriate resources available to develop and test the solution?
- Was this prioritized appropriately relative to other work?
- Did the process for testing and deploying the solution introduce unnecessary delays?

How many causes does one problem have?

The questions I listed above are typical of what you need to ask if you want to understand:

Why the problem happened
Why the problem had such an impact
How to reduce the likelihood of similar issues causing problems in the future

The fact is that almost any problem has many causes. Some of them may be technology-related (like a software bug, or a faulty laptop), and ‘information and technology’ is indeed one of the dimensions of service management identified by ITIL 4. But there are three more dimensions of service management you need to consider if you want to thoroughly investigate the causes of a problem, because causes may also be related to ‘organizations and people’ (skills, competence, knowledge), ‘value streams and processes’ (development, testing, incident management) or even ‘partners and suppliers’ (contracts, relationships).

The first step towards improving is knowing what to improve

When you think about solving problems by uncovering the “root cause”, it’s likely that you’ll identify a technology-related issue, fix it, and stop there. If you take this approach, the chances are that you won’t notice the other things that didn’t work as well as they could have. As a result, you’ll miss many opportunities to improve and to reduce the number and impact of problems you see in the future.

What’s worse is that if you don’t take the time to identify your own weaknesses, you can all too easily find yourself caught up in an endless and unnecessary cycle of fixing one “root cause” after another.

Keep an improvement register, and make sure that every problem investigation is done thoroughly, taking into consideration all the dimensions of service management. When you take this approach, every investigation will throw up many improvement opportunities that you can identify and log. Whether or not to invest the resources needed to address them is a decision you can take later, but once they’ve been identified and prioritized, at least you know what they are and have thought about the damage they might do if left unaddressed.
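As a rough illustration of what such a register might look like, here is a minimal Python sketch. It is not taken from any particular ITSM tool: the class names, the `PRB-1` identifier, and the 1 to 5 priority scale are all assumptions made up for this example. Only the four dimension names come from ITIL 4. The sketch logs improvement opportunities against a problem and reports which dimensions the investigation has not yet touched.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The four dimensions of service management named in ITIL 4."""
    ORGANIZATIONS_AND_PEOPLE = "organizations and people"
    INFORMATION_AND_TECHNOLOGY = "information and technology"
    PARTNERS_AND_SUPPLIERS = "partners and suppliers"
    VALUE_STREAMS_AND_PROCESSES = "value streams and processes"


@dataclass
class Improvement:
    """One improvement opportunity logged during a problem investigation."""
    problem_id: str        # hypothetical problem record identifier
    description: str
    dimension: Dimension
    priority: int = 3      # assumed scale: 1 = highest, 5 = lowest


class ImprovementRegister:
    """A hypothetical in-memory register; a real one would live in your ITSM tool."""

    def __init__(self):
        self.entries = []

    def log(self, improvement):
        self.entries.append(improvement)

    def by_priority(self):
        # Highest priority (lowest number) first
        return sorted(self.entries, key=lambda e: e.priority)

    def missing_dimensions(self, problem_id):
        # Dimensions with no logged improvement for this problem
        covered = {e.dimension for e in self.entries if e.problem_id == problem_id}
        return [d for d in Dimension if d not in covered]


register = ImprovementRegister()
register.log(Improvement("PRB-1", "Fix the bug in the in-house application",
                         Dimension.INFORMATION_AND_TECHNOLOGY, priority=1))
register.log(Improvement("PRB-1", "Add a peer review step before deployment",
                         Dimension.VALUE_STREAMS_AND_PROCESSES, priority=2))
register.log(Improvement("PRB-1", "Train service desk agents to spot related incidents",
                         Dimension.ORGANIZATIONS_AND_PEOPLE, priority=2))

for dimension in register.missing_dimensions("PRB-1"):
    print("Not yet considered:", dimension.value)
```

In this made-up example the investigation has logged nothing under ‘partners and suppliers’, which is a prompt to ask whether a contract or supplier relationship played a part, not proof that one did.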

If you need more information about improvement registers and continual improvement, here are some blogs and papers I have written:

How can you identify causes of problems?

There are lots of different techniques for identifying things that cause problems. I’ve written a blog titled 7 Ways to Diagnose IT Incidents and Problems that describes some of the more popular ones. Here’s a quick summary if you don’t have time to read the blog.

The Richard Feynman Approach: Write down the problem; think very hard; write down the answer
Timeline Analysis: List everything that happened, in time order; look for patterns
Kepner-Tregoe Problem Solving: Document the problem in terms of What, Where, When and Extent; identify what is not failing as well as what is; list differences and changes; identify possible causes; verify the true cause
Ishikawa or Fishbone diagrams: Draw a diagram showing all the possible contributory causes and the links between them
Knowledge-Centred Support: Capture and manage information as part of routine incident handling
Swarming: Collaborate rather than escalating
Standard+Case: Distinguish between routine work and more complex situations

Two more techniques that I also find helpful are:

Ask a friend: Don’t just work by yourself; chat to your colleagues and draw on their knowledge and expertise. You could call this “informal swarming”
5 whys: This technique from Lean encourages you to repeatedly ask Why, rather than accepting the obvious root cause

Of course, you don’t have to pick just one of these. You can use a combination of them as appropriate to your situation.
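To make one of these concrete, timeline analysis can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event data is invented, the tuple layout is hypothetical, and the 48-hour window is an arbitrary choice. The sketch puts incidents and changes into a single time-ordered list, then looks for one simple pattern: incidents that started shortly after a change.

```python
from datetime import datetime, timedelta

# Invented sample data: (timestamp, kind, description)
events = [
    ("2019-07-01 09:15", "incident", "Users cannot save orders"),
    ("2019-07-01 08:00", "change", "Deployed order application v2.1"),
    ("2019-07-02 10:30", "incident", "Users cannot save orders"),
    ("2019-06-28 14:00", "incident", "Printer offline"),
]


def build_timeline(raw_events):
    """List everything that happened, in time order."""
    parsed = [
        (datetime.strptime(timestamp, "%Y-%m-%d %H:%M"), kind, text)
        for timestamp, kind, text in raw_events
    ]
    return sorted(parsed, key=lambda event: event[0])


def incidents_after_changes(timeline, window=timedelta(hours=48)):
    """Pair each incident with any change made in the preceding window."""
    changes = [event for event in timeline if event[1] == "change"]
    pairs = []
    for when, kind, text in timeline:
        if kind != "incident":
            continue
        for changed_at, _, change_text in changes:
            if timedelta(0) <= when - changed_at <= window:
                pairs.append((change_text, text))
    return pairs


timeline = build_timeline(events)
for change_text, incident_text in incidents_after_changes(timeline):
    print(f"'{incident_text}' followed '{change_text}'")
```

A match here is a lead to follow up, not a confirmed cause. The point of the timeline is to show you where to start asking the awkward questions.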

Summary

If your problem management just identifies one root cause for each problem, you’re missing out on lots of opportunities to improve. You can use the four dimensions of service management described in ITIL 4 to help ensure your investigations cover all aspects of service management: organizations and people, partners and suppliers, value streams and processes, and information and technology. This should, in turn, help ensure that your problem investigations don’t just fix a technical “root cause” but also contribute to a culture of continual improvement by uncovering issues to be added to your continual improvement register.

About the Author

Stuart Rance

Stuart is an ITSM and security consultant, trainer, and author who has worked with clients in many countries, helping them create business value for themselves and their customers. He was the author of the 2011 edition of ITIL® Service Transition and lead author of RESILIA™ Cyber Resilience best practice published in June 2015. Now that his children have all left home, he has plenty of time on his hands for contributing to our blog – lucky us!
