Perhaps nothing is more essential to the execution of IT tasks than troubleshooting. You can gather requirements, design, plan, and optimize until the cows come home, but at some point something is going to go unexpectedly wrong. That's when the process of troubleshooting comes to the forefront.
Most engineers think they are good troubleshooters, applying experience, intuition, and often brute force to beat any problem into submission. But faster and less painful solutions can be found by applying a structured, intentional approach:
Gather Good Information
Find the Right Problem to Solve
Validate Your Solution
GATHER GOOD INFORMATION
You can't troubleshoot something you don't understand. This starts with RTFM (Read the Fine Manual). Well, that's the G-rated version of the acronym, but the advice still stands. Gather as much information about the system you are troubleshooting as you can. If there is a manual, read it. All of it.
Understanding the system means that when you fix something you'll be less likely to break other things. Know how the system interacts with other systems and figure out what "normal" operation is. You can't recognize abnormal behavior if you don't know what normal is like.
Understand your toolkit. Be completely familiar with the software and hardware tools at your disposal so you aren't trying to learn their use while diagnosing a system problem. In other words, be prepared.
Interview those affected by the failure. Find out the last time the system operated properly. What changed since then that might have affected the system? Be sure to look for seemingly unrelated events.
Are you getting an error message? Do a Google search on the exact text of the error message and see what others are reporting about its causes and potential solutions.
REPRODUCE THE FAILURE
One of the first steps in troubleshooting is to try to reproduce the reported failure. In most cases, the failure will be reported to you secondhand, and the information may be inaccurate or misleading. There are several reasons to reproduce the failure:
First Hand Evidence - Reproduce the failure so you can see it happen. Extra points if you can make it fail at will.
Indications of Cause - Knowing the conditions under which the failure occurs will provide great insight into the possible causes.
So You Know if You Fixed It - The only way to validate that you've actually fixed the problem is to execute the steps that produce the failure and NOT see the failure.
Write down the steps you take to create the failure, then follow those steps and make it fail again. When in doubt, start at the beginning: reboot the system and start from a clean testing condition, but try to find conditions that lead to a reproducible failure.
INTERMITTENT FAILURES
Many tough problems are intermittent. You may have seen the failure once but your attempts to reproduce the failure don't have the same starting conditions, inputs, events, or outside influences. Many times we cannot control all of the influencing factors in the system.
So, what do you do? Start by trying to catalog all of the potential conditions affecting the system you are troubleshooting. Write them all down. Control and vary those conditions one at a time to get the problem to behave differently. Hopefully one of those changes will cause the problem to occur with different frequency, intensity, or outcome and that will suggest an additional avenue of investigation.
What if it is STILL intermittent? Capture more information when the failure occurs and gather data from as many failures as possible. Analyze the data for common characteristics and conditions. And don't assume the problem is fixed just because you haven't seen the failure in the last 20 tests. If you didn't fix it, then it isn't really fixed. Few problems are self-correcting.
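To see why a run of clean tests proves so little, here is a quick arithmetic sketch. It assumes each test is independent and fails with a fixed probability, which real systems rarely guarantee. The function names and the 5% failure rate are illustrative, not taken from any particular tool.

```python
import math


def chance_of_clean_run(failure_rate: float, runs: int) -> float:
    """Probability of seeing zero failures in `runs` independent tests
    when the problem is still present and fails at `failure_rate`."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest number of consecutive clean runs after which an unfixed
    problem would have shown itself with at least `confidence` probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # Assumed example: a problem that shows up in 5% of tests.
    print(f"Chance of 20 clean runs anyway: {chance_of_clean_run(0.05, 20):.2f}")
    print(f"Clean runs needed for 95% confidence: {runs_needed(0.05, 0.95)}")
```

Under these assumptions, a problem that fails 5% of the time will still pass 20 tests in a row about 36% of the time, and it takes 59 consecutive clean runs before you can be 95% confident it would have appeared.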
OBSERVE OBJECTIVELY
Most engineers jump to conclusions about the cause of a problem prematurely. Make sure you really look at the behavior of the system. Stop thinking and just observe the system in a completely objective, dispassionate, cold, robotic manner. Typically we get reports of the result of the failure but not the details of the failure itself, so try to observe the failure occurring in detail. Apply instrumentation to the system to gather more information about the conditions and behavior of the failure. Enable system notifications and logging, but be aware that the act of observing a system can alter its behavior (the observer effect, often loosely attributed to the Heisenberg uncertainty principle).
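As one way to apply instrumentation in software, the sketch below wraps a function so that every call's inputs, outcome, and duration are logged. It uses Python's standard logging module. The decorator name observed, the logger name, and the parse_port function are hypothetical and exist only for illustration. Note that the logging itself adds a little time to each call, which is a small instance of observation altering behavior.

```python
import functools
import logging
import time

logger = logging.getLogger("troubleshooting")  # hypothetical logger name


def observed(func):
    """Log each call's arguments, outcome, and duration.

    The wrapped function's return value and exceptions pass through unchanged.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record the failure with its traceback, then re-raise it.
            logger.exception("%s args=%r kwargs=%r raised",
                             func.__name__, args, kwargs)
            raise
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.debug("%s args=%r kwargs=%r -> %r (%.1f ms)",
                     func.__name__, args, kwargs, result, elapsed_ms)
        return result
    return wrapper


@observed
def parse_port(text: str) -> int:
    """Hypothetical function under observation."""
    return int(text)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    parse_port("8080")
```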
FIND THE RIGHT PROBLEM TO SOLVE
Solutions are frequently obvious. It's finding the right problem to solve that's the hard part. If your initial problem domain is the entire system you are troubleshooting, find a way to cut the system in half. Observe the behavior in each half of the system. If the problem occurs in one half of the system, cut that part in half again and repeat until you've narrowed the scope of investigation as far as possible.
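The halving strategy above is a binary search. The sketch below is a minimal version, assuming the system can be viewed as an ordered chain of stages and that once the data goes bad it stays bad downstream. The stage names and the is_good_through check are hypothetical stand-ins for whatever probe you can actually run.

```python
from typing import Callable, Sequence


def find_first_bad(stages: Sequence[str],
                   is_good_through: Callable[[int], bool]) -> str:
    """Return the first stage whose output is bad.

    is_good_through(i) must return True when the output observed after
    stages[0..i] is still correct. Assumes the final output is bad, so a
    faulty stage exists somewhere in the chain.
    """
    lo, hi = 0, len(stages) - 1  # the fault lies somewhere in stages[lo..hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good_through(mid):
            lo = mid + 1  # still good at mid, so the fault is downstream
        else:
            hi = mid      # already bad at mid, so the fault is at mid or upstream
    return stages[lo]


if __name__ == "__main__":
    chain = ["client", "load balancer", "web server", "app server", "database"]
    # Hypothetical probe: output is good until it reaches index 3.
    print(find_first_bad(chain, lambda i: i < 3))
```

With five stages, this narrows the fault to a single stage in two or three probes rather than checking each stage in turn, and the savings grow as the system gets larger.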
CHANGE ONE THING AT A TIME
If you change multiple things at once and the problem goes away, you'll never know which change was the one that fixed the problem. The same thing applies to the tests you are using. Since the tests or instrumentation you use can affect the problem, change one test at a time. When in doubt, apply the same tests to a known good system and compare the data.
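A minimal sketch of this discipline in code: start from a baseline configuration, change exactly one setting per trial, and note which single changes alter the outcome. The setting names and the fails check are hypothetical; in practice the check would run your reproduction steps against the real system.

```python
from typing import Callable, Dict, List

Config = Dict[str, object]


def vary_one_at_a_time(baseline: Config,
                       alternatives: Config,
                       fails: Callable[[Config], bool]) -> List[str]:
    """Return the settings whose change, on its own, flips the test result."""
    baseline_result = fails(baseline)
    suspects = []
    for name, value in alternatives.items():
        trial = dict(baseline)  # fresh copy so only one setting differs
        trial[name] = value
        if fails(trial) != baseline_result:
            suspects.append(name)
    return suspects


if __name__ == "__main__":
    baseline = {"cache": "on", "tls": "1.3", "workers": 8}
    alternatives = {"cache": "off", "tls": "1.2", "workers": 1}
    # Hypothetical failure condition: the problem needs more than one worker.
    print(vary_one_at_a_time(baseline, alternatives,
                             lambda cfg: cfg["workers"] > 1))
```

Because every trial starts again from the same baseline, any change in outcome can be attributed to the one setting that differed.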
KEEP A LOG
Write down what you did, when you did it and in what order, and what happened. Be detailed! You may need to refer back to your log for additional insight and to have data to correlate with other systems or observations. Don't trust your memory!
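If a paper notebook isn't handy, even a tiny helper can keep the log honest. The sketch below appends one timestamped, tab-separated line per action. The function name log_step and the three-column layout are assumptions, not a standard format.

```python
import io
from datetime import datetime, timezone
from typing import TextIO


def log_step(out: TextIO, action: str, result: str) -> None:
    """Append one line: UTC timestamp, what was done, and what happened."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    out.write(f"{stamp}\t{action}\t{result}\n")


if __name__ == "__main__":
    # An in-memory buffer for demonstration; a real log would be a file
    # opened in append mode so that entries stay in the order they happened.
    log = io.StringIO()
    log_step(log, "restarted web service", "error persists")
    log_step(log, "cleared cache", "error gone on first request")
    print(log.getvalue(), end="")
```

Recording timestamps in UTC makes it easier to correlate your notes with logs from other systems.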
QUESTION YOUR ASSUMPTIONS
First, you need to know what assumptions you are making. This is easier said than done! Stop and step back from the problem and try to take the place of an external observer. Have you made assumptions about the situation or system behavior without realizing it? Think divergently about all the implied assumptions that may have been made. Question each of these assumptions. Is there a test that can be performed to confirm or deny the assumption? Have you assumed that your test or tool is accurate and working properly? Can you validate your test or tool to be sure it is providing valid information?
A FRESH PERSPECTIVE
It's easy to get so dug into a problem that it is impossible to see the forest for the trees. Ask for help. Get another set of eyes to lend a fresh insight. When asking for help, report the symptoms and observations, not your theories. Be receptive to the input of others.
VALIDATE YOUR SOLUTION
If you didn't fix it, it's still broken. Don't assume that your action fixed the problem. Prove it! If you have a sequence of steps that reliably reproduces the failure, repeat those steps and validate that the problem does not occur. If you are unsure whether your fix really did address the issue, remove the fix and make the problem occur again. Then put your fix back in place and verify that the problem does not occur. If you can make the problem occur, and not occur, at will, you've clearly found the issue and a fix.
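The remove-and-reapply check can be sketched as a small loop. It assumes you have a reliable reproduction that can be run with the fix either applied or removed. The reproduce callable and the default of three rounds are hypothetical choices for illustration.

```python
from typing import Callable


def fix_is_validated(reproduce: Callable[[bool], bool], rounds: int = 3) -> bool:
    """Check that the failure follows the fix.

    reproduce(fix_applied) runs the reproduction steps and returns True if
    the failure occurred. The fix counts as validated only if the failure
    appears every time without the fix and never with it.
    """
    for _ in range(rounds):
        if not reproduce(False):  # fix removed: the failure must come back
            return False
        if reproduce(True):       # fix applied: the failure must not occur
            return False
    return True


if __name__ == "__main__":
    # Hypothetical system that fails exactly when the fix is absent.
    print(fix_is_validated(lambda fix_applied: not fix_applied))
```

A False result covers both disappointing cases: the fix did not stop the failure, or the failure could not be reproduced even without the fix, which means the test proves nothing either way.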
Be sure you fixed the cause of the problem and are not just masking the result. Remember, problems never just go away by themselves. You need to be sure you really did fix it.
THE BOTTOM LINE
We frequently rely on our experience and intuition when troubleshooting, but applying this structured approach can yield better-quality solutions in less time and with fewer unwanted side effects.