Often I’m asked “How did you figure that out?” I hear it more often if the problem has plagued my team for a while, or if it is a problem that everyone has taken a swing at and struck out. It is also the first question I ask when someone else figures out the solution before I do. I suppose I take for granted that troubleshooting is something everyone just knows how to do. For some people I guess it’s second nature, while for others it can be daunting.
Here are a few troubleshooting tips from my own experience to help you figure out a problem:
What is the error?
The first thing I do when I’m sorting out an issue is simply to define the issue. What is the error code? What is the symptom? Did you see a SCSI phase error, or was it a “Warning:”? What did the screen say when it blue screened? Understanding and defining the issue gives you a starting place to troubleshoot. Something didn’t work and now you can name it.
What is supposed to happen?
This is another round of definition that belongs early in the troubleshooting cycle. If you don’t know what was supposed to happen, then how can you figure out where it went off the rails? There is some expectation of behavior. Thinking through that expectation allows you to then think through why the expectation wasn’t met. What part of that expectation failed first? Now there is a pointer toward something to fix.
Simplify the system and rebuild
Sometimes the best approach is to remove all of the other “it could be this” components and start from the simplest functional configuration. I spend most of my time troubleshooting computer hardware – all the flashy light stuff found in a data center. I’ll often start with just power supplies, system boards, a processor and the lowest functional memory configuration. This helps identify where the problem is. Start simple and add complexity back one piece at a time; at some point the system will break again. When it does, your cause will very likely be that last change. You’ve controlled it and now you can fix it.
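If it helps to see that build-up loop written down, here is a minimal Python sketch of it. It is not a real diagnostic tool: the function name `find_breaking_component`, the `works` check, and the component names are all hypothetical. In practice `works` is whatever test you would run by hand (does it boot, does the error come back).

```python
from typing import Callable, Iterable, List, Optional


def find_breaking_component(
    baseline: Iterable[str],
    extras: Iterable[str],
    works: Callable[[List[str]], bool],
) -> Optional[str]:
    """Add components one at a time; return the first one that breaks the system."""
    config = list(baseline)
    if not works(config):
        # The simplest configuration is already broken, so the fault is in it.
        raise ValueError("baseline configuration fails; simplify further")
    for component in extras:
        config.append(component)
        if not works(config):
            return component  # the last change is the likely cause
    return None  # everything was added back and nothing broke


# Hypothetical example: pretend the RAID controller is the faulty part.
def fake_check(config: List[str]) -> bool:
    return "raid_controller" not in config


baseline = ["power_supply", "system_board", "cpu0", "dimm0"]
extras = ["dimm1", "nic", "raid_controller", "gpu"]
culprit = find_breaking_component(baseline, extras, fake_check)
print(culprit)
```

The important design choice is that only one component is added between checks, so a failure points at exactly one change.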
Trial and error
Complex systems sometimes need what I’ll call the “poke it with a stick” approach. Take a systematic look at the problem and the environment. Then change one thing and restart. Did it fix the problem? If so, go buy your lotto tickets; if it didn’t, change it back and change something else. This is very effective but slow, so be patient.
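The change-one-thing, test, change-it-back cycle can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `try_changes`, `is_fixed`, and the setting names are hypothetical, and a real `is_fixed` would mean restarting the system and watching for the error.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


def try_changes(
    settings: Dict[str, Any],
    candidates: Iterable[Tuple[str, Any]],
    is_fixed: Callable[[Dict[str, Any]], bool],
) -> Optional[Tuple[str, Any]]:
    """Try one change at a time; return the first change that fixes the problem."""
    for key, new_value in candidates:
        # Work on a copy so every trial starts from the original settings.
        # That is the "change it back" step, done automatically.
        trial = dict(settings)
        trial[key] = new_value
        if is_fixed(trial):
            return key, new_value
    return None  # nothing on the list helped


# Hypothetical example: pretend the problem goes away at a lower memory speed.
settings = {"hyperthreading": True, "memory_speed": 3200, "fan_profile": "quiet"}
candidates = [("hyperthreading", False), ("fan_profile", "full"), ("memory_speed", 2666)]
fix = try_changes(settings, candidates, lambda s: s["memory_speed"] == 2666)
print(fix)
```

Because each trial is a fresh copy, two changes never get stacked on top of each other, which is what keeps the result easy to interpret.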
Start from scratch
I’ll be honest – I hate this tip. I hate it when I’m troubleshooting a software stack or a system, I’ve simplified it the best way I know how, I’ve made a lot of trial-and-error changes, and nothing is working. If that happens, there’s always the option to start over. Ugh.
Lather, rinse, repeat
That’s probably the best tip. Just keep at it. Don’t let the system, the machine, or the problem beat you. Troubleshooting is one part skill, one part luck, and a whole lot of patience. Keep thinking through the what, the why, and the how. Remember that as you go through the process (what happened, what should have happened, where did it go awry, and what can change), you’re learning volumes about your system. You’re basically stepping through the system, and at each step along the way you’re learning something you didn’t know before. Don’t give up. If you stay at it you’ll win. We all like to win, right?