Meta
- There are no hard problems. There is just a lack of information about how the system works
- Remember that the bug is happening for a logical reason
- Be unreasonably confident in your ability to fix the bug
- Every error is an opportunity to learn
- Be aware of imposter syndrome
- Get enough sleep and take breaks
- Try to tackle hard problems in the morning with a fresh mind and without disruption (before you check email, chat, ticket system, monitoring, …)
Finding the root cause of the problem
- What’s the error message? Are there any log files?
- Read the error description. Every word of it. Twice.
- Is there a typo somewhere (command line/configuration/code)?
- Try to make the issue reproducible
- Can you reproduce it from the command line?
- It’s easier for other people to reproduce the issue
- It’s easier to test the fix
- Isolate the problem
- Remove some parts of the system and try to reproduce the bug
- Vary one thing at a time while keeping all other things constant
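The "vary one thing at a time" advice can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name vary_one_at_a_time, the example settings, and the fake_check callback are all hypothetical, and in practice reproduces_bug would be whatever manual or scripted step you use to reproduce the issue. The sketch starts from a known-good configuration and applies exactly one difference from the broken configuration per trial.

```python
from typing import Any, Callable, Dict, List


def vary_one_at_a_time(
    good: Dict[str, Any],
    bad: Dict[str, Any],
    reproduces_bug: Callable[[Dict[str, Any]], bool],
) -> List[str]:
    """Return the settings whose 'bad' value alone triggers the bug."""
    culprits = []
    for key in sorted(set(good) | set(bad)):
        if good.get(key) == bad.get(key):
            continue  # identical on both sides, cannot explain the difference
        trial = dict(good)  # start from the known-good state every time
        if key in bad:
            trial[key] = bad[key]  # change exactly one setting
        else:
            trial.pop(key, None)  # the setting is absent in the bad config
        if reproduces_bug(trial):
            culprits.append(key)
    return culprits


# Hypothetical example data, not taken from any real system.
good = {"tls": "1.2", "workers": 4, "cache": True}
bad = {"tls": "1.3", "workers": 8, "cache": True}


def fake_check(config: Dict[str, Any]) -> bool:
    # Stand-in for a real reproduction step.
    return config["workers"] > 4


print(vary_one_at_a_time(good, bad, fake_check))  # ['workers']
```

If no single setting reproduces the bug, the cause may be a combination of changes, and the list comes back empty.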
Issue still not fixed? Checklist
- Does the problem occur only on a single server? Does the same thing run flawlessly somewhere else?
- What’s the difference? Compare!
- Can you increase the debug log?
- What parts of the system do you not understand? Take your time and learn about them!
- Do you have multiple issues? Try to solve the underlying issue first
- Get a stable debugging environment
- When did the problem occur first? What has changed?
- Is it really a problem, or is it intended behavior (a security feature)?
- Do some sanity checks
- Are you on the right virtual machine?
- Can you ping the target host?
- Is DNS still working?
- Check network traffic with ngrep/tcpdump. Do you see what you expect?
- Is one of the disks full?
- Are you editing the right file?
- Write some garbage and try to compile
- Check the monitoring system
- Do other VMs of the customer have problems?
- Do other VMs running on the same hypervisor have problems?
- Is the whole data center down?
- Is the customer logged in on the system? What are they doing (check bash_history and ps -u)?
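Some of the sanity checks above are easy to script so they get run every time. The following is a minimal Python sketch that covers only two of them (is DNS still working, is one of the disks full). The function names, the 95 percent threshold, and the example host and path are assumptions for illustration. It uses only the standard library.

```python
import shutil
import socket
from typing import List


def dns_resolves(hostname: str) -> bool:
    """Return True if the hostname can be resolved to an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def disk_usage_percent(path: str) -> float:
    """Return how full the filesystem containing path is, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo filesystems may report a size of zero
    return 100.0 * usage.used / usage.total


def run_sanity_checks(
    hostname: str, paths: List[str], full_threshold: float = 95.0
) -> List[str]:
    """Collect human-readable findings; an empty list means nothing was flagged."""
    findings = []
    if not dns_resolves(hostname):
        findings.append(f"DNS: cannot resolve {hostname}")
    for path in paths:
        percent = disk_usage_percent(path)
        if percent >= full_threshold:
            findings.append(f"disk: {path} is {percent:.0f}% full")
    return findings


if __name__ == "__main__":
    # Hypothetical target host and path; replace with your own.
    for finding in run_sanity_checks("localhost", ["."]):
        print(finding)
```

An empty result only means these two checks passed; the remaining items on the list still need a look.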
After some time of debugging
- Explain the problem to a random teddy bear in simple, comprehensible terms
- Be patient and accept that things just take longer than expected
- Try to understand what is happening instead of guessing through endless trial and error
- Is there documentation that can help you understand the system?
- Talk to people who know the system better than you
- Problem is not business critical? Set it aside
- Take a break (go for a walk, do some exercises, …)
- Step back: What’s the actual goal you are trying to achieve? What’s the problem?
- You’re out of time and stuck on details?
- Use a different approach to solve your actual problem
If you copy/paste from Stack Overflow (we all do, at least sometimes)
- Don’t copy/paste from Stack Overflow without understanding the actual problem
- Don’t copy/paste from Stack Overflow without understanding the proposed solution
- If you don’t have time for it right now => make a note about it (even after solving it)
- If you don’t know what the command/tool is doing => read the man page (https://explainshell.com)
- Don’t copy/paste commands/code. Type it on your own!
After solving the issue
- Well done! I’m glad you didn’t give up! It’s time to celebrate your success!
- What have you learned during the journey?
- What were the wrong assumptions?
- Can you prevent the problem from happening again in the future (write tests/docs, monitoring)?
- How can you solve a similar problem even faster in the future?