How to solve hard (technical) problems

Meta

  • There are no hard problems. There is just a lack of information about how the system works
  • Remember that the bug is happening for a logical reason
  • Be unreasonably confident in your ability to fix the bug
  • Every error is an opportunity to learn
  • Be aware of imposter syndrome
  • Get enough sleep and take breaks
  • Try to tackle hard problems in the morning, with a fresh mind and without disruption (before you check email, chat, ticket system, monitoring, …)

Finding the root cause of the problem

  • What’s the error message? Are there any log files?
    • Read the error description. Every word of it. Twice.
    • Is there a typo somewhere (command line/configuration/code)?
  • Try to make the issue reproducible
    • Can you reproduce it from the command line? If so:
      • It’s easier for other people to reproduce the issue
      • It’s easier to test the fix
  • Isolate the problem
    • Remove some parts of the system and try to reproduce the bug
    • Vary one thing at a time while keeping all other things constant
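
The isolation steps above can be sketched as a small loop: remove one part at a time, re-run the reproduction, and keep the removal only if the bug still shows up. This is a minimal sketch rather than a definitive tool. The names `minimize`, `reproduces`, and `bug_shows_up`, as well as the component names, are hypothetical, and the sketch assumes the parts can be removed independently of each other.

```python
from typing import Callable, List


def minimize(parts: List[str], reproduces: Callable[[List[str]], bool]) -> List[str]:
    """Drop one part at a time; keep a removal only if the bug still reproduces.

    `reproduces` is a hypothetical callback that sets up the system with the
    given parts and returns True when the bug shows up.
    """
    if not reproduces(parts):
        raise ValueError("bug does not reproduce with the full set of parts")

    remaining = list(parts)
    for part in parts:
        candidate = [p for p in remaining if p != part]
        # Vary exactly one thing: everything except `part` stays the same.
        if reproduces(candidate):
            remaining = candidate  # `part` is not needed to trigger the bug
    return remaining


# Hypothetical example: the bug only needs "cache" and "proxy" to appear.
def bug_shows_up(active: List[str]) -> bool:
    return "cache" in active and "proxy" in active


suspects = minimize(["cache", "cron", "proxy", "plugin"], bug_shows_up)
print(suspects)  # ['cache', 'proxy']
```

In practice `reproduces` would run your command-line reproduction and inspect its exit code or output. The smaller the remaining list, the fewer places the root cause can hide.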

Issue still not fixed? Checklist

  • Does the problem occur only on a single server? Does the same thing run flawlessly somewhere else?
    • What’s the difference? Compare!
  • Can you increase the log verbosity (debug level)?
  • What parts of the system do you not understand? Take your time and learn about them!
  • Do you have multiple issues? Try to solve the underlying issue first
  • Get a stable debugging environment
  • When did the problem occur first? What has changed?
  • Is it really a problem, or is it intended behavior (a security feature)?
  • Do some sanity checks
    • Are you on the right virtual machine?
    • Can you ping the target host?
    • Is DNS still working?
    • Check network traffic with ngrep/tcpdump. Do you see what you expect?
    • Is one of the disks full?
    • Are you editing the right file?
      • Write some garbage and try to compile
    • Check the monitoring system
      • Do other VMs of the customer have problems?
      • Do other VMs running on the same hypervisor have problems?
      • Is the whole data center down?
    • Is the customer logged in to the system? What are they doing (check bash_history and ps -u)?
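
A few of the sanity checks above are easy to script so that they take seconds instead of minutes. The following is a minimal sketch using only the Python standard library. The function names, the 90 % "nearly full" threshold, and the `localhost` placeholder are assumptions made for illustration, not part of any existing tool.

```python
import shutil
import socket


def disk_percent_used(path: str = "/") -> float:
    """Return how full the filesystem containing `path` is, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # some virtual filesystems report a size of zero
        return 0.0
    return 100.0 * usage.used / usage.total


def resolves(hostname: str) -> bool:
    """Return True if the system resolver finds an address for `hostname`."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def sanity_report(target_host: str, path: str = "/", limit: float = 90.0) -> dict:
    """Collect a few quick facts; `limit` is an assumed 'nearly full' threshold."""
    used = disk_percent_used(path)
    return {
        "this_machine": socket.gethostname(),  # are you on the right VM?
        "dns_ok": resolves(target_host),       # is DNS still working?
        "disk_percent_used": round(used, 1),
        "disk_nearly_full": used >= limit,     # is one of the disks full?
    }


if __name__ == "__main__":
    # "localhost" is a placeholder; use the host you are actually debugging.
    for key, value in sanity_report("localhost").items():
        print(f"{key}: {value}")
```

The script only covers the checks that need no special privileges. Inspecting traffic with ngrep/tcpdump and checking the monitoring system still have to be done by hand.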

After some time of debugging

  • Explain the problem to a random teddy bear in a simple and comprehensible way
  • Be patient and accept that things just take longer than expected
  • Try to understand what is happening instead of guessing through endless trial and error
    • Is there documentation that can help you understand the system?
    • Talk to people who know the system better than you do
  • Problem is not business critical? Set it aside
  • Take a break (go for a walk, do some exercise, …)
  • Step back: What’s the actual goal you are trying to achieve? What’s the problem?
  • You’re out of time and stuck on details?
    • Use a different approach to solve your actual problem

If you copy/paste from Stack Overflow (we all do, at least sometimes)

  • Don’t copy/paste from Stack Overflow without understanding the actual problem
  • Don’t copy/paste from Stack Overflow without understanding the proposed solution
    • If you don’t have time for it right now => make a note to come back to it (even after the issue is solved)
    • If you don’t know what the command/tool is doing => read the man page (or try https://explainshell.com)
    • Don’t copy/paste commands/code. Type them yourself!

After solving the issue

  1. Well done! I’m glad you didn’t give up! It’s time to celebrate your success!
  2. What have you learned during the journey?
  3. What were the wrong assumptions?
  4. Can you prevent the problem from happening again in the future (write tests/docs, add monitoring)?
  5. How can you solve a similar problem in the future even faster?
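
One way to keep a solved problem from coming back is to pin the fix down with a regression test. The sketch below is purely illustrative: `parse_port` and the bug it once had (silently accepting out-of-range ports) are hypothetical stand-ins for whatever you just fixed.

```python
import unittest


def parse_port(value: str) -> int:
    """Hypothetical function that once accepted ports outside the valid range."""
    port = int(value)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortRegressionTest(unittest.TestCase):
    def test_valid_port_is_returned(self):
        self.assertEqual(parse_port("8080"), 8080)

    def test_out_of_range_port_is_rejected(self):
        # These inputs triggered the (hypothetical) original bug.
        for bad in ("0", "70000"):
            with self.assertRaises(ValueError):
                parse_port(bad)


# Run the tests programmatically so the snippet also works when pasted
# into a larger script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParsePortRegressionTest)
result = unittest.TextTestRunner().run(suite)
```

Naming the test after the incident, or linking the ticket in a comment, also documents the wrong assumption for whoever touches the code next.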