Planning, chaos, runbooks, and automation
There is no way to stop software and systems from going terribly wrong, but there are things you can do to mitigate the damage.
Planning for failure
The most basic and best thing you can do is to engineer your systems knowing that they will sometimes fail.
This will force you to think about how they will fail. What are the limitations of the components? What should you monitor to know that failure is imminent? Can you monitor things that will help you distinguish between different kinds of failure?
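As a rough illustration of those questions, here is a minimal Python sketch that separates "failure may be imminent" from "already failing" for a couple of metrics. The metric names and threshold values are invented for the example; real ones depend on the limits of your own components.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    warn: float  # above this, failure may be imminent
    fail: float  # above this, treat the component as failing


# Hypothetical metrics and limits, chosen only for illustration.
THRESHOLDS = {
    "disk_used_pct": Threshold(warn=80.0, fail=95.0),
    "queue_depth": Threshold(warn=1000.0, fail=5000.0),
}


def classify(value: float, threshold: Threshold) -> str:
    """Return 'ok', 'warning', or 'failing' for a single metric value."""
    if value >= threshold.fail:
        return "failing"
    if value >= threshold.warn:
        return "warning"
    return "ok"


def assess(metrics: dict[str, float]) -> dict[str, str]:
    """Classify every metric that has a known threshold; ignore the rest."""
    return {
        name: classify(value, THRESHOLDS[name])
        for name, value in metrics.items()
        if name in THRESHOLDS
    }
```

Watching several metrics at once is what lets you tell kinds of failure apart: a full disk with an empty queue suggests a different problem than a healthy disk with a growing queue.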
Distilling chaos into runbooks
When things do eventually fail, there is an art to managing the chaos. For example: everyone responding to a crisis must use the same communication channel. Anyone who changes something should say so in the channel. It might be necessary to assign someone to be the "incident commander" (although I hate this phrase, it gets the point across) to ensure that people are working together and not pulling in different directions.
At the end of the incident, because everyone used the same channel for communication, you hopefully have a complete log of what happened: what symptoms were reported, what the metrics were saying at the time, and what actions were taken and why. And now the most important part of the incident: turn that log into knowledge. Make a runbook for this kind of incident.
A good runbook starts with a brief description of the problem. Then it describes what the underlying cause probably is (the hypothesis), and a grade of "how terrible is this problem". Then it describes how to verify that the hypothesis is true on the running system. Given that the hypothesis is true, it then describes what other things will be broken, and how to fix the problem.
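The structure described above could be captured as a small data type. This is only a sketch: the field names, the 1 to 5 severity scale, and the rendered layout are my assumptions, not a standard runbook format.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    title: str
    description: str  # brief description of the problem
    hypothesis: str   # what the underlying cause probably is
    severity: int     # "how terrible is this problem", assumed scale of 1 to 5
    verification: list[str] = field(default_factory=list)  # how to confirm the hypothesis
    also_broken: list[str] = field(default_factory=list)   # expected collateral damage
    fix: list[str] = field(default_factory=list)           # steps to fix the problem

    def render(self) -> str:
        """Render the runbook as plain text, in the order a responder needs it."""
        lines = [
            self.title,
            f"Problem: {self.description}",
            f"Probable cause (hypothesis): {self.hypothesis}",
            f"Severity: {self.severity}/5",
            "How to verify the hypothesis:",
            *[f"  {i}. {step}" for i, step in enumerate(self.verification, 1)],
            "What else will be broken:",
            *[f"  - {item}" for item in self.also_broken],
            "How to fix it:",
            *[f"  {i}. {step}" for i, step in enumerate(self.fix, 1)],
        ]
        return "\n".join(lines)
```

Keeping verification separate from the fix matters: nobody should start on the fix steps until the hypothesis has been confirmed on the running system.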
Every time an incident occurs, even if the runbook is applied successfully, it should be everyone's job to update the runbook with any new observations or ameliorations.
When runbooks get too tedious, automate
At some point down the road, maybe your runbooks haven't changed in a while and they just work.
Now you can take time off fire-fighting duties and turn that runbook into automation.
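One possible shape for that automation, sketched in Python: confirm the hypothesis first, run the fix steps, then check again, and hand the problem back to a human whenever the runbook's assumptions do not hold. The function name and its parameters are hypothetical, and the verify and fix callables stand in for whatever your runbook actually describes.

```python
from typing import Callable


def run_automated_runbook(
    verify: Callable[[], bool],
    fix_steps: list[Callable[[], None]],
    log: Callable[[str], None] = print,
) -> bool:
    """Apply a runbook automatically.

    `verify` returns True while the hypothesised problem is present.
    Returns True only if the problem was confirmed and then fixed.
    """
    if not verify():
        # The runbook does not apply; do not touch anything.
        log("hypothesis not confirmed; escalating to a human")
        return False

    for step in fix_steps:
        log(f"running fix step: {getattr(step, '__name__', repr(step))}")
        step()

    if verify():
        # The fix did not work; this incident is not the one we automated.
        log("problem persists after fix; escalating to a human")
        return False

    log("problem fixed")
    return True
```

Passing `log` as a parameter means the automation can report what it changed in the same channel a human responder would use.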