The Production Predictive API service had a 54-minute outage caused by human error. The disruption to services was immediately identified and recovery actions were directly initiated.
9:21 am (start of outage)
The error was immediately identified by both the engineer and automated monitoring.
Recovery actions to restore the service were initiated at once.
10:15 am (end of outage)
Automated monitoring triggered an outage notification to all customers through status.psma.com.au. We quickly enacted a recovery plan to restore services. Once restored, we monitored manually for a period before going back to automated monitoring. This then allowed us to start our postmortem analysis to identify why this happened and how we can do better.
Manual infrastructure changes are rare given our use of ‘infrastructure as code’. Still, when they are required, clear labelling of components becomes very important. What works for code may not be enough for humans. We were unhappy with the speed of automated deployment in the recovery process.
Improve the labelling of cloud infrastructure and components to be more straightforward and explicit (not just good for automated deployment) to prevent confusion.
Improve recovery processes to reduce the time for service restoration.
Improve the accessibility and usefulness of system logs to facilitate more effective investigations.