What happened?

The Production Predictive API service had a 54-minute outage caused by human error. The disruption was identified immediately and recovery actions began at once.

9:21 am (start of outage)

An engineer deleted a component of the Predictive API in the production environment.
The error was immediately identified by both the engineer and automated monitoring.
Recovery actions to restore the service were initiated at once.

10:15 am (end of outage)

Manual testing and automated monitoring confirmed the return of all services.

What did we do?

Automated monitoring triggered an outage notification to all customers through status.psma.com.au. We quickly enacted a recovery plan to restore services. Once services were restored, we monitored them manually for a period before returning to automated monitoring. We then began our postmortem analysis to identify why this happened and how we can do better.

What did we learn?

Manual infrastructure changes are rare given our use of ‘infrastructure as code’. Still, when they are required, clear labelling of components becomes very important: names that work for automated deployment may not be clear enough for humans. We were also unhappy with how long automated deployment took during recovery.

What are we going to do?

Improve the labelling of cloud infrastructure and components to be more straightforward and explicit (not just good for automated deployment) to prevent confusion.
Improve recovery processes to reduce the time for service restoration.
Improve the accessibility and usefulness of system logs to facilitate more effective investigations.
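
To illustrate the labelling action above, a minimal sketch of a pre-deletion check might look like the following. Everything in it is hypothetical: the `Component` type, the required label keys, and the typed-confirmation rule are assumptions made for illustration and do not describe the actual tooling or naming scheme behind the Predictive API.

```python
from dataclasses import dataclass, field

# Hypothetical label keys that every component is assumed to carry.
REQUIRED_LABELS = ("service", "environment", "owner")


@dataclass(frozen=True)
class Component:
    """Illustrative stand-in for a piece of cloud infrastructure."""
    name: str
    labels: dict = field(default_factory=dict)


def missing_labels(component):
    """Return the required label keys that are absent or empty."""
    return [key for key in REQUIRED_LABELS if not component.labels.get(key)]


def may_delete(component, typed_confirmation=""):
    """Decide whether a manual deletion should be allowed.

    Assumed rules for this sketch:
    - a component with incomplete labels is never deleted by hand;
    - a production component also requires the operator to type its
      exact name, so the environment is confirmed explicitly.
    """
    if missing_labels(component):
        return False
    if component.labels["environment"] == "production":
        return typed_confirmation == component.name
    return True
```

Under these assumptions, a call such as `may_delete(component, typed_confirmation="...")` returns `False` for a production component unless the operator has typed its full name, which is one way to make the environment explicit to a human at the moment of a manual change.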

Posted Feb 03, 2020 - 14:56 AEDT

Resolved

This incident has been resolved.

Posted Jan 28, 2020 - 08:11 AEDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jan 24, 2020 - 10:38 AEDT

Identified

The issue has been identified and a fix is being implemented.

Posted Jan 24, 2020 - 10:04 AEDT

This incident affected: APIs (Predictive API).