AWS Sydney - Degraded performance leading to intermittent outages
Incident Report for Geoscape
Postmortem

What happened?

Our service provider, AWS, experienced a failure of a data store that affected all customers of that service. Unfortunately, PSMA was an affected customer.

More information on AWS outages can be obtained from the AWS Service Health Dashboard.

10:07am - PSMA received advice from AWS of a failure in a subsystem affecting services that use VPC (Virtual Private Cloud) in all availability zones in the ap-southeast-2 (Sydney) region. PSMA services were unaffected at this time.

12:34pm - PSMA monitoring detected intermittent failures and increased latency on the Addresses API, Predictive API and Beta APIs. Some API calls were timing out or failing.

3:49pm - AWS disabled writes to the data store to allow restoration to a previous known-good state. This increased the rate of failed API calls to PSMA services.

4:20pm - Almost all API calls were now failing on the Addresses API, Predictive API, Buildings API and Beta APIs.

4:40pm - AWS successfully restored the data store and re-enabled writes, leading to an increase in successful API calls on PSMA services.

4:55pm - All PSMA services had fully recovered.

5:55pm - AWS advised that the restore was successful and all services were operating across the region again.

What did we do?

PSMA was unable to take any action as the issue was entirely within the domain of AWS. PSMA continued to monitor its services and the activities of AWS throughout the period.

What did we learn?

While an AWS failure across an entire region is rare, it is a genuine risk.

What are we going to do?

PSMA will investigate options to allow the continuation of services in the event of AWS-wide incidents in the future.

We would appreciate feedback on your expectations and any concerns you may have regarding this event.

Posted Jan 24, 2020 - 15:58 AEDT

Resolved
This incident has been resolved.
Posted Jan 24, 2020 - 08:06 AEDT
Monitoring
As of 4:55pm services are no longer degraded. We will continue to monitor and supply an update when we have more feedback from AWS.
Posted Jan 23, 2020 - 17:29 AEDT
Update
AWS has advised that it expects services to be fully operational again in ~2 hours.
We will continue to update as we have more details.
Posted Jan 23, 2020 - 16:10 AEDT
Identified
We're aware of an intermittent issue with AWS at the moment that is causing some failed API calls on PSMA services. We're actively following up on the situation and will update accordingly.

AWS has identified the issue and is working on multiple paths to resolve it.

Cheers
Keenan
Posted Jan 23, 2020 - 15:00 AEDT
This incident affected: APIs (Predictive API, Addresses API, Buildings API) and Beta APIs.