PSMA Cloud and Addresses API outage
Incident Report for Geoscape
Postmortem

What happened?

  • All PSMA Cloud functions and the Addresses API address search calls experienced an outage from 11:44am to 6:43pm AEDT on 11 January 2021.
  • The Online Data Delivery Service experienced an outage from 11:44am to 12:48pm AEDT.
  • The outages were caused by the failure of a switch in our hosting provider's network. The return of PSMA Cloud services was significantly delayed because the hosting provider could not restore a critical database server using its normal processes.
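
As a quick check on the impact windows above, the outage durations can be worked out from the stated start and end times. This is a minimal sketch that assumes both windows fall on 11 January 2021 (the date of the status updates); the helper name `duration` is illustrative only.

```python
from datetime import datetime

# Assumed incident date, taken from the timestamps on the status updates.
DAY = "2021-01-11"


def duration(start, end):
    """Return the time between two same-day clock times such as '11:44am'."""
    fmt = "%Y-%m-%d %I:%M%p"
    return datetime.strptime(f"{DAY} {end}", fmt) - datetime.strptime(f"{DAY} {start}", fmt)


# PSMA Cloud and Addresses API address search
cloud_and_api = duration("11:44am", "6:43pm")
# Online Data Delivery Service
data_delivery = duration("11:44am", "12:48pm")

print(cloud_and_api)  # 6:59:00
print(data_delivery)  # 1:04:00
```

On these figures, PSMA Cloud and the Addresses API were unavailable for just under seven hours, and the Online Data Delivery Service for a little over one hour.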

11:44am

  • Start of outage

11:48am

  • Geoscape's monitoring notifications indicate that a number of servers and their associated services are experiencing an outage.
  • Investigation reveals that the affected servers are all related to a third-party hosting provider.
  • The hosting provider is contacted, confirms that the outage is within its environment, and raises a P1 incident.

12:05pm

  • A Statuspage incident is raised to advise customers of the outage.

12:15pm

  • The hosting provider identifies that the affected servers have disconnected from the storage channel paths, preventing them from seeing any datastores. Remediation action commences.

12:50pm

  • Some servers are returned to operation. The Online Data Delivery Service becomes available again.
  • The hosting provider notes that a database server is failing to start up. A backup restore is commenced for this server.

2:20pm

  • Additional servers are brought online.
  • The hosting provider identifies the root cause to be a failure of a storage switch module.

3:00pm

  • Attempts to restore the remaining database server continue to fail because of corruption. Another attempt is made to recover the server.

3:45pm

  • The corruption errors continue for the database server, so a decision is made to deploy a Geoscape database backup onto a previously decommissioned server. Work commences to recommission the server and restore the database.

5:40pm

  • Testing of this server and database is successful. The hosting provider is asked to update the server configuration, firewall and network rules so that this server replaces the failed database server.

6:43pm

  • The configuration changes are completed and services are restored. End of outage.

What did we learn?

  • The hosting provider had not tested server backups.
  • The extra work that Geoscape does to back up databases 'just in case' was validated.
  • Moving our infrastructure to a new cloud-based host is the right direction.

What are we going to do?

  • The hosting provider has been asked to confirm that all server backups are tested monthly.
  • Continue efforts to move all PSMA Cloud infrastructure to the new cloud-based host.
Posted Jan 20, 2021 - 23:13 AEDT

Resolved
This incident has been resolved.
Posted Jan 12, 2021 - 07:55 AEDT
Monitoring
The services have been restored and we will continue to monitor the results.
Posted Jan 11, 2021 - 18:48 AEDT
Identified
The issue is related to an internal error within our hosting provider's environment. We are currently working with them to resolve this as soon as possible.
Posted Jan 11, 2021 - 13:38 AEDT
Investigating
We are currently experiencing an outage to a number of services.

The team are investigating and will provide an update as soon as possible.

Best Regards,

Geoscape Support Team
Geoscape Australia
T: +61 (0)2 6260 9000
E: support@geoscape.com.au
https://support.geoscape.com.au
Posted Jan 11, 2021 - 12:05 AEDT
This incident affected: APIs (Addresses API) and PSMA Cloud.