All PSMA Cloud functions and the Addresses API address search calls experienced an outage from 11:44am to 6:43pm AEDT.
The Online Data Delivery Service experienced an outage from 11:44am to 12:48pm AEDT.
The outages were caused by the failure of a switch in our hosting provider's network. The return of PSMA Cloud services was significantly delayed because the hosting provider could not restore a critical database server using its normal processes.
11:44am
Start of outage
11:48am
Geoscape's monitoring notifications indicate that a number of servers and their associated services are experiencing an outage.
Investigation reveals that the affected servers are all related to a third-party hosting provider.
The hosting provider is contacted, confirms that the outage is within their environment, and raises a P1 incident.
12:05pm
A Statuspage incident is raised to advise customers of the outage.
12:15pm
The hosting provider identifies that the affected servers have disconnected from the storage channel paths, preventing them from seeing any datastores. Remediation action commences.
12:50pm
Some servers are returned to operation. The Online Data Delivery Service becomes available again.
The hosting provider notes that a database server is failing to start up. A backup restore is commenced for this server.
2:20pm
Additional servers are brought online.
The hosting provider identifies the root cause to be a failure of a storage switch module.
3:00pm
Attempts to restore the remaining database server continue to fail because of corruption errors. Another attempt is made to recover the server.
3:45pm
The corruption errors on the database server persist, so a decision is made to deploy a Geoscape database backup onto a previously decommissioned server. Work commences to recommission the server and restore the database.
5:40pm
Testing of this server and database is successful. The hosting provider is requested to update the server configuration, firewall and network rules so that it replaces the existing failed database server.
6:43pm
The configuration changes are completed and services are restored. End of outage.
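For reference, the outage durations implied by the start and end times reported above can be worked out with a short Python sketch. The function names `outage_minutes` and `as_hours_minutes` are illustrative only, and the date is assumed from the status updates (Jan 11, 2021); this is a convenience calculation, not part of any Geoscape tooling.

```python
from datetime import datetime

# Clock times are taken from this review; the date is an assumption
# based on the status updates posted on Jan 11, 2021.
FMT = "%Y-%m-%d %I:%M%p"


def outage_minutes(start: str, end: str, day: str = "2021-01-11") -> int:
    """Return whole minutes between two same-day clock times such as '11:44am'."""
    t0 = datetime.strptime(f"{day} {start}", FMT)
    t1 = datetime.strptime(f"{day} {end}", FMT)
    return int((t1 - t0).total_seconds() // 60)


def as_hours_minutes(minutes: int) -> str:
    """Format a minute count as hours and minutes."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}m"


# PSMA Cloud functions and Addresses API address search: 11:44am to 6:43pm
print(as_hours_minutes(outage_minutes("11:44am", "6:43pm")))   # 6h 59m

# Online Data Delivery Service: 11:44am to 12:48pm
print(as_hours_minutes(outage_minutes("11:44am", "12:48pm")))  # 1h 04m
```

On these figures, roughly six of the seven hours of PSMA Cloud downtime fell after the Online Data Delivery Service had already returned, which is the period spent recovering the database server.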
What did we learn?
The hosting provider had not tested server backups.
The extra work that Geoscape does to back up databases 'just in case' was validated.
This incident reinforces the case for moving our infrastructure to a new cloud-based host.
What are we going to do?
The hosting provider has been asked to confirm that all server backups are tested on a monthly basis.
Continue efforts to move all PSMA Cloud infrastructure to the new cloud-based host.
Posted Jan 20, 2021 - 23:13 AEDT
Resolved
This incident has been resolved.
Posted Jan 12, 2021 - 07:55 AEDT
Monitoring
The services have been restored and we will continue to monitor the results.
Posted Jan 11, 2021 - 18:48 AEDT
Identified
The issue is related to an internal error within our hosting provider's environment. We are currently working with them to resolve this as soon as possible.
Posted Jan 11, 2021 - 13:38 AEDT
Investigating
We are currently experiencing an outage to a number of services.
The team are investigating and will provide an update as soon as possible.