Over the Christmas period, the University website at warwick.ac.uk became unavailable overnight. We apologise to anyone who was trying to look up information at that time.
Part of the database backup process failed. When this happens, database changes stack up in a queue on the standby database until the backup process can restart and collect them all. During the Christmas shutdown period we don’t have the usual level of human monitoring of alerts, so the problem went unnoticed until the standby database ran out of disk space. That in turn caused the live database to stop: it streams a replica of its data to the standby and will refuse to let the replica fall out of sync, since it becomes a lot harder to get the two back in sync once updates have been dropped.
We resolved the issue by increasing the available disk space, which allowed the backup process to complete and replication to continue.
Although we have monitoring in place, and alerts were being sent a couple of days before disk space ran out, alerts are only useful if someone is around to see them. We’ll review whether some of the critical alerts can be sent through other channels, such as SMS, so that they are seen and acted on even when fewer people are around.
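To illustrate the kind of escalation we have in mind, here is a minimal Python sketch. It is not our actual monitoring configuration: the `Alert` class, the `channels_for` function, the channel names and the four-hour threshold are all hypothetical, chosen only to show the idea that an alert nobody has acknowledged should move to a more intrusive channel.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Alert:
    """A hypothetical alert record; field names are illustrative only."""
    name: str
    raised_at: datetime
    acknowledged_at: Optional[datetime] = None


def channels_for(alert: Alert, now: datetime,
                 escalate_after: timedelta = timedelta(hours=4)) -> List[str]:
    """Return the channels an alert should be sent to at time `now`.

    The four-hour default is an assumed value, not a recommendation.
    """
    if alert.acknowledged_at is not None:
        return []  # someone has seen the alert, so stop notifying
    channels = ["email"]  # the normal channel, always used first
    if now - alert.raised_at >= escalate_after:
        channels.append("sms")  # still unacknowledged: escalate
    return channels
```

With a policy like this, an alert raised late in the evening would go by email only at first, and would additionally go by SMS once it had sat unacknowledged for longer than the threshold.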
We may also review the backup system to see whether we can decouple it from the live database so that a failed backup can’t disrupt service, or at least give it enough of a disk buffer to survive for the length of the Christmas shutdown period.
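Sizing that buffer is simple arithmetic, sketched below in Python. The function names, the safety factor and any figures passed in are assumptions for illustration. We haven't published our real change rate or shutdown length here, so treat this as a way of framing the calculation rather than as our actual numbers.

```python
import math


def required_buffer_gb(change_rate_gb_per_day: float,
                       shutdown_days: float,
                       safety_factor: float = 1.5) -> float:
    """Disk space needed to queue changes for a whole unattended period.

    `safety_factor` is an assumed margin for busier-than-usual days.
    """
    return change_rate_gb_per_day * shutdown_days * safety_factor


def days_until_full(free_gb: float, change_rate_gb_per_day: float) -> float:
    """How long the standby can keep queueing changes before the disk fills."""
    if change_rate_gb_per_day <= 0:
        return math.inf  # nothing is accumulating, so the disk never fills
    return free_gb / change_rate_gb_per_day
```

Comparing `days_until_full` for the current free space against the length of the shutdown shows whether a failed backup could go unattended for the whole period without stopping the live database.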