University of Warwick system status

Applications not available
Incident Report for University of Warwick
Postmortem

What happened?

On 10th May 2022 some of our applications started displaying a server problem message instead of the application itself. Within a couple of hours we had identified a workaround to restore access, and we now have a rough understanding of the root cause. We apologise for the inconvenience caused by applications being unavailable on that day. The explanation below covers what happened, how it was fixed and how we plan to avoid it happening again.

The problem

Each of our applications runs as a pool of backend servers behind an F5 load balancer appliance, which spreads requests across the backend servers. The appliance periodically monitors each backend server to make sure it’s still usable, and removes it from the pool if it decides it isn’t. On the 10th the appliance suddenly decided that it couldn’t reach some of the backend servers, and for some applications this included every pool member, making that application unavailable. This affected a relatively small number of our applications, but unfortunately one of them was Single Sign-on, which most of our applications rely on, so a visit to a working application would still have redirected you to a non-working Single Sign-on.
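The pool behaviour described above can be illustrated with a minimal sketch. This is not how the F5 appliance is implemented; the class names, the round-robin choice and the server names are all assumptions made for illustration. It shows only that a pool whose members have all been marked down by the monitor cannot serve any request, whatever the real state of the servers.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Backend:
    name: str  # hypothetical server name
    marked_up: bool = True  # the load balancer's view, not the server's real state


@dataclass
class Pool:
    members: List[Backend] = field(default_factory=list)
    _next: int = 0  # round-robin counter (assumed balancing strategy)

    def run_monitor(self, check: Callable[[Backend], bool]) -> None:
        """Apply the health check to every member and record the result."""
        for member in self.members:
            member.marked_up = check(member)

    def available(self) -> List[Backend]:
        """Members the load balancer currently believes are usable."""
        return [m for m in self.members if m.marked_up]

    def route(self) -> str:
        """Pick the next available member, or fail if none are left."""
        candidates = self.available()
        if not candidates:
            raise RuntimeError("no pool members available")
        chosen = candidates[self._next % len(candidates)]
        self._next += 1
        return chosen.name
```

If a monitor wrongly reports every member as down, for example `pool.run_monitor(lambda m: False)`, then `pool.route()` raises even though the backend servers themselves are healthy, which is the situation the affected applications were in.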

After some investigation to rule out firewall issues, we discovered that removing the monitor check brought the application back: the appliance was able to reach the application, and only its monitor could not. To add to the confusion, re-adding the same monitor would often work, which suggested a fault with the appliance itself.
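The effect of the workaround can be sketched as follows. This is an illustrative model only, assuming that a pool with no monitor attached treats all of its members as available; the function names and member names are hypothetical and do not correspond to any F5 configuration or API.

```python
from typing import Callable, List, Optional

# A monitor takes a member name and reports whether it looks healthy.
Monitor = Callable[[str], bool]


def available_members(members: List[str], monitor: Optional[Monitor]) -> List[str]:
    """Return the members traffic may be sent to.

    Assumption: with no monitor attached, every member is treated as available.
    """
    if monitor is None:
        return list(members)
    return [m for m in members if monitor(m)]


def faulty_monitor(member: str) -> bool:
    """Stand-in for the misbehaving check: reports every member as down."""
    return False


def direct_request(member: str) -> bool:
    """Stand-in for real traffic, which could still reach the server."""
    return True
```

With `faulty_monitor` attached, `available_members` returns an empty list; passing `None` in its place returns every member again, and `direct_request` succeeds for each of them. That mirrors the observation that the servers were reachable and only the monitor's view of them was wrong.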

We have a load balancer appliance in each data centre so that one can fail over to the other if it needs to. An emergency change was therefore scheduled for Sunday 15th May to attempt that failover at an off-peak time. The failover was successful, and testing showed that the newly active appliance did not have the issue, confirming a fault with the other appliance.

What’s next?

At the moment we are in a somewhat risky state: if the active appliance failed for any reason, unlikely though that is, the only appliance it could fail over to is the faulty one. We do still have the original workaround if that did happen. We have started conversations with the F5 support team about the problem. We were also already in the process of acquiring brand new F5 load balancers, which will replace these two appliances over the coming months and should be more reliable.

Posted May 17, 2022 - 14:57 BST

Resolved
There has been no recurrence of the issue. Root cause investigations continue and applications continue to be monitored as usual.
Posted May 11, 2022 - 14:46 BST
Monitoring
All services that were unavailable have now been restored. We continue to monitor the situation and investigate the root cause.
Posted May 10, 2022 - 16:41 BST
Update
Service has been restored on some applications and we are working to restore the remainder. We are still investigating the root cause.
Posted May 10, 2022 - 15:29 BST
Investigating
We are aware of an issue affecting various web applications and are currently investigating.
Posted May 10, 2022 - 13:42 BST
This incident affected: Sitebuilder (Web pages and files, Editing, Page statistics, Form submissions, Online Payments, Sitebuilder forums, Search, PeopleSearch, Experts Directory, Files.Warwick), Academic data management (Course & Module Catalogue, Mass Mailing, Module approval, MRM, Scholarships), Tabula (Coursework submission, Timetables, meeting records and profiles, Exam timetables, Assessment management, Small group teaching, Exam grids, Monitoring points, Mitigating circumstances, Tabula API), My Warwick (My Warwick web & mobile apps, Alert publishing, News publishing, Activities API, Push notifications, Welcome Week), Authentication (Single sign-on, IT Services Account Management, SMS messages, External service access, Account registration, University card photos, WebGroups, External user management, Student records web authentication), Moodle, Wellbeing Portal, Online Exams, and Training Essentials.