At around 12:45pm on Tuesday 12th January, we started to see a spike in response times for requests to the Tabula API, and an increase in error rates for My Warwick (which is a heavy consumer of this data). This continued to escalate over the following hour, to the point where the Tabula API appeared unable to cope with the volume of requests: requests were backing up and the queue of requests waiting to be processed kept growing.
The impact of this was:
An incident response team comprising the software engineering and infrastructure teams worked to diagnose the underlying cause of the issues. We were unable to detect a change in usage patterns of the Tabula web application, the Tabula API or My Warwick that would cause the 10x increase in load on the Tabula database, but we were able to see a large number of requests to individual endpoints, which we tried to resolve.
As part of this work, we made changes to the web load balancer that underpins all the internal web applications at Warwick and their databases. Due to a bug in the load balancer software, this led to a partial outage of some services between approximately 5:30pm and 5:45pm, and ongoing sporadic issues with access to web systems until 9:15am on Wednesday 13th January, at which point the problems were fixed. As always, our priority when responding to incidents is to restore service as soon as possible, and we made the decision - after adding additional resources to the Tabula database to cope with the increased workload - to leave the application in a degraded state and restore My Warwick access at 7pm.
Additional fixes to the Tabula source code to reduce the amount of work done in the database were released up to 10pm. At this point, our working theory was that a large number of “user lookups” were not being returned by the single sign-on system, causing Tabula to fall back on its own database for the information. It did this extremely inefficiently, issuing one request to the database for each missing user. A code change was made at 9:15am on Wednesday 13th January to fetch these users in bulk from the Tabula database rather than one at a time, and at that point the workload on the Tabula database returned to its pre-incident level.
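To illustrate the difference between the two lookup strategies described above, here is a minimal sketch in Python using an in-memory SQLite database. It is not Tabula's actual code: the `member` table, its columns, the function names and the batch size of 500 are all assumptions made for the example. It only shows why one query per missing user scales so much worse than fetching the same users in batches.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE member (university_id TEXT PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO member VALUES (?, ?)",
        [(f"{n:07d}", f"User {n}") for n in range(1000000, 1001000)],
    )
    return conn


def fetch_one_at_a_time(conn, ids):
    """Inefficient pattern: one database query per user ID."""
    users, queries = {}, 0
    for uid in ids:
        row = conn.execute(
            "SELECT university_id, name FROM member WHERE university_id = ?",
            (uid,),
        ).fetchone()
        queries += 1
        if row is not None:
            users[row[0]] = row[1]
    return users, queries


def fetch_in_bulk(conn, ids, batch_size=500):
    """Bulk pattern: one query per batch of IDs, using an IN (...) clause."""
    ids = list(ids)
    users, queries = {}, 0
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # One "?" placeholder per ID keeps the query parameterised.
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            "SELECT university_id, name FROM member "
            f"WHERE university_id IN ({placeholders})",
            batch,
        ).fetchall()
        queries += 1
        users.update(rows)  # each row is a (university_id, name) pair
    return users, queries
```

Both functions return the same users along with the number of queries issued. In this sketch, looking up 1,200 IDs takes 1,200 queries one at a time but only 3 in batches of 500. Batching, rather than sending a single unbounded `IN` list, is a design choice made here because databases typically limit the number of bound parameters per statement.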
On further investigation of the underlying causes, we identified a change to the single sign-on system, made just before 12pm on Tuesday 12th January, which changed the behaviour of lookups of multiple users at the same time by their 7-digit Warwick University ID. We rolled back this change as soon as it was identified and have put additional checks in place to ensure that it cannot recur.