Tabula database performance issues

Incident Report for University of Warwick

Postmortem

At around 12:45pm on Tuesday 12th January, we started to see a spike in response times for requests to the Tabula API, and an increase in error rates for My Warwick (which is a heavy consumer of this data). This continued to escalate over the following hour, to the point where it appeared that the Tabula API was unable to cope with the volume of requests (requests were being backed up, one after the other, and the queue of requests to be processed was growing).

The impact of this was:

Requests to Tabula were slow, both to the API and the web application. This is because the number of queries being processed by the Tabula database had increased roughly 10-fold, so the database was handling approximately 10x its normal workload
Due to the slow requests, some services were hitting timeouts - notably, timetables and other information in the My Warwick app, which depends on the Tabula API, weren’t being returned. At around 2pm, we made the decision to shut down the component of My Warwick that makes these requests in an attempt to restore other critical services on Tabula; this was successful, but meant that timetables and other information in My Warwick were completely unavailable from 2pm until around 7pm

An incident response team comprising the software engineering and infrastructure teams worked to diagnose the underlying cause of the issues. We were unable to detect a change in usage patterns of the Tabula web application, the Tabula API or My Warwick that would cause the 10x increase in load on the Tabula database, but we did see a large number of requests to individual endpoints, which we tried to address.

As part of this work, we made changes to the web load balancer that underpins all the internal web applications at Warwick and their databases. Due to a bug in the load balancer software, this led to a partial outage of some services between approximately 5:30pm and 5:45pm, and ongoing sporadic issues with access to web systems until 9:15am on Wednesday 13th January, at which point the problems were fixed. As always, our priority when responding to incidents is to restore service as soon as possible, and we made the decision - after adding additional resources to the Tabula database to cope with the increased workload - to leave the application in a degraded state and restore My Warwick access at 7pm.

Additional fixes to the Tabula source code to reduce the amount of work done in the database continued to be released until 10pm. By this point, we had a working theory: a large number of “user lookups” were failing to be returned by the single sign-on system, which caused Tabula to fall back on its own database for the information. This fallback was extremely inefficient, issuing one request to the database for each missing user. A code change was made at 9:15am on Wednesday 13th January to fetch these users in bulk from the Tabula database rather than one at a time, and at that point we saw the workload on the Tabula database return to what it was before the incident started.
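To illustrate the difference between looking up missing users one at a time and fetching them in bulk, here is a minimal sketch in Python using an in-memory SQLite database. It is not the Tabula code: the `member` table, its columns, the sample data, the batch size and all function names are assumptions made purely for illustration.

```python
import sqlite3


def make_db():
    """Create a hypothetical in-memory table of users keyed by University ID."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE member (university_id TEXT PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO member VALUES (?, ?)",
        [("0000001", "Ada"), ("0000002", "Grace"), ("0000003", "Alan")],
    )
    return conn


def lookup_one_at_a_time(conn, ids):
    """Inefficient pattern: issues one query per requested ID."""
    users, queries = {}, 0
    for uid in ids:
        row = conn.execute(
            "SELECT university_id, name FROM member WHERE university_id = ?",
            (uid,),
        ).fetchone()
        queries += 1
        if row is not None:
            users[row[0]] = row[1]
    return users, queries


def lookup_in_bulk(conn, ids, batch_size=500):
    """Bulk pattern: one query per batch of IDs, using an IN (...) clause."""
    users, queries = {}, 0
    ids = list(ids)
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # One "?" placeholder per ID keeps the query parameterised.
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            "SELECT university_id, name FROM member "
            f"WHERE university_id IN ({placeholders})",
            batch,
        ).fetchall()
        queries += 1
        users.update(rows)  # each row is a (university_id, name) pair
    return users, queries
```

Both functions return the same users, but the first issues as many queries as there are IDs, while the second issues one query per batch. When many lookups fall back to the database at once, that difference in query count is the kind of multiplier described above.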

On further investigation of the underlying causes, we identified a change to the single sign-on system, made just before 12pm on Tuesday 12th January, which changed the behaviour of lookups that request multiple users at once by their 7-digit Warwick University ID. We rolled back this change when it was identified and have put additional checks in place to ensure that it cannot recur.

Posted Jan 13, 2021 - 12:38 GMT

Resolved

The issues with Tabula performance have now been fixed.

Posted Jan 13, 2021 - 11:10 GMT

Monitoring

We've made a change to reduce the load on Tabula's database, which appears to have been effective. We are currently monitoring the results.

Posted Jan 13, 2021 - 09:58 GMT

Update

We have increased available computing power, and are monitoring while we continue to investigate Tabula database underperformance.

Posted Jan 12, 2021 - 19:00 GMT

Investigating

We're currently investigating performance problems with the Tabula database, which will also affect My Warwick updates.

Posted Jan 12, 2021 - 16:04 GMT