AEP unavailable

Incident Report for University of Warwick

Postmortem

At 17:20 on June 3rd, we began preparing a new release of the Alternative Exams Portal. Release v57 would include a range of changes designed to improve the performance of the system, particularly for morning fixed-start assessments where many students attempt to access the portal at the same time.

One of these performance improvements changed our scheduling library (Quartz) to use its own JDBC connection, separate from the rest of the app. The aim was better isolation: connections blocked waiting for PostgreSQL locks on the Quartz tables should no longer affect the other database connections, which are used for retrieving domain data about the actual examinations.

We use Puppet internally to centrally manage configuration for our application node VMs. As part of the database connection change, we had modified the application.conf file to include the new connection string and deployed this configuration change to all nodes in the cluster.

When we started the deployment of v57 at 17:47 (we try to avoid busy times), we quickly noticed the deployment wasn’t working as we expected. The deploy had only taken effect on half of the application nodes. The deployment was repeated and failed once again. The app was now struggling to acquire database connections and to serve incoming HTTP requests; we suspected that the addition of the new JDBC connection had caused some sort of connection limit to be hit.

The decision was taken to back out the new release and re-deploy the known-good previous version: v56.3. This version of the app used one connection for both Quartz and the rest of the data access logic in the AEP application. The team had hoped that the configuration change made in preparation for v57 would be ignored, in the same way that other unknown configuration keys in a Play! framework app would usually be ignored, and that we could restore service simply by running the older version of the code.

This rollback unfortunately failed - v56.3 of the application simply refused to start (Failed to create Slick database config for key quartz) following the application.conf configuration change to add a new JDBC connection. By 18:05, the AEP was completely inaccessible and had been for a few minutes. Incoming requests were failing at the F5 load balancer. Most nodes had been taken out of the load balancer pool once our monitoring systems noticed they were broken. The end result was that there were no working application nodes available to serve visitors. These failures manifested as TLS connection problems in most browsers.

Whilst a configuration change rollback was initiated via Puppet to fix application.conf, we knew it would take some time for every application and scheduling node to be updated. The decision was made to speed up the recovery by removing the problematic configuration data. We did this by hand, SSHing into nodes and commenting out the lines in question with vim before forcibly restarting the systemd service and manually re-adding the node to the load balancer pool.
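
The manual edit described above can be sketched in Python. This is only an illustration of the idea, not the procedure we actually ran (that was done by hand in vim). The function name comment_out_keys and the key prefix slick.dbs.quartz are assumptions made for the example, and it assumes the settings are written as flat dotted keys, one per line, rather than as nested blocks.

```python
def comment_out_keys(conf_text: str, key_prefix: str) -> str:
    """Return conf_text with every line that sets a key under key_prefix commented out.

    Hypothetical helper: assumes flat dotted keys, one setting per line.
    HOCON (the application.conf format) accepts '#' as a comment marker.
    """
    result = []
    for line in conf_text.splitlines():
        if line.lstrip().startswith(key_prefix):
            # Disable the setting but leave it visible for later review.
            result.append("# " + line)
        else:
            # Unrelated settings and existing comments are left untouched.
            result.append(line)
    return "\n".join(result)


if __name__ == "__main__":
    # Sample content is invented for illustration only.
    sample = "\n".join([
        'slick.dbs.default.db.url = "jdbc:postgresql://db.example/aep"',
        'slick.dbs.quartz.db.url = "jdbc:postgresql://db.example/aep"',
    ])
    print(comment_out_keys(sample, "slick.dbs.quartz"))
```

Commenting the lines out rather than deleting them keeps the change easy to review and reverse once the managed configuration catches up.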

By 18:13 - 8 minutes later - we’d managed to get the service back up in a state where it could start responding to HTTP requests - albeit with a very small number of application nodes and limited redundancy in case of further failure. We could then start to work on trying to speed along the propagation of the Puppet change to the rest of the nodes ready for the Play server to be started back up again.

The vast majority of nodes were updated within 15 minutes, but a handful took slightly longer. We think this was due to overutilisation on our Puppet compile masters which meant that fetching the Puppet catalog on each node was slow - sometimes taking upwards of 10 minutes. It wasn’t until 18:50 that things were completely back to normal, with all nodes in a healthy state.

Moving forward, in the short-to-medium term we will be:

Working to improve the speed at which we can propagate Puppet configuration changes out to managed nodes. We will start by adding more resource to the compile masters, ensuring that they are configured properly to use the available resource, and potentially investigating further horizontal scaling. Later, we’ll revisit our continuous deployment implementation and move away from a polling-based approach to one that’s triggered immediately by a git push.
Increasing pgbouncer’s max_client_conn to cope with the increased number of connections that come from maintaining a separate, dedicated Quartz connection on each node.
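
As a rough illustration of why the connection limit needs to grow, the client connection budget can be estimated as the number of nodes multiplied by the connections each node holds, plus some headroom. The sketch below is a back-of-the-envelope calculation only: the function name and all of the figures (node count, pool sizes, headroom) are hypothetical and are not our production values.

```python
def required_client_connections(nodes: int, app_pool: int, quartz_pool: int,
                                headroom_percent: int = 20) -> int:
    """Estimate a lower bound for pgbouncer's max_client_conn.

    Every node holds app_pool connections for domain data plus quartz_pool
    connections dedicated to Quartz. headroom_percent adds spare capacity
    on top. All inputs are illustrative assumptions.
    """
    base = nodes * (app_pool + quartz_pool)
    total_times_100 = base * (100 + headroom_percent)
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-total_times_100 // 100)


if __name__ == "__main__":
    # Hypothetical cluster: 20 nodes, 10 app connections and 3 Quartz connections each.
    before = required_client_connections(20, 10, 0)
    after = required_client_connections(20, 10, 3)
    print(before, after)
```

The point of the calculation is that a small per-node increase is multiplied across the whole cluster, so a limit that was comfortable with one connection pool per node may not be with two.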

We have already:

Contacted the students and departments affected, namely those involved with 24h exams ending during the downtime. We believe only three students submitted late because of it. We’ve reached out to all of these candidates and explained the next steps to ensure that their late submission can be considered in the correct context by the academic department.

Departments and students rely on the AEP being available 24 hours a day, and on this occasion we have fallen short of the standards we try our best to maintain within the team. We would like to apologise for the inconvenience caused by this incident.

Posted Jun 03, 2021 - 20:33 BST

Resolved

The issue is resolved and AEP is fully available.

Posted Jun 03, 2021 - 19:15 BST

Update

More healthy nodes are now returning to the load balancer pool ready to pick up incoming HTTP requests. We will continue monitoring the site as it recovers full operational capacity but we believe that AEP use should have been possible after 18:13 this evening (with some brief additional downtime at around 19:00).

Posted Jun 03, 2021 - 19:04 BST

Monitoring

We have implemented a workaround to restore access to AEP whilst we resolve the root cause. This workaround has allowed us to serve incoming HTTP requests.

Posted Jun 03, 2021 - 18:13 BST

Identified

We're currently looking into issues with the AEP site.

Posted Jun 03, 2021 - 18:06 BST