At 17:20 on June 3rd, we began preparing a new release of the Alternative Exams Portal. Release v57 would include a range of changes designed to improve the performance of the system, particularly for morning fixed-start assessments where many students attempt to access the portal at the same time.
One of these performance improvements changed our scheduling library (Quartz) to use a JDBC connection separate from the one used by the rest of the app. The aim was isolation: connections blocked waiting for PostgreSQL locks on the Quartz tables to be released would no longer affect the database connections used for retrieving domain data about the actual examinations.
We use Puppet internally to centrally manage configurations for our application node VMs and, as part of the database connection change, we had modified the application.conf file to include the new connection string and deployed this configuration change to all nodes in the cluster.
When we started the deployment of version 57 at 17:47 (we try to avoid busy times), we quickly noticed the deployment wasn’t working as we expected. The deploy had only taken effect on half of the application nodes. The deployment was repeated and failed once again. The app was now running into problems acquiring database connections - we suspected some sort of connection limit had been hit due to the addition of the new JDBC connection - and struggling to serve incoming HTTP requests.
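The arithmetic behind a suspicion like this can be sketched in a few lines of Python. This is only an illustration: the node count, pool sizes and limit are made-up numbers, and `PoolPlan`, `connections_required` and `fits_within_limit` are hypothetical names, not part of the AEP codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolPlan:
    """Connection pools opened by one application node (hypothetical sizes)."""
    app_pool_size: int     # connections used for domain data
    quartz_pool_size: int  # connections used by a dedicated Quartz pool


def connections_required(nodes: int, plan: PoolPlan) -> int:
    """Worst case: every node fills every pool at the same time."""
    return nodes * (plan.app_pool_size + plan.quartz_pool_size)


def fits_within_limit(nodes: int, plan: PoolPlan, limit: int,
                      headroom: float = 0.2) -> bool:
    """True if the worst case still leaves `headroom` (a fraction of the limit) spare."""
    return connections_required(nodes, plan) <= limit * (1 - headroom)


# Assumed figures for illustration only.
before = PoolPlan(app_pool_size=20, quartz_pool_size=0)
after = PoolPlan(app_pool_size=20, quartz_pool_size=10)

print(connections_required(12, before))         # 240
print(connections_required(12, after))          # 360
print(fits_within_limit(12, after, limit=400))  # False
```

The point of the sketch is that a second pool per node multiplies across the whole cluster, so a limit that was comfortable with one pool can be exceeded without any single node looking unusual.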
The decision was taken to back out the new release and re-deploy the known-good previous version: v56.3. This version of the app used one connection for both Quartz and the rest of the data access logic in the AEP application. The team had hoped that the configuration change made in preparation for v57 would be ignored, in the same way that other unknown configuration keys in a Play! framework app would usually be ignored, and that we could restore service simply by running the older version of the code.
This rollback unfortunately failed - v56.3 of the application simply refused to start (Failed to create Slick database config for key quartz) following the application.conf configuration change to add a new JDBC connection. By 18:05, the AEP was completely inaccessible and had been for a few minutes. Incoming requests were failing at the F5 load balancer. Most nodes had been taken out of the load balancer pool once our monitoring systems noticed they were broken. The end result was that there were no working application nodes available to serve visitors. These failures manifested as TLS connection problems in most browsers.
Whilst a configuration change rollback was initiated via Puppet to fix application.conf, we knew it would take some time for every application and scheduling node to be updated. The decision was made to speed up the recovery by removing the problematic configuration data. We did this by hand, SSHing into nodes and commenting out the lines in question with vim before forcibly restarting the systemd service and manually re-adding the node to the load balancer pool.
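The "comment out the lines in question" step could be scripted rather than done in an editor. The Python sketch below is only an illustration of that idea, not what was run during the incident. It assumes the new settings sit on flat, dotted-key lines under a prefix such as `slick.dbs.quartz`; that prefix, the sample keys and the function names are hypothetical, and nested `{ ... }` blocks are not handled.

```python
def sets_key(line: str, key_prefix: str) -> bool:
    """True if `line` assigns a key equal to, or nested under, `key_prefix`."""
    stripped = line.lstrip()
    if not stripped.startswith(key_prefix):
        return False
    rest = stripped[len(key_prefix):]
    # Require a key boundary so "quartz" does not also match "quartz2".
    return rest == "" or rest[0] in ".=:{ \t"


def comment_out_keys(conf_text: str, key_prefix: str) -> str:
    """Return `conf_text` with every matching line prefixed by '# '."""
    out = []
    for line in conf_text.splitlines(keepends=True):
        out.append("# " + line if sets_key(line, key_prefix) else line)
    return "".join(out)


# Hypothetical configuration content for illustration only.
sample = (
    'slick.dbs.default.db.url = "jdbc:postgresql://db.example/aep"\n'
    'slick.dbs.quartz.db.url = "jdbc:postgresql://db.example/aep"\n'
    'slick.dbs.quartz.db.numThreads = 2\n'
    'play.http.secret.key = "changeme"\n'
)

print(comment_out_keys(sample, "slick.dbs.quartz"))
```

Already-commented lines start with `#`, so they never match the prefix and running the function twice leaves the text unchanged after the first pass. Restarting the service and re-adding the node to the load balancer pool would still be separate steps.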
By 18:13 - 8 minutes later - we’d managed to get the service back up in a state where it could start responding to HTTP requests - albeit with a very small number of application nodes and limited redundancy in case of further failure. We could then start to work on trying to speed along the propagation of the Puppet change to the rest of the nodes ready for the Play server to be started back up again.
The vast majority of nodes were updated within 15 minutes, but a handful took slightly longer. We think this was due to overutilisation on our Puppet compile masters which meant that fetching the Puppet catalog on each node was slow - sometimes taking upwards of 10 minutes. It wasn’t until 18:50 that things were completely back to normal, with all nodes in a healthy state.
Moving forward, in the short-to-medium term we will be:
Increasing max_client_conn to cope with the increased number of connections that come from maintaining a separate, dedicated Quartz connection on each node

We have already:
Departments and students rely on the AEP being available 24 hours a day, and on this occasion we have fallen short of the standards we try our best to maintain within the team. We would like to apologise for the inconvenience caused by this incident.