On 10th May 2022 some of our applications started displaying a server problem message instead of the application itself. Within a couple of hours we had identified a workaround to restore access, and we now have a rough understanding of the root cause. We apologise for the inconvenience caused by the applications being unavailable on that day. What follows explains what happened, how it was fixed and how we plan to avoid it happening again.
Each of our applications runs as a pool of backend servers behind an F5 load balancer appliance, which spreads requests across the backend servers. The appliance periodically monitors each backend server to make sure it’s still usable, and removes it from the pool if it decides it isn’t. On the 10th the appliance suddenly decided that it couldn’t reach some of the backend servers, and for some applications this included every member of the pool, making those applications unavailable. This affected a relatively small number of our applications, but unfortunately one of them was Single Sign-on, which most of our applications rely on, so when visiting a working application you would have been redirected to a non-working Single Sign-on.
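The pool behaviour described above can be illustrated with a small toy model. This is only a sketch for explanation, not how the F5 appliance is implemented: the `Pool` class, the member names and the round-robin routing are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Callable, Dict, List


@dataclass
class Pool:
    """Toy model of a load-balanced pool (hypothetical, not F5's implementation)."""
    members: List[str]
    monitor: Callable[[str], bool]  # returns True if the member looks usable
    status: Dict[str, bool] = field(default_factory=dict)

    def run_monitor(self) -> None:
        # One monitoring pass: record whether each member passed its check.
        for member in self.members:
            self.status[member] = self.monitor(member)

    def available(self) -> List[str]:
        # Only members that passed the most recent check receive traffic.
        return [m for m in self.members if self.status.get(m, False)]

    def route(self, n_requests: int) -> List[str]:
        # Spread requests over the available members in round-robin order.
        up = self.available()
        if not up:
            raise RuntimeError("no available pool members")
        chooser = cycle(up)
        return [next(chooser) for _ in range(n_requests)]


# Hypothetical members: the monitor believes app-2 is unreachable.
monitor_view = {"app-1": True, "app-2": False}
pool = Pool(["app-1", "app-2"], monitor=lambda m: monitor_view[m])
pool.run_monitor()
routed = pool.route(3)  # every request goes to app-1

# If the monitor marks every member down, the whole application is unavailable.
dead_pool = Pool(["sso-1", "sso-2"], monitor=lambda m: False)
dead_pool.run_monitor()
```

In this model, `dead_pool.route(1)` raises an error because no member passed its check, which roughly corresponds to the server problem message users saw.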
After some investigation to rule out firewall issues, we discovered that removing the monitor check brought the application back. The appliance was able to reach the application; it was only its monitor that couldn’t. To add to the confusion, re-adding the same monitor would often work, which suggested a fault with the appliance itself.
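The effect of the workaround can be sketched in the same spirit. The function below is a hypothetical illustration, and it assumes, based on the behaviour described above, that a pool with no monitor attached simply treats all of its members as available.

```python
from typing import Callable, List, Optional


def available_members(
    members: List[str],
    monitor: Optional[Callable[[str], bool]],
) -> List[str]:
    """Return the members that would receive traffic (toy model, not F5 code)."""
    if monitor is None:
        # Assumption: with no monitor attached, nothing marks a member down,
        # so every member stays eligible for traffic.
        return list(members)
    return [m for m in members if monitor(m)]


def faulty_monitor(member: str) -> bool:
    # Hypothetical stand-in for the misbehaving check: it reports every
    # member as unreachable even though the servers themselves are healthy.
    return False


members = ["app-1", "app-2"]  # hypothetical member names
with_monitor = available_members(members, faulty_monitor)  # nothing available
without_monitor = available_members(members, None)  # workaround: all available
```

The trade-off is that without a monitor the appliance has no way to notice a backend server that really has failed, which is why removing the check is a workaround rather than a fix.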
We have a load balancer appliance in each data centre so that one can fail over to the other if it needs to, so an emergency change was scheduled for Sunday 15th May to attempt that failover at an off-peak time. The failover was successful, and testing showed that the second appliance did not have the issue, confirming a fault with the original appliance.
At the moment we are in a somewhat risky state: if the active appliance failed for any reason, unlikely though that is, the only appliance it could fail over to is the faulty one. If that did happen, we do still have the original workaround. We have started conversations with the F5 support team about the problem. We were also already in the process of acquiring brand new F5 load balancers, which will replace these two appliances over the coming months and should be more reliable.