As you may be aware, we recently suffered the worst single incident in our history due to a power outage at our Leeds data centre on Wednesday afternoon.
Emergency maintenance work was being carried out on the load transfer module, which feeds power from our external energy supplies to the data centre hall that holds the majority of our servers. The data centre has two dual-feed uninterruptible power supplies, both backed by diesel generators in case of National Grid outages.
Unfortunately, a safety mechanism within the device triggered incorrectly and caused a power outage of less than 9 minutes. As a result, approximately 15,000 servers were hard rebooted. Beyond a fire, this is the worst possible event that a hosting company can face. A full post mortem is currently being carried out to determine how power was lost on both supplies, even though the work was carried out alongside an external engineer from the hardware manufacturer.
What happens when servers hard reboot?
Web servers and virtual servers typically perform database transactions at a very high rate, meaning that the risk of database or file system corruption is quite high when a hard reboot occurs.
Following the restoration of power, our first priority was to get our primary infrastructure boxes back online, then our managed and unmanaged platforms. Our managed platforms are built to be resilient, so although we lost a number of servers in the reboot, the majority of our platforms came up cleanly. We faced some issues with our Premium Hosting load balancers, which needed repairing, so some customer sites were offline for longer than we would have hoped. We are adding further redundant load balancers and modifying the failover procedure over the next 7 days as an extra precaution for us and our customers.
On our shared hosting platform, a number of NAS drives, which sit behind the front-end web servers and hold customer website data, crashed and could not be recovered. However, they are set up in fully redundant pairs and the NAS drives themselves contain 8+ disk RAID 10 arrays. In every case but one, at least one server in each pair came back up cleanly, or in an easily repairable state, and customer websites were back online within 2-3 hours.
In a single case, for the cluster containing web 75-79, which represents just under 2% of our entire shared platform, both NAS drives failed to come back up. Following our disaster recovery procedure, we began attempts to restore the drives whilst simultaneously building new NAS drives in case they were required. Unfortunately, the servers gave a strong, but false, indication that they could be brought back into a functioning state, so we prioritised attempts to repair the file system.
Regrettably, following a ‘successful’ repair, performance was incredibly poor due to the damage to the file system, and we were forced to proceed to the next rung of our disaster recovery procedure. The further we step into the disaster recovery process, the greater the recovery time, and here we were looking at a total 4TB restore from on-site backups to new NAS drives. (For your information, the steps after that are to restore from offsite backup and, finally, from tape backup, although we did not need to enact these steps.) At this point, it became apparent that the issue would take days rather than hours to resolve, and the status page was updated with an ETA. We restored sites to the new NAS drives alphabetically in a read-only state, and the restoration completed late on Sunday afternoon.
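To make the idea of "rungs" concrete, here is a minimal sketch of an escalation ladder: each step is tried in order, and the next one is only attempted if the previous one fails. The rung names follow the steps described above; the function name and the stand-in actions are assumptions for illustration only and do not represent our actual tooling.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each rung is (name, action). An action returns True when the data is back
# in a usable state. The callables below are hypothetical stand-ins.
Rung = Tuple[str, Callable[[], bool]]


def run_recovery_ladder(rungs: Sequence[Rung]) -> Optional[str]:
    """Try each rung in order and return the name of the first that succeeds."""
    for name, action in rungs:
        if action():
            return name
    return None  # every rung failed


# Illustration of the web 75-79 cluster: the file system repair completed
# but was too slow to be usable, so it is treated as a failure here.
ladder = [
    ("repair file system", lambda: False),
    ("restore from on-site backups", lambda: True),
    ("restore from offsite backups", lambda: True),
    ("restore from tape backups", lambda: True),
]
recovered_by = run_recovery_ladder(ladder)
# recovered_by == "restore from on-site backups"
```

Because later rungs are slower, the ordering trades recovery time for certainty: the cheap option is tried first, and the expensive one is only reached when needed.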
A full shared cluster restore from backups to new NAS is a critical incident for us, and we routinely train our engineers on disaster recovery steps. Our disaster recovery process functioned correctly, but because the event did not occur in isolation, we were unable to offer the level of individual service that we really wanted to, and that you would expect from us (e.g. individual site migration during restoration).
Given the magnitude of this event, we are currently investigating plans to split our platform and infrastructure servers across two data centre halls, which would allow us to continue running in the event of complete power loss to one. This added reliability is an extra step that we feel is necessary to put in place to ensure that this never happens again for our customers.
VPS and Dedicated Servers
For our unmanaged platforms (VPS and Dedicated Servers), the damage was more severe, as by default these servers are not redundant or backed up. In particular, one type of VPS was more susceptible to data corruption in the event of a power loss due to the type of caching the host servers use. We have remedied this issue on all re-built VPS involved in the outage, and no active or newly built VPS now suffer from this issue.
We did lose two KVM hosts (the host servers that hold VPS: approximately 60-80 servers per VPS KVM host, 6-12 servers per Hybrid KVM host). The relatively good news was that the underlying VPS data was not damaged. We also lost two KVM network switches, which needed to be swapped out, and this resulted in intermittent network performance on other VPS during the incident.
To bring these VPS back online, replacement KVM hosts had to be built and the VPS data copied across from each failed host. For every other VPS, the host servers were back up and running within 2 hours, but in many cases, the file systems or databases of the virtual machines on those servers were damaged by the power loss. For these VPS, by far the quickest way for customers to get back up and running was a rebuild and restore from backups (either offsite or via our backup service).
However, we quickly realised that many of the affected VPS customers did not have any backups (whether with us or elsewhere), and that the only copy of the server’s data was held in a partially corrupted form on our KVM hosts, so we took steps to try to get customers back online. For every affected VPS, we ran an automated fsck (file system check) in an effort to bring the servers back online without manual intervention. This would not, however, fix issues with MySQL, which were the most common problem because of the high transaction rate. Tables left open during a power loss are likely to contain corrupted data, so we provided a do-it-yourself guide to help get MySQL into a working state.
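For readers unfamiliar with these tools, the sketch below shows roughly what a non-interactive file system check and a MySQL table check look like as commands. It only builds the command lines and does not run them. The device name is hypothetical, and the mysqlcheck step is an assumption about one common approach, not a reproduction of the do-it-yourself guide mentioned above.

```python
from typing import List


def fsck_command(device: str) -> List[str]:
    """Build a non-interactive file system check for one block device.

    -y answers "yes" to every repair prompt, which is what makes an
    automated run possible. The file system should not be mounted
    read-write while the check runs.
    """
    return ["fsck", "-y", device]


def mysql_repair_command() -> List[str]:
    """Build a check-and-repair pass over every database.

    mysqlcheck needs a running MySQL server, and --auto-repair only helps
    storage engines that support REPAIR TABLE (for example MyISAM).
    """
    return ["mysqlcheck", "--all-databases", "--auto-repair"]


# Hypothetical device name, for illustration only.
commands = [fsck_command("/dev/vda1"), mysql_repair_command()]
for argv in commands:
    print(" ".join(argv))
```

A file system check can only restore structural consistency; it cannot know whether a database table was half-written, which is why the MySQL step is separate.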
We provided the option for us to attempt a repair, which typically takes 2-3 hours per server with an expected success rate of approximately 20%. We currently have a backlog of servers we have agreed to attempt to recover, but given the time per investigation, this is likely to take most of the week. This is roughly equivalent to the total loss of our NAS pair and is where disaster recovery steps (server rebuild and backup restoration) should be followed.
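To show why the repair backlog moves slowly, the figures above can be combined into a rough cost per recovered server. This is back-of-envelope arithmetic only. It assumes every attempt takes the quoted time and succeeds independently at the quoted rate, and the function name is our own for illustration.

```python
def hours_per_recovered_server(hours_per_attempt: float, success_percent: int) -> float:
    """Expected engineer hours spent for each server actually recovered."""
    if not 0 < success_percent <= 100:
        raise ValueError("success_percent must be between 1 and 100")
    return hours_per_attempt * 100 / success_percent


low = hours_per_recovered_server(2.0, 20)   # 10.0 hours
high = hours_per_recovered_server(3.0, 20)  # 15.0 hours
```

In other words, at a 20% success rate roughly five attempts are needed per recovered server, which is why a rebuild and restore from backups is so much faster where backups exist.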
As these servers are unmanaged, there is no disaster recovery process in place by default. I know this isn’t the answer many of you want to hear, and most of all we want to ensure that this can never happen to you again. All VPS hosts are now set to be far more resilient in the event of a sudden power loss.
Support and Communications
During this incident, we have worked our hardest to ensure that our entire customer base was kept informed of our progress through our status page.
Given the scale of the issue, the load on our Customer Services team was far in excess of normal levels. On a standard day, we handle approximately 800 support tickets, which can rise to 1600 during a fairly major incident. At absolute capacity, we can handle approximately 2000 new tickets per day.
This event was unprecedented: during and following the incident we received in excess of 5000 new support tickets every day (excluding old tickets that were re-opened), and the ticket complexity was far higher than usual. Our admin system was not set up to handle this number of requests (it is poll-heavy, to give our team quick updates on our ticket queue). This heavily impacted the performance of our control panel and ticketing system until we made alterations to make it far less resource intensive.
After this, we took immediate steps to ameliorate the incredible support load via automated updates to affected customers, but most of the tickets required in-depth investigation and server repairs that require a high level of technical capability, so could only be addressed by our second line and sysadmin staff. It will take some time to clear our entire ticket backlog and restore normal ticket SLAs.
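The ticket figures quoted above give a feel for how quickly a backlog builds and how slowly it clears. The sketch below is simple arithmetic under stated assumptions: constant daily volumes, three days at incident volume (an illustrative figure, not a measured one), and a return to the standard 800 new tickets per day afterwards.

```python
def backlog_after(days: int, incoming_per_day: int, capacity_per_day: int) -> int:
    """Unanswered tickets after `days`, assuming constant daily figures."""
    daily_growth = max(incoming_per_day - capacity_per_day, 0)
    return days * daily_growth


def days_to_clear(backlog: int, incoming_per_day: int, capacity_per_day: int) -> float:
    """Days needed to clear a backlog once volume is below capacity."""
    spare = capacity_per_day - incoming_per_day
    if spare <= 0:
        return float("inf")  # the backlog never shrinks
    return backlog / spare


# 5000 incoming against a capacity of 2000, for an assumed three days.
backlog = backlog_after(3, incoming_per_day=5000, capacity_per_day=2000)  # 9000
# Back at the standard 800 per day, 1200 per day of spare capacity remains.
clear_days = days_to_clear(backlog, incoming_per_day=800, capacity_per_day=2000)  # 7.5
```

This ignores the higher complexity of incident tickets, so the real clearance time is likely to be longer than the simple figure suggests.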
We had planned to go live with a brand new, customer-specific Heart Internet status page on the day of the outage, as it would allow us to provide greater detail for direct customers without the requirement that messages be white labelled and generic.
We did not push this live during the incident as we needed all hands on deck to fix the live issues, but we have just made it live at status.heartinternet.uk (it will later also be available at http://heartstatus.uk using external DNS). The service allows for subscription via email, SMS, and RSS, so you will be kept up-to-date during any major incident. Past events are also archived and remain fully visible. We will also use this page to inform you of any changes to the platform or scheduled work.
Most of all we’d like to apologise to you, and to your customers. We know as much as anyone how important staying online is to your business. The best thing we can do to regain your trust is to offer good, uninterrupted service long into the future, and that is now our utmost priority.