Hosting Outage of January 11, 2018 – Best news article

On January 11, 2018, our shared hosting platform suffered an embarrassing outage that started in the mid-afternoon and lasted into the early evening.  We apologize sincerely to our clients for this outage and the inconvenience it caused.  Only web hosting was affected; email and DNS services were not affected in any way.

The cause of the outage was a set of patches applied to our server to help mitigate the threat of the Meltdown/Spectre bugs.  A good non-techy explanation of these bugs can be found here: CloudFlare Blog.  While our server patches are prepared and vetted by a third-party partner, nothing is perfect.  The Meltdown/Spectre bugs are so serious that many vendors are rushing patches into production, and some of these patches are causing serious issues.

Soon after applying our patches on January 8, our team started noticing unusual performance issues, similar to what many other companies have reported.  We were unable to determine whether these issues were caused by our own server patches or by the patches that Amazon applied to AWS, the service our servers run on.  Unfortunately, today the performance issues became quite severe, and when our team attempted to restart the hosting service, it refused to start again.

Our team diagnosed the problem and discovered that the issue was a bug in the latest kernel patch.  Once this determination was made, our team manually rolled back the kernel patch, and the server was up again within 1 hour of the issue being discovered.  However, manually rolling back a kernel patch means that each individual website on our server needs its IP address re-configured, a process that takes many hours.  While the root cause of the outage was fixed within 60 minutes, the cleanup took a further 4 hours to complete.

As a result of this outage, our team has implemented an automatic kernel patch rollback system.  In the future, if a kernel patch causes a frozen service, our system will automatically roll back the patch, with no manual intervention and no domain reconfiguration required.

Today’s issue has happened only one other time in the history of our hosting systems, but we feel it is very important to address it permanently: patches are becoming more and more necessary, and it is likely that this could happen again.  Our new automatic kernel patch rollback system is expected to resolve such an issue automatically in less than 5 minutes.
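For readers curious about what an automatic rollback of this kind might look like, here is a minimal sketch in Python.  It is not our production system, only an illustration of the general idea: after a patch is applied, poll a health check for a grace period, and trigger a rollback if the service never comes back.  The function name, the `is_healthy` and `roll_back` callbacks, and the 120-second grace period are all hypothetical and chosen purely for illustration.

```python
import time
from typing import Callable


def watch_after_patch(
    is_healthy: Callable[[], bool],
    roll_back: Callable[[], None],
    grace_seconds: float = 120.0,   # hypothetical value, not our real setting
    poll_seconds: float = 10.0,     # hypothetical value
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Watch a service after a patch has been applied.

    Returns True if the service reported healthy within the grace period.
    Otherwise calls roll_back() exactly once and returns False.
    """
    start = clock()
    while clock() - start < grace_seconds:
        if is_healthy():
            return True          # service is up, keep the patch
        sleep(poll_seconds)      # wait before checking again
    # The service never recovered within the grace period: undo the patch.
    roll_back()
    return False
```

In a real deployment, `is_healthy` might request a known page from the web server and `roll_back` might boot the previous kernel; both are left as callbacks here because those details depend entirely on the hosting environment.  Keeping the grace period well under 5 minutes leaves time for the rollback itself to complete.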

Again, we apologize to our clients and thank you for your patience during this matter.

 
