This is our analysis of what happened a couple of weeks ago, when a fire broke out in the datacenter in Strasbourg. We want to share our perspective on what went right, what did not go so well and, most importantly, what we learned.
First of all: a big thanks goes to YOU, our clients! We were very grateful to see how understanding and patient you were with the situation. We understand that you rely on us to provide you with the best service and that your success is closely connected with your website, so every minute of downtime causes stress and discomfort on your side. Nonetheless you allowed us to work intensively and without interruptions to get the sites up and running as fast as possible: once more, thank you!
Let's now address the nature of this incident. What happened in Strasbourg with our server provider is exceptional and unprecedented; as stated by multiple experts, it will have an impact on the datacenter industry. It is the absolute worst thing that can happen to a datacenter provider. The fire was covered globally by hundreds of news sites, and nationally as well, where Ramsalt was mentioned in half a dozen of them. After the fire OVH ordered 15 000 new servers. The server assembly line capacity has tripled in an attempt to meet the high demand after the fire.
To put this in context, an estimated 3.6 million websites were affected, as well as email and similar services for smaller and larger organizations. Governmental sites from several countries were also affected by the fire. One datacenter burnt down completely, another was partly damaged, and two more were shut down as a safety measure due to damaged power lines. Altogether, this made over half of Ramsalt’s production and development servers inaccessible, along with most of the hot backups.
Here is a rough timeline of events from our perspective, starting from the night of 10 Mar 2021 (all times are expressed in the Europe/Oslo timezone):
- 00:35 - The first notification from our monitoring is fired about the first site being unreachable. status.ramsalt.com is being updated from this point on.
- 00:50~01:00 - Some other sites are having issues and another series of notifications is sent out.
- 01:15 - We are investigating the source of the problems.
- 01:35 - The datacenter provider reports a degradation in service performance, with little extra information (at this point).
Knowing the datacenter provider is working on the issue, we decide to monitor the situation closely but wait for more information from the provider before deciding on a course of action.
- 05:30 - We learn that our main datacenter is in flames and quickly start organising the Disaster Recovery Plan while an Emergency Recovery Team is assembled. We immediately decide to set up new servers in a new datacenter far away from the fire.
- 06:10 - The first communication is sent out to our clients informing you about the situation.
- 06:30 - The Emergency Recovery Team is ready and the process to restore most of our production environment is planned. We already understand that this will be a long process, as it requires recovering the data from our off-site backups placed in deep storage, which by default is extremely restricted to prevent unauthorized access or tampering.
- 07:30 - The first four replacement servers are being prepared to be connected to our infrastructure management service.
- 08:50 - The first site is back up!
At this point we are quite positive that we should be able to restore everything within the morning or the day; however, it is now that new difficulties appear:
Between roughly 09:00 and 17:00 the control panel of our service provider is having serious trouble, which makes our job very hard. Due to the high traffic on their infrastructure (many companies are reconstructing their environments) we encounter slow-downs, timeout errors and, in general, all sorts of problems attributable to the high demand.
- 12:30 - We are finally able to acquire all the servers needed to restore the whole production environment, but we still have trouble redirecting the Floating IPs to the new servers.
- 15:00 - At this point roughly 20% of the sites are back up and the Emergency Recovery Team is still working hard to restore the others.
- 15:45 - All servers are connected to our infrastructure management service. We still have some issues with the Floating IPs, but this is the last major blocking obstacle.
- 16:00 ~ 23:00 - We progressively restore the vast majority of the sites.
Only a very small fraction of the sites will experience downtime of more than 24 hours.
- 23:30 - The Emergency Recovery Team calls it a day, as we are now mostly waiting for DNS changes which we do not have control over.
From the next day (11 Mar) we continued working to bring back the remaining sites, which required DNS changes, and except for a couple of outliers we had all sites back within 48 hours of the fire spreading through our datacenter.
Our analysis and plan for the future
This was a large-scale event that destroyed all the production and development data, servers and most hot backups for the majority of our clients. We are in general pleased that our Disaster Recovery Plan (DRP) worked and that we managed to restore our sites. We communicated our progress mainly through status.ramsalt.com for frequent updates, and by email when we had bigger news.
What went right
As we mentioned, we are quite pleased with how we came out of a fire that took out four datacenters. What follows is what we believe to be the highlights.
- We were successful in restoring the majority of our production infrastructure in less than 12 hours and in restoring roughly 90% of the sites in less than 24 hours. This could have been far worse if we had not been properly prepared to deal with it.
- Our DRP included backups being synced to a separate service provider: this proved to be invaluable, since the other 2 backup tiers were either (literally) up in flames or simply not accessible for weeks. A 3-tier backup solution was more expensive but clearly the right choice.
- One more positive experience is the collaboration with our infrastructure partners. They really shone during this moment of crisis, helping us for many hours to restore the production environment, which allowed us to bring our clients back online in a reasonable amount of time given the exceptional situation.
Where we can improve
Given that this was a disaster of unprecedented size, there are also things we would rather not think about. However, we see it as our duty to be transparent with our clients, who put a lot of trust in us, and above all we can learn from our mistakes by analyzing them. What follows is the list of what we believe to be our worst shortcomings, which we will work to improve.
- Our backup strategy was close to perfect in theory and, to be clear, where it was properly set up it worked exactly as expected. It had one critical fault, however: it still relied on a manual setup. Although this takes just a couple of clicks and roughly 2 minutes to complete, it is still a manual action, which led to some instances lacking this additional level of safety.
Furthermore, we did not have an easy way to list the backup status of all instances, which allowed this oversight to go unnoticed.
This is by far our most painful point and biggest failure.
- The Emergency Recovery Team made an amazing effort, starting in the early hours of the day and working up to 20 hours almost without breaks. This is for sure a great feat from the team. The issue, however, was that we found it hard to scale the effort, because we lacked internal documentation on how to execute a full restore at this magnitude.
This cost us quite some time in completing the restoration of the sites.
- We encountered some sync issues while running the communication planned in the Emergency Recovery Plan. The list we relied on for the communications was slightly out of date, which meant that some sites had never been added, some others had an old contact email, and there were similar smaller issues.
The cause is that we had to keep the lists up to date manually, and the result was that we did not keep all customers informed as well as we would have wanted.
- For all sites we host we set up uptime monitoring, which notifies us within roughly 1 minute if a site is not reachable or too slow to answer. This is one of the many measures we take to ensure that we have everything under control.
However, because this relies on a manual process, some sites were lacking monitoring here too, which led to a couple of sites being down for a longer period of time than necessary.
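The common thread in the points above is a missing overview of which sites lack a given safeguard. As an illustration only, here is a minimal Python sketch of such an audit. The `Site` record, its field names and the sample inventory are hypothetical and do not describe our actual tooling; the sketch assumes that a single inventory of hosted sites is available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """One hosted site and its safeguards (hypothetical fields)."""
    name: str
    offsite_backup: bool
    uptime_monitoring: bool
    contact_email: str


def audit(sites):
    """Return {site name: [missing safeguards]} for every site with a gap."""
    gaps = {}
    for site in sites:
        missing = []
        if not site.offsite_backup:
            missing.append("off-site backup")
        if not site.uptime_monitoring:
            missing.append("uptime monitoring")
        if not site.contact_email:
            missing.append("contact email")
        if missing:
            gaps[site.name] = missing
    return gaps


# Made-up inventory, for illustration only.
inventory = [
    Site("example-a.test", True, True, "owner@example-a.test"),
    Site("example-b.test", False, True, "owner@example-b.test"),
    Site("example-c.test", True, False, ""),
]

for name, missing in audit(inventory).items():
    print(f"{name}: missing {', '.join(missing)}")
```

Run on a schedule, a report like this would surface a forgotten manual step long before an incident does.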
What we will work on
In the previous sections we highlighted what went right and what went wrong, but where do we go from here?
After the incident we had several internal discussions about what happened and, most importantly, where and how we could improve. What follows is the list of the most important areas we will focus on improving.
- First of all and most important: Automate all things!
By just looking at the list of what went wrong it is clear that our biggest issue is relying on manual tasks for critical changes.
- Integrate services so we reduce human errors and do better quality assurance for each setup.
- See if we can speed up the recovery process in case of future accidents.
- Improve communication with our clients, especially after the outage.
- Run more emergency exercises periodically.
- Customer panel aka "min side" where each client can update contact information and get status on their own services.
There are high availability solutions you can choose that would keep your services and websites running without interruption, even if a datacenter is taken out by a fire. This is achieved with a load balancer and another mirrored server in a different location and datacenter. High availability clusters are more costly, but might be worth the investment if you can’t afford the risk of not having your services available. Get in touch with us for more information.
We would like to thank all our patient clients who believe in our services. We do not take this lightly, and will continue to improve our services and routines to ensure better stability and to keep your sites and services up and running.
If you have any questions please do not hesitate to contact us.