Here is our analysis of what happened a couple of weeks ago with the fire in the datacenter in Strasbourg. We want to give you some insight into our perspective: what went right, what didn't go so well, and most importantly what we learned.
First of all: a big thank you goes to YOU, our clients! We were very grateful to see how understanding and patient you were with the situation. We understand that you rely on us to provide the best possible service and that your success is closely tied to your website, so every minute of downtime causes stress and discomfort on your side. Nonetheless, you allowed us to work intensively and without interruptions to get the sites up and running as fast as possible: once more, thank you!
Let's now address the nature of this incident. What happened in Strasbourg with our server provider is exceptional and unprecedented; as stated by multiple experts, it will have an impact on the datacenter industry. It is absolutely the worst thing that can happen to a datacenter provider. News about the fire was covered globally by hundreds of news sites as well as nationally, and Ramsalt was mentioned in half a dozen of them. After the fire OVH ordered 15 000 new servers, and their server assembly line capacity was tripled in an attempt to meet the high demand.
To put this in context, an estimated 3.6 million websites, as well as email and similar services for smaller and larger organizations, were affected. Governmental sites from several countries were also affected by the fire. One datacenter burnt down completely, another was partly damaged, and two more were shut down as a safety measure due to damaged power lines. Altogether this made over half of Ramsalt's production and development servers inaccessible, taking most of the hot backups with them.
Here is a rough timeline of events from our perspective, starting from the night of 10 Mar 2021 (all times are expressed in the Europe/Oslo timezone):
At this point we are quite positive that we should be able to restore everything within the morning or the day, but it is now that additional difficulties appear:
Between roughly 09:00 and 17:00 our service provider's control panel is having serious trouble, which makes our job very hard. Due to the high traffic on their infrastructure (many companies rebuilding their environments) we encounter slow-downs, timeout errors, and in general all sorts of problems attributable to the high demand.
From the next day (11 Mar) we continued working on the remaining sites, which required DNS changes, and except for a couple of outliers we had all sites back within 48 hours of the fire spreading to our datacenter.
This was a large-scale event that destroyed the production and development data, the servers, and most of the hot backups for the majority of our clients. We are in general pleased that our Disaster Recovery Plan (DRP) worked and that we managed to restore our sites. Communication about our progress went mainly through status.ramsalt.com for frequent updates, and by email when we had bigger news.
As we mentioned, we are quite pleased with how we came out of a fire that took out four datacenters. What follows are what we believe to be the highlights.
Considering this was a disaster of unprecedented size, there are also things we would rather not think about. But we see it as our duty to be transparent with the clients who put a lot of trust in us, and most of all we can only learn from our mistakes by analyzing them. What follows is a list of the worst setbacks we encountered and which we will work to improve.
In the previous section we highlighted what went right and what went wrong, but where do we go from here?
After the incident we had several internal discussions about what happened and, most importantly, where and how we could improve. What follows is a list of the most important areas we will focus on improving.
There are high availability solutions that can keep your services and websites running without interruption, even if a datacenter is taken out by a fire: a load balancer directs traffic between your primary server and a mirrored server in a different location and datacenter, so if one goes down the other takes over. High availability clusters are more costly, but might be worth the investment if you cannot afford the risk of your services being unavailable. Get in touch with us for more information, and see the sketch below for the basic idea.
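To make the failover idea concrete, here is a minimal sketch in Python (standard library only) of the health-check logic a load balancer performs: it probes two mirrored servers in different datacenters and routes traffic to the first one that responds. The hostnames primary.example.com and mirror.example.com are placeholders rather than real Ramsalt infrastructure, and in practice you would use a managed load balancer or a dedicated proxy rather than rolling your own.

```python
from typing import Optional
import urllib.request

# Hypothetical mirrored backends in two different datacenters (placeholder URLs).
BACKENDS = [
    "https://primary.example.com/health",
    "https://mirror.example.com/health",
]


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the backend answers its health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # urllib.error.URLError is a subclass of OSError
        return False


def pick_backend() -> Optional[str]:
    """Send traffic to the first healthy backend; return None if all are down."""
    for url in BACKENDS:
        if is_healthy(url):
            return url
    return None


if __name__ == "__main__":
    target = pick_backend()
    if target:
        print(f"Routing traffic to {target}")
    else:
        print("All backends are down; serve a maintenance page instead")
```

In a real high availability setup this check runs continuously and the switchover happens automatically, which is exactly what keeps a site online when one datacenter disappears.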
We would like to thank all our patient clients who believe in our services. We do not take this lightly, and we will continue to improve our services and routines to ensure better stability and keep your sites and services up and running.
If you have any questions please do not hesitate to contact us.