For approximately 2 hours, between 10:15am and 12:35pm, the majority of the storage servers behind our Cloud were unavailable. The Power Distribution Unit for the rack failed, resulting in 8 storage nodes losing power. This included the storage node that hosts our own website, so our website was also unavailable during this time.

Our engineers have bypassed the problem and routed new power lines to the rack. All nodes are currently back online. All affected Cloud VMs are being restored now. 

During the first few hours after recovery, disk performance may be slower than usual while our system checks for data integrity issues and our cache is rebuilt.

In light of this failure, we plan to implement a more distributed storage design, as too much of our storage was located in a single rack. We will also add power circuits from separate sources, which will help to avoid single points of failure in the power distribution network. Both changes will be implemented in full before the end of 2021.

Saturday, June 5, 2021
