Apathy Sketchpad

 

As you probably noticed, the site was not working for most of yesterday. This is the reason, according to my hosts’ offsite status page (which you can check during any downtime to find out what’s up):

Somebody shut off the power to the entire third floor of our Phoenix location. We are investigating. …

On-site technicians claim this is scheduled maintenance of the power grid. I don’t know who they’re supposed to have scheduled with, but it wasn’t us. …

“Escalate” is now a very angry verb.

The story as of now: the building owner allegedly scheduled this power maintenance. Our colo provider, which owns the cage space in that building where our equipment is located, claims they notified us. They didn’t.

They have finished screwing around with half of our equipment (the “A” power feeds). They will begin shutting down the B feeds shortly. We’re looking to see what we can get running on half power.

Not much. They turned the power back on, but none of the servers. …

They’ve acknowledged their failure to notify us. Small consolation. We’re still working on recovering all the servers that died in mid-file-write.

Our master MySQL server seems to be the primary problem at this point. We are working to resolve it as quickly as we can, but it seems to have blown enough drives that the RAID won’t start. We’re working on multiple recovery strategies in parallel. Unfortunately, the server with the most recent backups is still checking its disks.

We are still hard at work on this. We’re taking care of the physical drive swaps; it’s time consuming to locate enough spares at 5am on a Sunday morning, much less replicate them all. …

Moral of this story: There’s no point in booting FreeBSD off of mirrored drives, because if one of the drives fails (for example by having its power flipped on and off over and over), it won’t boot anyway. We have multiple examples. :-(

For those who want to know, the “power maintenance” was perpetrated by the building owner, Digital Realty Trust. The lack of communication was the responsibility of our cage broker. The time-to-recover is our responsibility, as we had no plan for this many critical servers being rendered inoperable all at once. There will be a reckoning, but I think we’re going to have to go back to the drawing board and finish our plans for a truly distributed shared hosting network with zero single points of failure.

Our recovery efforts are proceeding. We are being double-extra-super cautious, because data loss is unacceptable.

On the plus side, they can’t charge me while the site’s down.
