Thursday, February 26, 2009

GMail outage...

A couple of days ago there was a GMail outage. Short summary: taking down a data center, combined with a bug in the load distribution code, left a data center overloaded, and so a bunch of people couldn't get to their mail.

A friend ribbed me about the readiness of Cloud Computing, on the back of this outage. My initial reaction was to defend Google on uptime stats alone. Now that Google have spread the love and admitted the cause, I'm going out on a limb and stating that very few (if any) corporate IT shops are prepared to deal with entire data centers being taken offline for maintenance.

I mean, sure, there are corporates who have backup data centers, and even have machines in standby DR mode. I'd be exceedingly surprised, however, if everything actually worked. Most DR tests I've seen were epic fails, and the fun thing is that DR readiness tests usually happen once a year, if at all.

If a corporate wants DR to work, it has to be a regular thing. Schedule a DR cutover every Friday night, and keep repeating it until it works flawlessly. Then repeat it every month to make sure it keeps working flawlessly. Otherwise business as usual is going to be in for a shock when things go pear-shaped for real.