Gmail, Google's email service, was down for approximately 100 minutes on September 1, 2009. Google has apologised for the outage and said that it has thoroughly investigated the issue and plans to fix or improve a list of things to ensure that such an event does not occur again.

The outage began when a few of Gmail's servers were taken offline for routine upgrades. The company said that it had underestimated the load that some recent changes placed on its request routers. As a result, a few of the request routers became overloaded and in effect transferred their load onto the remaining request routers. Within minutes nearly all of the request routers were overloaded.
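The cascade described above can be illustrated with a toy model. The following Python sketch is not Google's system: the function name, the router capacities and the load figure are all hypothetical, and it assumes that an overloaded router drops out entirely and that its traffic is shared evenly among the routers still running.

```python
def simulate_cascade(total_load, capacities):
    """Return the number of routers still serving after each round.

    Assumption (illustrative only): traffic is split evenly across live
    routers, and any router whose share exceeds its capacity drops out,
    pushing its share onto the rest.
    """
    alive = list(capacities)
    history = [len(alive)]
    while alive:
        share = total_load / len(alive)
        survivors = [c for c in alive if c >= share]
        if len(survivors) == len(alive):
            break  # every live router can carry its share; stable
        alive = survivors
        history.append(len(alive))
    return history


# Hypothetical fleet: ten routers with capacities 100, 105, ..., 145.
capacities = [100 + 5 * i for i in range(10)]

# Full fleet: each router carries 90 units, within every capacity.
print(simulate_cascade(900, capacities))      # [10]

# Two routers offline for upgrades: the share rises to 112.5, the
# weakest three fail, and the remaining five then fail together.
print(simulate_cascade(900, capacities[:8]))  # [8, 5, 0]
```

In this made-up example, removing two routers out of ten is enough to take the whole fleet down in two rounds, because each failure raises the share that the survivors must carry.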

Because user requests could not be routed to a Gmail server, people using the web interface could not access Gmail. The company added that IMAP/POP access and mail processing continued to function normally because these requests do not use the same routers.

The Gmail team said that it was alerted to the issue within seconds and, after identifying the actual problem, brought additional request routers online to spread the traffic across more of them.

Ben Treynor, VP of engineering and Google's site reliability czar, wrote on the Gmail blog: “We’ve turned our full attention to helping ensure this kind of event doesn’t happen again.”

The company said that it plans to increase request router capacity beyond peak demand to provide headroom. Mr Treynor added that request routers should have sufficient failure isolation so that a problem in one datacenter does not affect servers in another. He also said that when many request routers are overloaded at the same time, they should slow down rather than refuse traffic and shift their load elsewhere.
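Two of these ideas, capacity headroom and slowing down under overload, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of Google's implementation: the function names, the 25% headroom figure and the linear slowdown model are all hypothetical.

```python
import math


def routers_needed(peak_load, capacity_per_router,
                   headroom=0.25, offline_allowance=0):
    """Routers to provision so that peak load plus headroom is covered
    even while `offline_allowance` routers are down for maintenance.
    The 25% default headroom is an illustrative assumption."""
    required = math.ceil(peak_load * (1 + headroom) / capacity_per_router)
    return required + offline_allowance


def degraded_latency(load, capacity, base_latency_ms):
    """Response time under a 'slow down, do not refuse' policy.
    Assumes, for illustration, that latency stretches in proportion
    to the overload and is unchanged when load is within capacity."""
    return base_latency_ms * max(1.0, load / capacity)


# 900 units of peak load, 100 units per router, two routers allowed
# offline: 12 routers cover 1125 units, plus 2 spare.
print(routers_needed(900, 100, headroom=0.25, offline_allowance=2))  # 14

# A router at 150% of capacity answers in 75 ms instead of 50 ms,
# but keeps its traffic rather than pushing it onto its neighbours.
print(degraded_latency(150, 100, 50))  # 75.0
```

The point of the second function is that an overloaded router which merely gets slower does not add to the load on the others, so the chain reaction seen in the outage has nothing to feed on.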

Reportedly, this is the second Gmail outage, following one on August 31 that wiped out email for a ‘small subset’ of users. Earlier, Gmail had suffered outages of four hours in February and 20 minutes in May.