Tuesday, June 15, 2010

Fix to prevent repeat of May outage of Google’s cloud to sacrifice performance

Company releases detailed account of the App Engine outage and explains lessons learned (6/10/2010)

The team that operates App Engine, Google’s cloud-based application development platform, said its highest priority after fixing the latency problems that caused the platform to grind to a halt for more than 2.5 hours in May would be creating an alternative configuration of Datastore, the cloud database offered as part of the App Engine service and the source of the outage.

“It is critical to offer an alternative configuration of the Datastore,” the team wrote in an outage post-mortem released Thursday. “This implementation should be much less susceptible to outages and will prevent any replication loss during outages, but will trade off performance.”
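The trade-off the team describes is essentially synchronous versus asynchronous replication: acknowledging a write as soon as the primary data center commits it is fast but can lose writes that have not yet reached the secondary, while waiting for the secondary slows every write but leaves nothing behind during a failover. Here is a minimal Python sketch of that contrast, assuming a made-up ToyDatastore class; none of these names reflect Google's actual implementation.

```python
import queue
import threading


class ToyDatastore:
    """Minimal model of the replication trade-off described above.

    Illustrative sketch only; the class, its fields, and its methods are
    made up and do not reflect Google's actual Datastore.
    """

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = {}      # data committed in the primary data center
        self.secondary = {}    # replica in the secondary data center
        self._backlog = queue.Queue()
        # Background replication thread used only in the asynchronous mode.
        threading.Thread(target=self._drain_backlog, daemon=True).start()

    def put(self, key, value):
        self.primary[key] = value
        if self.synchronous:
            # Proposed configuration: do not acknowledge the write until the
            # secondary has it, so a failover loses nothing, at the cost of a
            # cross-datacenter round trip on every write.
            self.secondary[key] = value
        else:
            # Original configuration: acknowledge immediately and replicate in
            # the background. Faster, but writes still sitting in the backlog
            # are lost if the primary data center fails.
            self._backlog.put((key, value))

    def _drain_backlog(self):
        while True:
            key, value = self._backlog.get()
            self.secondary[key] = value
```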

The incident began when Datastore started seeing increased latency caused by infrastructure instability around 12:30 p.m. on May 25. The latency caused write replication to the secondary data center to slow “to a crawl.” The App Engine team began the failover procedure about five minutes after the latency increase was noticed. By 1:05 p.m., read queries were being served out of the secondary data center, and about 10 minutes later the team made an external announcement of the outage.

Ten minutes after that, the secondary data center began serving both read and write traffic; however, latency was still causing a high rate of request time-outs. By 2:20 p.m., all applications except the largest ones had stabilized, and by 3:10 p.m. all applications had returned to a normal state of operation.

Here’s what caused it:
A repository component of Bigtable (a distributed storage system) that determines the location of entities in the distributed system became overloaded because of instability in the compute cluster. The overload prevented requests from learning where to send Datastore operations in time to avoid read and write request time-outs, which occur 30 seconds after a request is made.
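The 30-second limit means each request has a fixed budget that the location lookup and the Datastore operation must share; when the lookup itself is slow, there is no time left to perform the actual read or write. A rough Python sketch of that failure mode follows, with hypothetical names (route_datastore_operation, locate_entity); it is not App Engine's real request-handling code.

```python
import time

REQUEST_DEADLINE_SECONDS = 30  # the 30-second limit mentioned above


class DeadlineExceededError(Exception):
    """Raised when a request runs out of time before it can be served."""


def route_datastore_operation(locate_entity, key, started_at=None):
    """Hypothetical sketch of routing one Datastore read or write.

    locate_entity stands in for the Bigtable component that maps an entity
    to the server holding it. When that component is overloaded, the lookup
    alone can consume the whole 30-second budget, so the operation times out
    before it is ever sent anywhere.
    """
    started_at = time.time() if started_at is None else started_at
    remaining = REQUEST_DEADLINE_SECONDS - (time.time() - started_at)
    if remaining <= 0:
        raise DeadlineExceededError("request exceeded its 30-second deadline")
    # Ask the (possibly overloaded) location component where to send the
    # operation, giving it only the time left in the request's budget.
    return locate_entity(key, timeout=remaining)
```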

Delays in processing Datastore requests caused other requests to stack up beyond App Engine’s safety limit, at which point all App Engine requests failed.
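The safety limit works like a cap on in-flight requests: when slow Datastore calls leave too many requests pending, new requests are rejected outright instead of queueing without bound. The sketch below illustrates the idea with a made-up class and threshold, not App Engine's actual internals.

```python
class PendingRequestLimiter:
    """Hypothetical cap on concurrently pending requests (load shedding)."""

    def __init__(self, max_pending=500):  # illustrative limit, not Google's
        self.max_pending = max_pending
        self.pending = 0

    def try_admit(self):
        """Return True if the request may proceed, False to fail it fast."""
        if self.pending >= self.max_pending:
            return False  # beyond the safety limit: reject instead of stacking up
        self.pending += 1
        return True

    def release(self):
        """Call when a request finishes, freeing its slot."""
        self.pending = max(0, self.pending - 1)
```

A request handler would call try_admit() before doing any work and release() when it finishes; during the outage, slots were not freed quickly enough, so admission began failing for every request.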

Because of the outage, some writes created by the primary Datastore were not applied to the secondary Datastore during failover, “causing the mirror image between the primary and secondary Datastore to be out of sync.” Google said its team would contact all administrators whose applications were affected and instruct them to take appropriate action. The company said two percent of all applications were affected.

- wong chee tat :)
