Fix to prevent repeat of May outage of Google’s cloud will sacrifice performance
Company releases detailed account of the
App Engine outage and explains lessons learned
(6/10/2010)
The team that operates App Engine,
Google’s cloud-based application development platform, said its highest
priority after fixing the latency problems that caused the platform to
grind to a halt for more than 2.5 hours in May would be creating an
alternative configuration of Datastore, the cloud database offered as
part of the App Engine service and the component that caused the outage.
“It is
critical to offer an alternative configuration of the Datastore,” the
team wrote in an outage post-mortem released Thursday. “This
implementation should be much less susceptible to outages and will
prevent any replication loss during outages, but will trade off
performance.”
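The trade-off the team describes is essentially synchronous versus asynchronous write replication. The sketch below illustrates that idea in Python; it is not Google’s Datastore code, and the class names and latency figure are assumptions made for the example.

import time

CROSS_DC_LATENCY_S = 0.05  # assumed one-way latency to the secondary data center


class AsyncReplicatedStore:
    """Acknowledge a write once the primary has it; replicate in the background."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.pending = []  # writes not yet applied to the secondary

    def put(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))  # replication happens later
        return "ack"                       # fast: no cross-data-center round trip

    def fail_over(self):
        # Any write still in `pending` never reached the secondary -- this is
        # the replication loss the alternative configuration would prevent.
        lost = list(self.pending)
        self.primary, self.pending = dict(self.secondary), []
        return lost


class SyncReplicatedStore:
    """Acknowledge a write only after both data centers have applied it."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}

    def put(self, key, value):
        self.primary[key] = value
        time.sleep(CROSS_DC_LATENCY_S)  # pay the replication cost on every write
        self.secondary[key] = value
        return "ack"                    # slower, but nothing to lose on failover

The asynchronous variant is what makes the fast configuration vulnerable: anything still in its pending list when the primary fails is simply gone, which is the loss the proposed configuration would trade performance to avoid.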
The incident began when Datastore started seeing
increased latency caused by infrastructure instability around 12:30 p.m.
on May 25. The latency caused write replication to the secondary
data center to slow “to a crawl.” The App Engine team began the failover
procedure about five minutes after the latency increase was noticed. By
1:05 p.m., read queries were being served out of the secondary data
center and about 10 minutes later the team made an external announcement
of the outage.
Another 10 minutes later, the secondary data
center began serving both read and write traffic; however, latency was
still causing a high rate of request time-outs. By 2:20 p.m. all
applications except the large ones had stabilized, and by 3:10 p.m. all
applications had returned to a normal state of operation.
Here’s
what caused it:
A repository component of Bigtable (a
distributed storage system) that determines the location of entities in the
distributed system became overloaded because of instability in the
compute cluster. The overload prevented requests from knowing where to
send Datastore operations in time to avoid read and write request
time-outs, which occur 30 seconds after a request is made.
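To make the failure mode concrete, here is a minimal sketch of how a slow location lookup can consume a request’s entire deadline before the Datastore operation is ever dispatched. The function names are hypothetical; only the 30-second timeout comes from the post-mortem.

import time

REQUEST_DEADLINE_S = 30.0  # read and write requests time out 30 seconds after they are made


class DeadlineExceededError(Exception):
    """Raised when a request cannot finish within its deadline."""


def datastore_op(key, lookup_location, perform_op):
    """Hypothetical request path: locate the entity, then read or write it."""
    start = time.monotonic()

    # Step 1: ask the (overloaded) location component which server owns `key`.
    location = lookup_location(key)

    # Step 2: if the lookup alone ate the deadline, the request fails before
    # the actual Datastore operation can be sent anywhere.
    if time.monotonic() - start >= REQUEST_DEADLINE_S:
        raise DeadlineExceededError("request exceeded its 30-second deadline")

    return perform_op(location, key)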
Delays
in the processing of Datastore requests caused other requests to stack up
beyond App Engine’s safety limit, causing all App Engine requests to
fail.
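A rough sketch of how such a safety limit cascades: once slow requests keep enough work in flight, the limit is reached and every new request is rejected outright. The limit value and names below are assumptions, not App Engine internals.

MAX_PENDING_REQUESTS = 500  # assumed per-application safety limit, not the real figure


class RequestGate:
    """Tracks in-flight requests and refuses new ones past the safety limit."""

    def __init__(self, limit=MAX_PENDING_REQUESTS):
        self.limit = limit
        self.pending = 0

    def admit(self):
        """Return False when a new request must be rejected immediately."""
        if self.pending >= self.limit:
            return False  # limit reached: every new request fails
        self.pending += 1
        return True

    def finish(self):
        """Call when a request completes or times out after 30 seconds."""
        self.pending -= 1


# When Datastore latency keeps requests pending for close to 30 seconds each,
# `pending` climbs to the limit and admit() starts refusing everything -- the
# "all App Engine requests fail" behaviour described above.
gate = RequestGate()
rejected = sum(0 if gate.admit() else 1 for _ in range(600))
print(rejected)  # the 100 requests past the limit are rejected outright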
Because of the outage, some writes created by the primary
Datastore were not applied to the secondary Datastore during failover,
“causing the mirror image between the primary and secondary Datastore to
be out of sync.” Google said its team will contact all administrators
whose applications were affected and instruct them to take appropriate
action. The company said two percent of all applications were affected.
- wong chee tat :)