Friday, September 24, 2010

Facebook apologizes for worst outage in 4 years

TAIPEI, 24 SEPTEMBER 2010 -

Many Facebook users were unable to access the social networking site for up to two and a half hours on Thursday, the worst outage the website has had in over four years, Facebook said in a posting.

The problems were traced back to a change made by Facebook in one of its systems.

The change was made to a piece of data that was consulted whenever an error-checking routine found invalid data in Facebook's system. The new value was itself interpreted as invalid, which caused the system to try to replace it with the same piece of data, setting off a feedback loop.

The loop resulted in hundreds of thousands of queries per second being sent to Facebook's database cluster, overwhelming the system.

The result for users was a "DNS error" message and no access to the site.

"The way to stop the feedback cycle was quite painful - we had to stop all traffic to this database cluster, which meant turning off the site," wrote Robert Johnson, director of software engineering at Facebook, in a post on the site. "Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site."

The underlying problem hasn't been entirely fixed. Johnson said Facebook had to turn off the automated system to get the website back up and running, even though that system plays an integral role in protecting the site.

Facebook is now exploring new designs for the configuration system so that a similar error won't set off another feedback loop.

"We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously," he wrote.

It was the second consecutive day that Facebook went down for some users. On Wednesday, Facebook had blamed a third-party networking provider for making the site inaccessible to some.


- wong chee tat :)


More Details on Today's Outage (Facebook)

by Robert Johnson on Friday, September 24, 2010 at 8:29am
Early today Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage we’ve had in over four years, and we wanted to first of all apologize for it. We also wanted to provide much more technical detail on what happened and share one big lesson learned.

The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.
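A minimal sketch of that intended repair path might look like the following. The names here (get_config, is_valid, and the cache and persistent-store interfaces) are hypothetical; the note does not describe the real implementation.

    # Sketch of the intended behavior, under assumed interfaces:
    # 'cache' supports get/set, 'persistent_store' supports query.
    def is_valid(value):
        """Placeholder check; the actual validation rules are not public."""
        return value is not None

    def get_config(key, cache, persistent_store):
        value = cache.get(key)
        if is_valid(value):
            return value
        # Cache entry is missing or invalid: repair it from the persistent store.
        fresh = persistent_store.query(key)  # one database round trip per repair
        cache.set(key, fresh)
        return fresh

Note that every repair costs a database query, which is what makes the failure described next possible.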

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.

To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover.
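As a hedged illustration of that failure mode, the sketch below (same hypothetical interfaces as above, not Facebook's code) conflates a database error with an invalid value, so every failed query deletes the cache key and guarantees yet another query.

    # Hypothetical sketch of the bug described above.
    class DatabaseError(Exception):
        pass

    class ConfigUnavailable(Exception):
        pass

    def is_valid(value):
        return value is not None

    def get_config_buggy(key, cache, persistent_store):
        value = cache.get(key)
        if is_valid(value):
            return value
        try:
            fresh = persistent_store.query(key)
        except DatabaseError:
            fresh = None  # a database *error* is conflated with an *invalid value*
        if not is_valid(fresh):
            cache.delete(key)  # forces the next caller to query the database again
            raise ConfigUnavailable(key)
        cache.set(key, fresh)
        return fresh

Once the cluster is overloaded, queries fail, failures delete cache keys, and deleted keys generate more queries: the loop feeds itself even after the original bad value has been corrected.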

The way to stop the feedback cycle was quite painful - we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we’ve turned off the system that attempts to correct configuration values. We’re exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
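The post doesn't say which design was chosen. Purely as an illustration of a pattern that deals more gracefully with feedback loops and transient spikes, a simple circuit breaker stops querying a struggling backend for a cool-down period instead of retrying immediately:

    # Illustrative circuit breaker (not Facebook's actual design): after
    # enough consecutive failures, skip the database entirely for a while,
    # giving it room to recover instead of feeding the feedback loop.
    import time

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, reset_after=30.0):
            self.failure_threshold = failure_threshold
            self.reset_after = reset_after  # seconds to stay open
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after:
                    raise RuntimeError("circuit open; skipping database query")
                self.opened_at = None  # half-open: allow one trial request
                self.failures = 0
            try:
                result = fn(*args)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # trip the breaker
                raise
            self.failures = 0  # a success resets the failure count
            return result

Combined with serving a stale or default configuration value while the breaker is open, a design like this lets the databases recover instead of being hammered by repair traffic.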

We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously. 


- wong chee tat :)