Friday, October 26, 2012

About today's App Engine outage

This morning we failed to live up to our promise, and Google App Engine applications experienced increased latencies and time-out errors.

We know you rely on App Engine to create applications that are easy to develop and manage without having to worry about downtime. App Engine is not supposed to go down, and our engineers work diligently to ensure that it doesn’t. However, from approximately 7:30 to 11:30 AM US/Pacific, about 50% of requests to App Engine applications failed.

Here’s what happened, from what we know today:

Summary

  • 4:00 am - Load begins increasing on traffic routers in one of the App Engine datacenters.
  • 6:10 am - The load on traffic routers in the affected datacenter passes our paging threshold.
  • 6:30 am - We begin a global restart of the traffic routers to address the load in the affected datacenter.
  • 7:30 am - The global restart plus additional load unexpectedly reduces the count of healthy traffic routers below the minimum required for reliable operation. This causes overload in the remaining traffic routers, spreading to all App Engine datacenters. Applications begin consistently experiencing elevated error rates and latencies.
  • 8:28 am - google-appengine-downtime-notify@googlegroups.com is updated with notification that we are aware of the incident and working to repair it.
  • 11:10 am - We determine that App Engine’s traffic routers are trapped in a cascading failure, and that we have no option other than to perform a full restart with gradual traffic ramp-up to return to service.
  • 11:45 am - Traffic ramp-up completes, and App Engine returns to normal operation.
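The "gradual traffic ramp-up" in the last step is a standard recovery technique for cascading failures: rather than readmitting full load onto freshly restarted servers (which would immediately overload them again), admission is ramped up over time. A minimal sketch of the idea, assuming a simple linear ramp with probabilistic load shedding (the parameters and ramp shape here are illustrative, not App Engine's actual values):

```python
import random

def allowed_fraction(elapsed_s, ramp_duration_s=1800.0):
    # Fraction of incoming traffic to admit, ramping linearly from 0 to 1
    # over ramp_duration_s seconds after the restart.
    return min(1.0, max(0.0, elapsed_s / ramp_duration_s))

def admit(elapsed_s, ramp_duration_s=1800.0, rng=random.random):
    # Probabilistically admit a request during the ramp; the rest are shed
    # so the freshly restarted routers are never driven back into overload.
    return rng() < allowed_fraction(elapsed_s, ramp_duration_s)
```

Shedding early requests is what makes the restart stick: each router only ever sees load it has the capacity to serve, so the overload feedback loop never restarts.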

In response to this incident, we have increased our traffic-routing capacity and adjusted our configuration to reduce the possibility of another cascading failure. Several projects are already under way to further scale our traffic routers, reducing the likelihood of cascading failures in the future.

During this incident, no application data was lost and application behavior was restored without any manual intervention by developers. There is no need to make any code or configuration changes to your applications.

We will proactively issue credits to all paid applications for ten percent of their usage for the month of October to cover any SLA violations. This will appear on applications’ November bills. There is no need to take any action to receive this credit.

We apologize for this outage, and in particular for its duration and severity. Since launching the High Replication Datastore in January 2011, App Engine has not experienced a widespread system outage.  We know that hundreds of thousands of developers rely on App Engine to provide a stable, scalable infrastructure for their applications, and we will continue to improve our systems and processes to live up to this expectation.

- Posted by Peter S. Magnusson, Engineering Director, Google App Engine

33 comments:

Paul Bailey said...

So what was the cause of the increased load on the routers? And is this related to other outages around the internet today?

a said...

Thank you, Google, for all of your hard work on this and for being open about what happened. It is clear that you are listening to users' feedback and have made great strides in how you deal with downtime. I'm very happy I'm on App Engine and continue to be impressed with all that has gone into it.

HighVolumeSeller Blog said...

Echo what Jon said. I really appreciate the transparency, and I have production paid apps that were affected, but when you come out and are upfront like this, it's easy to appreciate the complexity you're dealing with, and I feel bad for you guys more than anything. Keep up the great, great work App Engine Team!!

Anonymous said...
This comment has been removed by the author.
Anonymous said...

The increased traffic that your post mentions resulted in my application spawning as many as 120 additional instances over a very short period of time. Was App Engine the target of a DDoS attack? When I look at my dashboard, the request frequency pattern seems consistent with normal usage, albeit a lot more of it. Were your traffic routers replaying/repeating lots of requests?

Anonymous said...

@Jon Grall

App Engine shouldn't replay requests, because we can't be sure that the request is idempotent (even if it is a GET request). I'm not completely certain that there is no replay, but I would be super surprised if that were the case.

Perhaps your app saw more traffic because people were hitting your app more often (e.g. repeatedly hitting reload in their browsers) due to the intermittent failures? Maybe your users were doing their own version of "replay"?
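The idempotency concern above is easy to illustrate: a handler with side effects must deduplicate retried or replayed requests itself, for example with an idempotency key. A generic toy sketch (not an App Engine API; the names are made up for illustration):

```python
seen_keys = set()
state = {"writes": 0}

def handle(idempotency_key):
    # Process each logical request at most once: replays and client
    # retries carrying the same key are detected and ignored.
    if idempotency_key in seen_keys:
        return "duplicate ignored"
    seen_keys.add(idempotency_key)
    state["writes"] += 1  # the side effect we must not repeat
    return "processed"
```

Without a scheme like this, neither the platform nor the client can safely replay a request, even a GET, because there is no general way to know the handler is side-effect free.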

netzeta said...

Today we have seen 380 instances spawned during that short period

Feedback through groups is good, but many times the memcache service fails and, as a result, our service fails. If you could make the memcache implementation more robust, that would be great.


Jacob Taylor said...

We noticed a 5x increase in server instances. I think the scaling algorithm kicked in when instance latency grew to 60 seconds. Request latency is a key component in the decision to spawn more instances, right?

Jacob
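A latency-proportional scaler of the kind Jacob describes can be sketched as follows (purely hypothetical; App Engine's actual scheduler is not public, and the target, cap, and growth rule here are invented for illustration):

```python
def target_instances(current, observed_latency_ms,
                     target_latency_ms=100, max_instances=500):
    # Hypothetical rule: if observed request latency exceeds the target,
    # grow the instance count proportionally, capped at max_instances.
    if observed_latency_ms <= target_latency_ms:
        return current
    factor = observed_latency_ms / target_latency_ms
    return min(max_instances, round(current * factor))
```

Under a rule like this, 60-second latencies against a sub-second target would drive even a modest app straight to the cap, which is consistent with the instance explosions reported in the comments above.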

wreleven said...

@Jon Grall

My app experienced the same issue a few days ago. Over 64K warm-up requests were issued from a Google server. At times I had over 200 instances spun up. My budget was exceeded twice, and I finally had to update my app and remove the warm-up service.

wreleven said...

I have a question about the refunds. We have an app that receives AdWords clicks, and it went down during this issue. Many of those AdWords clicks were obviously wasted due to the App Engine downtime. What's Google's policy on inter-service refunds?

Renan Mobile said...

I really trust you, Google. Thanks for being professional and for the fast feedback. But please, don't let us get downtime again.

Renan Mobile said...
This comment has been removed by the author.
Unknown said...

You need to implement a better communication/help channel. I spent about two hours trying to get an official announcement about the incident; I thought the problem was my internet connection, DNS, or my own app. If this happens again, the most important thing is to communicate quickly and keep us updated.

Unknown said...

Since the App Engine dashboard itself was down too, +1 for faster acknowledgement methods outside of the same stack (G+, Twitter, email).