Wednesday, July 20, 2011

Postmortem: Java App Engine outage, July 14, 2011

Summary
Last week, we posted about a limited outage on July 14, 2011. Now that our internal postmortem is complete, we thought you would also like to get more detail about what went wrong and what we are going to do to ensure this doesn't happen again.

Root Cause and Analysis
The main lesson learned is that we need to improve our live traffic testing: a relatively minor bug triggered a corner case for some of our customers. The bug was in a new release of the infrastructure in the App Engine Java execution environment. During development, testing, and qualification, this bug was essentially hidden from view because it only manifested itself under specific load patterns. During the outage, requests to affected applications failed with errors whenever traffic was routed to affected instances. Application logs would have shown that affected instances experienced high latency, high error rates, or were unreachable from the Internet. This could have been caught by letting the live traffic testing run longer.
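One way to surface this class of bug is to keep a new release on a small slice of live traffic for longer before rolling it out fully, and to compare its behavior against the current release. The sketch below is purely illustrative (the class, counters, and thresholds are hypothetical, not App Engine's actual release tooling): it compares error rates between baseline and canary instance groups and refuses to call the canary healthy until enough live requests have been observed.

```java
/**
 * Illustrative sketch only: a simplified canary check that compares error
 * rates between instances on the current release (baseline) and instances
 * on a new release (canary) while a slice of live traffic is observed.
 * Names and thresholds here are hypothetical, not App Engine internals.
 */
public class CanaryCheck {

    /** Request counters collected from an instance group by monitoring. */
    static class Counters {
        long requests;
        long errors;

        double errorRate() {
            return requests == 0 ? 0.0 : (double) errors / requests;
        }
    }

    /**
     * Decide whether the canary looks healthy. A longer observation window
     * (a higher minRequests) makes low-frequency, load-dependent bugs more
     * likely to surface before a full rollout.
     */
    static boolean canaryHealthy(Counters baseline, Counters canary,
                                 long minRequests, double maxErrorRateDelta) {
        if (canary.requests < minRequests) {
            // Not enough live traffic observed yet; keep the canary running.
            return false;
        }
        return canary.errorRate() - baseline.errorRate() <= maxErrorRateDelta;
    }

    public static void main(String[] args) {
        // Simulated counters; in practice these would come from monitoring.
        Counters baseline = new Counters();
        baseline.requests = 1000000;
        baseline.errors = 500;

        Counters canary = new Counters();
        canary.requests = 50000;
        canary.errors = 950;  // a load-dependent bug inflating errors

        boolean ok = canaryHealthy(baseline, canary, 10000, 0.001);
        System.out.println("Canary healthy: " + ok);
    }
}
```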
In order for live traffic testing to work properly, we need to improve our monitoring as well. In this case, having more points from which to do black box monitoring would have helped immensely. We are currently working on much broader monitoring for App Engine and will be integrating more extensive black box testing in upcoming quarters.
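To make the black box idea concrete, here is a minimal sketch of such a probe (the URL and thresholds are placeholders, not real App Engine endpoints): it fetches an application over HTTP the way an end user would and records latency and whether the request succeeded. Run from several vantage points and aggregated, checks like this catch instances that are slow or unreachable from the Internet.

```java
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Minimal sketch of an external ("black box") probe: fetch a health URL the
 * same way end users would, and record latency and success. The URL and
 * thresholds below are placeholders, not actual App Engine endpoints.
 */
public class BlackBoxProbe {

    public static void main(String[] args) throws Exception {
        String target = "https://example-app.appspot.com/healthz"; // placeholder
        long start = System.currentTimeMillis();
        int status = -1;
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(target).openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            status = conn.getResponseCode();
        } catch (Exception e) {
            // Treat network failures as probe failures (status stays -1).
        }
        long latencyMs = System.currentTimeMillis() - start;

        // A simple health verdict; real alerting would aggregate many probes
        // from many vantage points before paging anyone.
        boolean healthy = status == 200 && latencyMs < 2000;
        System.out.println("status=" + status
                + " latencyMs=" + latencyMs
                + " healthy=" + healthy);
    }
}
```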
Once again, we’d like to point out that we could have done a much better job of communicating these issues to all of you. While we strive to strike a balance between letting you know about major issues and not bothering you with day-to-day operations, we clearly should have communicated this incident to you sooner. Rest assured you’ll be better informed of issues in the future.

Timeline
July 14, 2011 - 11:30 AM US/Pacific - The new Java execution environment is released to production.
July 14, 2011 - 5:00-6:00 PM US/Pacific - The previously scheduled Master/Slave read-only maintenance period occurred.
July 14, 2011 - 8:00-9:30 PM US/Pacific - Monitoring shows error rates and latency for Java applications using the Master/Slave datastore are slowly increasing across the entire system. Investigation reveals that the new Java execution environment is malfunctioning.
July 14, 2011 - 9:30 PM US/Pacific - Rollback of the Java execution environment to the previous version begins. Latency and error rates begin to fall.
July 14, 2011 - 11:30 PM US/Pacific - Rollback of the Java execution environment to the previous version completes. Java Master/Slave applications are functioning normally.

Remediation
  • Faster notification on our status site and downtime-notify mailing list
  • More live traffic stress tests for new releases
  • Better black box monitoring to detect small impacts more quickly
[Edit] Clarification: no High Replication (HR) datastore apps were affected. Overall, the outage resulted in a 1.9% error rate, affecting approximately 0.005% of all App Engine traffic at peak.

1 comment:

bendanpa said...

Since HR datastore apps were not affected by this downtime, I hope GAE will provide a way to help migrate apps to the HR datastore.