Tuesday, November 29, 2011

Scaling with the Kindle Fire

Today’s blog post comes to us from Greg Bayer of Pulse, a popular news reading application for iPhone, iPad and Android devices. Pulse has used Google App Engine as a core part of their infrastructure for over a year and they recently celebrated a significant launch. We hope you find their experiences and tips on scaling useful.




As part of the much-anticipated Kindle Fire launch, Pulse was announced as one of the few preloaded apps. When you first unbox the Fire, Pulse will be there waiting for you on the home row, next to Facebook and IMDB!

Scale
The Kindle Fire is projected to sell over five million units this quarter alone. This means that those of us who work on backend infrastructure at Pulse have had to prepare for nearly doubling our user-base in a very short period. We also need to be ready for spikes in load due to press events and the holiday season.

Architecture
As I’ve discussed previously on the Pulse Engineering Blog, Pulse’s infrastructure has been designed with scalability in mind from the beginning. We’ve built our web site and client APIs on top of Google App Engine, which has allowed us to grow steadily from tens to many thousands of requests per second, without needing to re-architect our systems.

While restrictive in some ways, we’ve found App Engine’s frontend serving instances (running Python in our case) to be extremely scalable, with minimal operational support from our team. We’ve also found the datastore, memcache, and task queue facilities to be equally scalable.

Pulse’s backend infrastructure provides many critical services to our native applications and web site. For example, we cache and serve optimized feed and image data for each source in our catalog. This allows us to minimize latency and data transfer, and is especially important for providing an exceptional user experience on limited mobile connections. Providing this service for millions of users requires us to serve hundreds of millions of requests per day. As with any well-designed App Engine app, the vast majority of these requests are served out of memcache and never hit the datastore. Another useful technique we use is to set public cache control headers wherever possible, to allow Google’s edge cache (shown as cached requests on the graph below) and ISP / mobile carrier caches to serve unchanged content directly to users.
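The two techniques in this paragraph, serving from a cache before touching the datastore and marking responses as publicly cacheable, can be sketched in plain Python. This is only an illustration, not Pulse's code: FakeMemcache, get_feed, load_from_datastore and public_cache_headers are hypothetical names, and the in-memory class stands in for a real memcache client.

```python
import time


class FakeMemcache:
    """In-memory stand-in for a memcache client (hypothetical, for illustration)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at < time.time():
            # Expired entries behave like misses.
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.time() + ttl_seconds)


def get_feed(source_id, cache, load_from_datastore, ttl_seconds=300):
    """Serve from the cache when possible; use the slow path only on a miss."""
    key = "feed:" + source_id
    feed = cache.get(key)
    if feed is None:
        feed = load_from_datastore(source_id)
        cache.set(key, feed, ttl_seconds)
    return feed


def public_cache_headers(max_age_seconds):
    """Headers that let shared caches (edge, ISP, carrier) reuse a response."""
    return {"Cache-Control": "public, max-age=%d" % max_age_seconds}
```

With this shape, only the first request for a source pays for the datastore read; later requests within the TTL are answered from the cache, and the Cache-Control header lets caches outside the app answer some requests without reaching it at all.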



Costs
Based on App Engine’s projected billing statements leading up to the recent pricing changes, we were concerned that our costs might increase significantly. To prepare for these changes and the expected additional load from Kindle Fire users, we invested some time in diagnosing and reducing these costs. In most cases, the increases turned out to be an indicator of inefficiencies in our code and/or in the App Engine scheduler. With a little optimization, we have reduced these costs dramatically.

The new tuning sliders for the scheduler make it possible to rein in overly aggressive instance allocation. In the old pricing structure, idle instance time wasn’t charged for at all, so these inefficiencies were usually ignored. Now App Engine charges for all instance time by default. However, any time App Engine runs more idle instances than you’ve allowed, those hours are free. This acts as a hint to the scheduler, helping it reduce unneeded idle instances. By doing some testing to find the optimal cost vs spike latency tolerance and setting the sliders to those levels, we were able to reduce our frontend instance costs to near original levels. Our heavy usage of memcache (which is still free!) also helps keep our instance hours down.
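To make the billing rule above concrete, here is a small back-of-the-envelope model. It assumes a simplified view in which active instances are always billed and idle instances are billed only up to the configured maximum; the function name and the hourly sampling are illustrative assumptions, not App Engine's actual accounting.

```python
def billable_instance_hours(hourly_samples, max_idle_instances):
    """Sum billable instance-hours over hourly (active, idle) samples.

    Assumed model: every active instance is billed, while idle instances
    beyond max_idle_instances are free.
    """
    total = 0
    for active, idle in hourly_samples:
        total += active + min(idle, max_idle_instances)
    return total


# Three hours of traffic: (active instances, idle instances) per hour.
samples = [(4, 6), (10, 2), (3, 8)]
loose = billable_instance_hours(samples, max_idle_instances=100)
tight = billable_instance_hours(samples, max_idle_instances=3)
```

Under these assumptions the tighter setting bills 25 instance-hours instead of 33 for the same traffic. The trade-off is less spare capacity to absorb a sudden spike.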



Since datastore operations used to be charged under the umbrella of CPU hours, it was difficult to know the cost of these operations under the old pricing structure. This meant it was easy to miss application inefficiencies, especially for write-heavy workloads where additional indexes can have a multiplicative effect on costs. In our case, the new datastore write operations metric led us to notice some inefficiencies in our design and a tendency to overuse indexes. We are now working to minimize the number of indexes our queries rely on, and this has started to reduce our write costs.
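The multiplicative effect of indexes is easy to see with a little arithmetic. The sketch below assumes the per-entity accounting published with the new pricing (2 writes for a new entity, plus 2 per indexed property value and 1 per composite index value); treat those constants as an assumption and check the current billing documentation before relying on them. The function name is hypothetical.

```python
def new_entity_write_ops(indexed_property_values, composite_index_values=0):
    """Estimated datastore write operations for putting one new entity.

    Assumed formula: 2 + 2 per indexed property value
    + 1 per composite index value.
    """
    return 2 + 2 * indexed_property_values + composite_index_values


unindexed = new_entity_write_ops(0)
ten_indexed = new_entity_write_ops(10)
with_composites = new_entity_write_ops(10, composite_index_values=5)
```

Under this model an entity with ten indexed property values costs 22 write operations against 2 for a fully unindexed one, which is why marking properties as unindexed when no query needs them can cut write costs substantially.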

Preparing for the Kindle Fire Launch
We took a few additional steps to prepare for the expected load increase and spikes associated with the Fire’s launch. First, we contacted App Engine’s support team to warn them of the expected increase. This is recommended for any app at or near 10,000 requests per second (to make sure your application is correctly provisioned). We also signed up for a Premier account which gets us additional support and simpler billing.

Architecturally, we decided to split our load across three primary applications, each serving different use cases. While this makes it harder to access data across these applications, those same boundaries serve to isolate potential load-related problems and make tuning simpler. In our case, we were able to divide certain parts of our infrastructure, where cross application data access was less important and load would be significant. Until App Engine provides more visibility into and control of memcache eviction policies, this approach also helps prevent lower priority data from evicting critical data.

I’m hopeful that in the near future such division of services will not be required. Individually tunable load isolation zones and memcache controls would certainly make it a lot more appealing to have everything in a single application. Until then, this technique works quite well, and helps to simplify how we think about scaling.

To learn more about Pulse, check out our website! If you have comments or questions about this post or just want to reach out directly, you can find me @gregbayer.

Wednesday, November 16, 2011

New Datastore client library for Python ready for a test drive


Last week we announced that App Engine has left preview and is now an officially supported product here at Google. And while the release (and the announcement) was chock-full of great features, one of the features that we’d like to call specific attention to is the new Datastore client library for Python (a.k.a. “NDB”).

NDB has been under development for some time and this release marks its availability to a larger audience as an experimental feature. Some of the benefits of this new library include:
  • The StructuredProperty class, which allows entities to have nested structure
  • Integrated two-level caching, using both memcache and a per-request in-process cache
  • High-level asynchronous API using Python generators as coroutines (PEP 342)
  • New, cleaner implementations of Key, Model, Property and Query classes
The version of NDB contained in the 1.6.0 runtime and SDK corresponds to NDB 0.9.1, which is currently the latest NDB release.
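For readers who have not used generators as coroutines before, the idea from PEP 342 can be shown without any App Engine code. The sketch below is not NDB's API: fetch_user, run and the dictionary backend are hypothetical, and it is written for Python 3, where a generator can return a value directly.

```python
def fetch_user(user_id):
    """A coroutine: each yield hands out a request and resumes with its result."""
    profile = yield ("get", "profile:%s" % user_id)
    settings = yield ("get", "settings:%s" % user_id)
    return {"profile": profile, "settings": settings}


def run(coroutine, backend):
    """Drive a coroutine to completion, answering each yielded request."""
    try:
        request = next(coroutine)
        while True:
            _op, key = request
            # send() resumes the coroutine with the result of its request
            # and returns the next request it yields.
            request = coroutine.send(backend[key])
    except StopIteration as stop:
        return stop.value


backend = {"profile:42": "Ada", "settings:42": {"theme": "dark"}}
user = run(fetch_user(42), backend)
```

The coroutine reads like ordinary sequential code, while the driver decides how and when each request is actually satisfied. An asynchronous library can use the same mechanism to overlap many such requests.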

Given that this feature is still experimental, it is subject to change, but that’s exactly why we encourage you to give it a test drive and send us any feedback that you might have. The NDB project hosted on Google Code is the best place to send this feedback. Happy coding!


Posted by Guido van Rossum, Software Engineer on the App Engine Team

Monday, November 14, 2011

Google BigQuery Service: Big data analytics at Google speed

Our post today, cross-posted with the Google Enterprise Blog, comes from one of our sister projects, BigQuery. We know that many of you are interested in processing large volumes of data and we encourage you to try it out.


Rapidly crunching terabytes of big data can lead to better business decisions, but this has traditionally required tremendous IT investments. Imagine a large online retailer that wants to provide better product recommendations by analyzing website usage and purchase patterns from millions of website visits. Or consider a car manufacturer that wants to maximize its advertising impact by learning how its last global campaign performed across billions of multimedia impressions. Fortune 500 companies struggle to unlock the potential of data, so it’s no surprise that it’s been even harder for smaller businesses.

We developed Google BigQuery Service for large-scale internal data analytics. At Google I/O last year, we opened a preview of the service to a limited number of enterprises and developers. Today we're releasing some big improvements, and putting one of Google's most powerful data analysis systems into the hands of more companies of all sizes.
  • We’ve added a graphical user interface for analysts and developers to rapidly explore massive data through a web application.
  • We’ve made big improvements for customers accessing the service programmatically through the API. The new REST API lets you run multiple jobs in the background and manage tables and permissions with more granularity. 
  • Whether you use the BigQuery web application or API, you can now write even more powerful queries with JOIN statements. This lets you run queries across multiple data tables, linked by data that tables have in common.
  • It’s also now easy to manage, secure, and share access to your data tables in BigQuery, and export query results to the desktop or to Google Cloud Storage.

Michael J. Franklin, Professor of Computer Science at UC Berkeley, remarked that BigQuery (internally known as Dremel) leverages “thousands of machines to process data at a scale that is simply jaw-dropping given the current state of the art.” We’re looking forward to helping businesses innovate faster by harnessing their own large data sets. BigQuery is available free of charge for now, and we’ll let customers know at least 30 days before the free period ends. We’re bringing on a new batch of pilot customers, so let us know if your business wants to test-drive BigQuery Service.


Posted by Ju-Kay Kwek, Product Manager

Monday, November 7, 2011

App Engine 1.6.0 Out of Preview Release

Three and a half years after App Engine’s first Campfire One, App Engine has graduated from Preview and is now a fully supported Google product. We started out with the simple philosophy that App Engine should be ‘easy to use, easy to scale, and free to get started.’ And with 100 billion+ monthly hits, 300,000+ active apps, and 100,000+ developers using our product every month it’s clear that this philosophy resonates. Thanks to your support, Google is making a long term investment in App Engine!

When we announced our plans to leave preview earlier this year, we made a commitment to improving the service by adding support for Python 2.7, Premier Accounts and Backends as well as several changes launching today:

We are also holding a series of App Engine Office hours via Google+ this week for any users who have questions about how these changes impact their applications. The list of times can be found on the Google Developers events page, with links to join the hangout while the office hours are scheduled. Also, please don’t hesitate to contact us at appengine_updated_pricing@google.com with any questions or concerns.

In addition to leaving Preview, we have several additional changes to announce today.

Production Changes
For billing enabled apps, we are offering two more scheduler controls and some additional changes:
  • Min Idle Instances: You can now adjust the minimum number of Idle Instances for your application, from 1 to 100. Users who had previously signed up for “Always On” can now set the number of idle instances for their applications using this setting.
  • Max Pending Latency: For applications that care about user-facing latency, this slider allows you to set a limit on the amount of time a request spends in the pending queue before a new instance is started.
  • Blobstore API: You can now use the Blobstore API without signing up for billing.

Datastore Changes
  • High Replication Datastore Migration Tool: We are releasing an experimental tool that allows you to easily migrate your data from Master/Slave to High Replication Datastore, and seamlessly switch your application’s serving to the new HRD application.
  • Query Planning Improvements: We’ve published an article that details recent improvements to our query planner that eliminate the need for exploding indexes.
Python
  • MapReduce: We are releasing the full MapReduce framework as an experimental feature for Python. The framework includes the Map, Shuffle, and Reduce phases.
  • Python 2.7 in the SDK: The SDK now supports the Python 2.7 runtime, so you can test out your changes before uploading them to production.
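As a reminder of what the three MapReduce phases do, here is a tiny in-memory word count. It is a conceptual sketch only and does not use the App Engine MapReduce framework; all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Map: turn each input record into zero or more (key, value) pairs."""
    for record in records:
        for key, value in mapper(record):
            yield key, value


def shuffle_phase(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Reduce: collapse each key's values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count(lines):
    pairs = map_phase(lines, lambda line: ((word.lower(), 1) for word in line.split()))
    return reduce_phase(shuffle_phase(pairs), lambda _key, values: sum(values))


counts = word_count(["to be or", "not to be"])
```

A real framework runs each phase in parallel across many workers and stores intermediate data durably, but the data flow is the same.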
Java
  • Memcache API Improvements: The Memcache API for Java now supports asynchronous calls. Additionally, putIfUntouched() and getIdentifiable() now support batch operations.
  • Capability Testing: We’ve added the ability to simulate the capability state of local API implementations to test your application’s behavior if a service is unavailable.
  • Datastore Callbacks: You can now specify actions to perform before or after a put() or delete() call.
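The compare-and-set pattern behind getIdentifiable() and putIfUntouched() can be sketched in a few lines. The example is in Python for consistency with the other samples on this blog and uses an in-memory stand-in, so FakeCasCache and its method names are hypothetical and do not reproduce the Java Memcache API.

```python
class FakeCasCache:
    """In-memory stand-in that tracks a version number per key."""

    def __init__(self):
        self._values = {}
        self._versions = {}

    def get_identifiable(self, key):
        """Return the value together with the version it was read at."""
        return self._values.get(key), self._versions.get(key, 0)

    def put_if_untouched(self, key, seen_version, new_value):
        """Store new_value only if nobody wrote the key since seen_version."""
        if self._versions.get(key, 0) != seen_version:
            return False
        self._values[key] = new_value
        self._versions[key] = seen_version + 1
        return True


def increment(cache, key, retries=10):
    """Read-modify-write with retries instead of locks."""
    for _ in range(retries):
        value, version = cache.get_identifiable(key)
        if cache.put_if_untouched(key, version, (value or 0) + 1):
            return True
    return False
```

A writer that loses the race simply re-reads and tries again, so concurrent updates never silently overwrite each other.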
The full list of changes with this release can be found in the release notes (Python, Java). We’d love to hear your feedback about this release in the groups. And we’d like to thank you all for investing in our platform for the last three years. We’re excited for this milestone in App Engine history, and we look forward to what the future will bring.

Posted by The App Engine Team

Oracle and Java are registered trademarks of Oracle and/or its affiliates.

Thursday, October 27, 2011

ProdEagle - Analyzing your App Engine apps in real-time


Today’s post comes to us from Andrin von Rechenberg of MiuMeet who has developed an easy to use analysis framework, ProdEagle, for App Engine apps. ProdEagle enables you to easily count and visualize events to help better understand both performance and usage of your site.

ProdEagle allows you to monitor your system, lets you analyse in real-time what is going on and can alert you if something goes wrong.

The story

We are a small start-up that created an incredibly fast-growing social dating platform called MiuMeet. We used App Engine from the start and within 6 months we had over 1 million registered users - the system scaled beautifully. One of the world’s leading online dating companies invested in our start-up and gave us a very good piece of advice: Before you build new features, you need to understand what’s actually going on in your system. Of course we thought we knew exactly what was going on in our system. We had the App Engine Dashboard and we had Google Analytics and we ran a couple of daily MapReduces to collect some statistics. But honestly, compared to today, we didn’t have any idea what was going on.

Understanding your system

From our MapReduces we knew that most users who accessed our dating platform used our Android app and only very few users used our iPhone app. We weren’t exactly sure what the reason for that was, but we wanted to improve the experience for our iPhone users. So we had to start measuring what was different for iPhone vs Android users.

But how do you do this? You start recording everything that happens in your system. And by “everything” we don’t mean the status code of an HTTP response but human understandable events, like “an Android user starts a conversation”, “the average length of a conversation”, “how often iPhone users log in” and so on. We realized that iPhone users are much less active and reply less often - not because they don’t want to, but because they don’t know immediately that they got a new message. Android users are informed by Push-Notifications, iPhone users via email. We did not expect that this would make such a big difference, but now it’s pretty obvious which feature we should build next to improve the iPhone experience. Our conclusions are based on hard numbers and not guesses.

We used ProdEagle to collect and analyse the data that allowed us to realize this. ProdEagle can count, display and compare events in real-time and can even alert you if something is going wrong.

How to measure events

Let’s start with a simple Python example. We would like to record the devices from which messages are sent in our system. All we have to do is add the prodeagle import and the counter call to our sendMessage function:


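import prodeagle

def sendMessage(sender, to, text, device):
    # "from" is a reserved word in Python, so the first argument is named "sender".
    mailbox = getMailbox(to)
    mailbox.addMessage(sender, text)
    # Count one sent message per device type, e.g. "Message.Android".
    prodeagle.counter.incr("Message." + device)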


This one line of code and a few clicks in the ProdEagle dashboard allow us to create the following graphs:



    *The numbers in these graphs are just examples.



Analysing your system in “real-time”

With ProdEagle you can count whatever you want. The advantage over traditional analytics systems is that you don’t have to wait for a day for the data to appear. Data appears within 1 minute in your dashboard. This means that you can monitor your system almost in real-time and set up alerts if something goes wrong.

In the following example we measure how long it takes to execute a datastore query:


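import time
import prodeagle

def searchPeople(query):
    start = time.time()
    query.execute()
    prodeagle.counter.incr("Search.People.Count")
    # Record elapsed time in whole milliseconds, the unit used by the alert threshold.
    elapsed_ms = int((time.time() - start) * 1000)
    prodeagle.counter.incr("Search.People.Latency", elapsed_ms)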


In the ProdEagle Dashboard we can specify that “Search.People.Latency” should be divided by “Search.People.Count” and plotted as a graph. Additionally we can set up an email alert that fires if the latency was more than 4000ms for 5 minutes:




If the latency is above all the red dots you get an email alert.

Being able to measure latency in real-time is particularly useful when you are trying to optimize your system and want to try out different strategies to answer queries.
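The derived metric and the alert rule described above boil down to simple arithmetic. The following sketch assumes the latency counter holds a sum in milliseconds and that one (latency sum, count) sample is collected per minute; the function names and the exact alert semantics are illustrative assumptions, not ProdEagle's implementation.

```python
def average_latency_ms(latency_sum_ms, count):
    """Average latency for one interval: total latency divided by request count."""
    return latency_sum_ms / count if count else 0.0


def should_alert(per_minute_samples, threshold_ms=4000, window_minutes=5):
    """True if the average exceeded the threshold in each of the last N minutes."""
    recent = per_minute_samples[-window_minutes:]
    if len(recent) < window_minutes:
        return False
    return all(average_latency_ms(total, count) > threshold_ms
               for total, count in recent)


slow = [(9000, 2)] * 5                   # 4500ms average for five minutes
recovering = [(9000, 2)] * 4 + [(6000, 2)]  # last minute averages 3000ms
```

Here should_alert(slow) is true, while should_alert(recovering) is false because the most recent minute dropped back under the threshold.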


The magic behind ProdEagle

Whenever you call prodeagle.counter.incr(counter_name), the ProdEagle library increments a counter in memcache. This shouldn’t add any noticeable latency to your app, unless it is the first call made by a serving instance. Every few minutes, the ProdEagle service that we run collects all these memcache counters and persists them, creates the graphs and alerts you if something is wrong.

Obviously, the system isn’t robust if your memcache gets flushed. To address this, ProdEagle uses 1024 dummy counters to check when your memcache gets flushed and to approximate the inaccuracy of the data in your graphs. This is based on the assumption that elements with the same size in memcache are freed in a “least-recently-used” fashion and that the 1024 dummy counters hit all memcache shards. We have used ProdEagle for MiuMeet for a couple of months now and our approximated accuracy was never below 99.8%.
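One way to picture the dummy-counter idea is the sketch below. It assumes that the share of dummy counters still present at collection time approximates the share of real counter data that survived, which is our reading of the description above rather than ProdEagle's actual code. A plain dictionary stands in for memcache, the names are hypothetical, and the example is written for Python 3.

```python
NUM_DUMMY_COUNTERS = 1024


def write_dummy_counters(cache):
    """Seed the cache with sentinel keys right after each collection."""
    for i in range(NUM_DUMMY_COUNTERS):
        cache["dummy:%d" % i] = 1


def estimated_accuracy(cache):
    """Fraction of sentinels still present when counters are collected."""
    surviving = sum(1 for i in range(NUM_DUMMY_COUNTERS)
                    if "dummy:%d" % i in cache)
    return surviving / NUM_DUMMY_COUNTERS


cache = {}
write_dummy_counters(cache)
del cache["dummy:3"], cache["dummy:700"]  # simulate two evictions
accuracy = estimated_accuracy(cache)
```

Losing two of the 1024 sentinels gives an estimate of roughly 99.8%, and a full flush would show up as 0%.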


I like! How can I get it?

ProdEagle is currently completely free of charge. You can download the Python client libraries and sign up for your dashboard on www.prodeagle.com.

Wednesday, October 19, 2011

App Engine SSL for Custom Domains in Testing

The long awaited SSL for Custom Domains is entering testing and we are now looking for trusted testers. If you are interested in signing up to test this feature, please fill in this form.

We will be offering two types of SSL service, Server Name Indication (SNI) and Virtual IP (VIP). SNI will be significantly less expensive than VIP when this service is launched; however, unlike VIP, it does not work in all browsers that support SSL. VIP is a premium service with a dedicated IP and full browser support. Both VIP and SNI support wildcard certificates and certificates with alternate names.

We look forward to making this widely available as soon as possible and as always we welcome your feedback in the group.


Posted by The App Engine Team

Tuesday, October 11, 2011

App Engine 1.5.5 SDK Release

2011 has seen some exciting releases for App Engine. As the days get shorter, the weather gets colder, and all that Halloween candy starts tempting everyone in the grocery store, we’ve been hard at work on our latest action packed release.

Premier Accounts
When choosing a platform for your most critical business applications, we recognize that uptime guarantees, easy management and paid support are often just as important as product features. So today we’re launching Google App Engine premier accounts. For $500 per month (not including the cost to provision internet services), you’ll receive:
  • Premium support (see the Technical Support Services Guidelines for details).
  • A 99.95% uptime Service Level Agreement (see the draft agreement; the final agreement will be in the signed offline agreement).
  • The ability to create an unlimited number of apps on your premier account domain.
  • No minimum monthly fees per app. Pay only for the resources you use.
  • Monthly billing via invoice.
To sign up for a premier account, please contact our sales team at appengine_premier_requests@google.com.  


Python 2.7
PIL? NumPy? Concurrent requests? Python 2.7 has it all, and today we’re opening up Python 2.7 as an experimental release. We’ve put together a list of all the known differences between the current 2.5 runtime and the new runtime.


Overall Changes
We know that bumping up against hard limits can be frustrating, and we’ve talked all year about our continued push to lift our system limits. With this release we are raising several of these:
  • Request Duration: The frontend request deadline has been increased from 30 seconds to 60 seconds. We’ve also increased the maximum URLFetch deadline to match, from 10 seconds to 60 seconds.
  • File limits: We’ve increased the number of files you can upload with your application from 3,000 to 10,000 files, and the file size limit has also been increased from 10MB to 32MB.
  • API Limits: Post payloads for URLFetches are now capped at 5MB instead of 1MB.
We’re also announcing several limited preview features and trusted tester programs:
  • Cloud SQL Preview: We announced last week that we are offering a preview of SQL support in App Engine. Give it a try and let us know what you think.
  • Full-text Search: We are looking for early trusted testers for our long anticipated Full-Text Search API. Please fill out this form if you’re interested in trying it out.
  • Conversion API: Ever wanted to convert from text to PDF in your App? Then consider signing up as a trusted tester for the Conversion API.
Datastore
  • Cross Group (XG) Transactions: For those who need transactional writes to entities in multiple entity groups (and that's everyone, right?), XG Transactions are just the thing. This feature uses two phase commit to make cross group writes atomic just like single group writes.
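For readers unfamiliar with two-phase commit, the general idea looks like this. The sketch illustrates the textbook protocol with in-memory stand-ins; it is not the Datastore's implementation, and Participant and two_phase_commit are hypothetical names.

```python
class Participant:
    """One entity group taking part in the transaction (in-memory stand-in)."""

    def __init__(self, name, data, can_commit=True):
        self.name = name
        self.data = dict(data)
        self.can_commit = can_commit
        self._staged = None

    def prepare(self, updates):
        """Phase 1: stage the writes and vote on whether they can be applied."""
        if not self.can_commit:
            return False
        self._staged = dict(updates)
        return True

    def commit(self):
        """Phase 2: make the staged writes visible."""
        self.data.update(self._staged)
        self._staged = None

    def abort(self):
        """Discard staged writes without touching the visible data."""
        self._staged = None


def two_phase_commit(writes):
    """Apply every (participant, updates) pair, or none of them."""
    prepared = []
    for participant, updates in writes:
        if participant.prepare(updates):
            prepared.append(participant)
        else:
            # One "no" vote aborts everything that was already staged.
            for staged in prepared:
                staged.abort()
            return False
    for participant in prepared:
        participant.commit()
    return True
```

Nothing becomes visible until every participant has voted yes, which is what makes a write spanning several groups behave atomically.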
Platform Improvements
Of course, these are just the high level changes. This release is packed full of features and bug fixes, and as always, we welcome your feedback in the group.