Friday, June 5, 2009

10 things you (probably) didn't know about App Engine

What could be better than nine nifty tips and tricks about App Engine? Why, ten of course. As we've been participating in the discussion groups, we've noticed that some features of App Engine often go unnoticed, so we've come up with just under eleven fun facts which might just change the way that you develop your app. Without further ado, bring on the first tip:

1. App Versions are strings, not numbers

Although most of the examples show the 'version' field in app.yaml and appengine-web.xml as a number, that's just a matter of convention. App versions can be any string that's allowed in a URL. For example, you could call your versions "live" and "dev", and they would be accessible at "live.latest.yourapp.appspot.com" and "dev.latest.yourapp.appspot.com".

2. You can have multiple versions of your app running simultaneously

As we alluded to in point 1, App Engine permits you to deploy multiple versions of your app and have them running side-by-side. All the versions share the same datastore and memcache, but they run in separate instances and have different URLs. Your 'live' version always serves off yourapp.appspot.com as well as any domains you have mapped, but all your app's versions are accessible at version.latest.yourapp.appspot.com. Multiple versions are particularly useful for testing a new release in a production environment, on real data, before making it available to all your users.

Something that's less known is that the different app versions don't even have to have the same runtime! It's perfectly fine to have one version of an app using the Java runtime and another version of the same app using the Python runtime.

3. The Java runtime supports any language that compiles to Java bytecode

It's called the Java runtime, but in fact there's nothing stopping you from writing your App Engine app in any other language that compiles to JVM bytecode. In fact, there are already people writing App Engine apps in JRuby, Groovy, Scala, Rhino (a JavaScript interpreter), Quercus (a PHP interpreter/compiler), and even Jython! Our community has shared notes on what they've found to work and not work on the following wiki page.

4. The 'IN' and '!=' operators generate multiple datastore queries 'under the hood'

The 'IN' and '!=' operators in the Python runtime are actually implemented in the SDK and translate to multiple queries 'under the hood'.

For example, the query "SELECT * FROM People WHERE name IN ('Bob', 'Jane')" gets translated into two queries, equivalent to running "SELECT * FROM People WHERE name = 'Bob'" and "SELECT * FROM People WHERE name = 'Jane'" and merging the results. Combining multiple disjunctions multiplies the number of queries needed, so the query "SELECT * FROM People WHERE name IN ('Bob', 'Jane') AND age != 25" generates a total of four queries, for each of the possible conditions (age less than or greater than 25, and name is 'Bob' or 'Jane'), then merges them together into a single result set.
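The multiplication described above can be sketched in plain Python. This is only an illustration of the counting, not the SDK's actual implementation; the function name expand_filters and the tuple representation of filters are assumptions made for the example.

```python
import itertools


def expand_filters(filters):
    """Expand IN and != filters into the simple sub-queries to be merged.

    `filters` is a list of (property, operator, value) tuples.
    Returns a list of sub-queries, each a list of simple filters.
    Hypothetical helper: it only models how the query count grows.
    """
    alternatives = []
    for prop, op, value in filters:
        if op == "IN":
            # One equality filter per listed value.
            alternatives.append([(prop, "=", v) for v in value])
        elif op == "!=":
            # "Not equal" becomes "less than" OR "greater than".
            alternatives.append([(prop, "<", value), (prop, ">", value)])
        else:
            alternatives.append([(prop, op, value)])
    # Every combination of alternatives is a separate sub-query.
    return [list(combo) for combo in itertools.product(*alternatives)]


subqueries = expand_filters([
    ("name", "IN", ["Bob", "Jane"]),
    ("age", "!=", 25),
])
for sub in subqueries:
    print(sub)
print(len(subqueries))  # 4
```

Adding a third name to the IN list would raise the count to six, which is why large disjunctions get expensive quickly.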

The upshot of this is that you should avoid using excessively large disjunctions. If you're using an inequality query, for example, and you expect only a small number of records to exactly match the condition (e.g. in the above example, you know very few people will have an age of exactly 25), it may be more efficient to execute the query without the inequality filter and exclude any returned records that don't match it yourself.
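A minimal sketch of filtering on the client side follows. The Person class is a plain stand-in for a datastore entity and exclude_age is a hypothetical helper; in a real app the input would be the results of a query that carries only the name filter.

```python
def exclude_age(people, excluded_age):
    """Yield entities whose age differs from excluded_age.

    `people` is any iterable of objects with an `age` attribute.
    """
    for person in people:
        if person.age != excluded_age:
            yield person


class Person(object):
    """Plain stand-in for a datastore entity, used only for this demo."""

    def __init__(self, name, age):
        self.name = name
        self.age = age


# Pretend these came back from a query without the inequality filter.
fetched = [Person("Bob", 31), Person("Jane", 25), Person("Bob", 40)]
kept = list(exclude_age(fetched, 25))
print([(p.name, p.age) for p in kept])
```

This trades a little wasted transfer (the few records with age 25) for halving the number of datastore queries.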

5. You can batch put, get and delete operations for efficiency

Every time you make a datastore request, such as a query or a get() operation, your app has to send the request off to the datastore, which processes the request and sends back a response. This request-response cycle takes time, and if you're doing a lot of operations one after the other, this can add up to a substantial delay in how long your users have to wait to see a result.

Fortunately, there's an easy way to reduce the number of round trips: batch operations. The db.put(), db.get(), and db.delete() functions all accept lists in addition to their more usual singular invocation. When passed a list, they perform the operation on all the items in the list in a single datastore round trip, and the operations are executed in parallel, saving you a lot of time. For example, take a look at this common pattern:

for entity in MyModel.all().filter("color =",
    old_favorite).fetch(100):
  entity.color = new_favorite
  entity.put()

Doing the update this way requires one datastore round trip for the query, plus one additional round trip for each updated entity - for a total of up to 101 round trips! In comparison, take a look at this example:

updated = []
for entity in MyModel.all().filter("color =",
    old_favorite).fetch(100):
  entity.color = new_favorite
  updated.append(entity)
db.put(updated)

By adding two lines, we've reduced the number of round trips required from 101 to just 2!
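To make the arithmetic concrete, here is a toy model that only counts round trips. CountingDatastore is a made-up stand-in, not part of the SDK; it assumes, as described above, that a query costs one round trip and that a put costs one round trip whether it receives one entity or a list.

```python
class CountingDatastore(object):
    """Toy stand-in for the datastore that only counts round trips."""

    def __init__(self):
        self.round_trips = 0

    def fetch(self, limit):
        # A query is one round trip, however many results it returns.
        self.round_trips += 1
        return [{"color": "red"} for _ in range(limit)]

    def put(self, entities):
        # Accepts one entity or a list; either way it is one round trip.
        self.round_trips += 1


def update_one_by_one(store):
    for entity in store.fetch(100):
        entity["color"] = "blue"
        store.put(entity)
    return store.round_trips


def update_in_batch(store):
    updated = []
    for entity in store.fetch(100):
        entity["color"] = "blue"
        updated.append(entity)
    store.put(updated)
    return store.round_trips


print(update_one_by_one(CountingDatastore()))  # 101
print(update_in_batch(CountingDatastore()))    # 2
```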

6. Datastore performance doesn't depend on how many entities you have

Many people ask about how the datastore will perform once they've inserted 100,000, or a million, or ten million entities. One of the datastore's major strengths is that its performance is totally independent of the number of entities your app has. So much so, in fact, that every entity for every App Engine app is stored in a single BigTable table! Further, when it comes to queries, all the queries that you can execute natively (with the notable exception of those involving 'IN' and '!=' operators - see above) have equivalent execution cost: The cost of running a query is proportional to the number of results returned by that query.

7. The time it takes to build an index isn't entirely dependent on its size

When adding a new index to your app on App Engine, it sometimes takes a significant amount of time to build. People often inquire about this, citing the amount of data they have compared to the time taken. However, requests to build new indexes are actually added to a queue of indexes that need to be built, and processed by a centralized system that builds indexes for all App Engine apps. At peak times, there may be other index building jobs ahead of yours in the queue, delaying when we can start building your index.

8. The value for 'Stored Data' is updated once a day

Once a day, we run a task to recalculate the 'Stored Data' figure for your app based on your actual datastore usage at that time. In the intervening period, we update the figure with an estimate of your usage so we can give you immediate feedback on changes in your usage. This explains why many people have observed that after deleting a large number of entities, their datastore usage remains at previous levels for a while. For billing purposes, only the authoritative number is used, naturally.

9. The order in which handlers are specified in app.yaml, web.xml, and appengine-web.xml matters

One of the more common and subtle mistakes people make when configuring their app is to forget that handlers in the application configuration files are processed in order, from top to bottom. For example, when installing remote_api, many people do the following:

handlers:
- url: /.*
  script: request.py

- url: /remote_api
  script: $PYTHON_LIB/google/appengine/ext/remote_api/handler.py
  login: admin

The above looks fine at first glance, but because handlers are processed in order, the handler for request.py is encountered first, and all requests - even those for remote_api - get handled by request.py. Since request.py doesn't know about remote_api, it returns a 404 Not Found error. The solution is simple: Make sure that the catchall handler comes after all other handlers.
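The first-match behaviour can be mimicked in a few lines of Python. This is a simplified model, not App Engine's actual dispatcher: the route function is hypothetical, and it assumes each URL pattern is a regular expression that must match the whole request path.

```python
import re


def route(handlers, path):
    """Return the script of the first handler whose pattern matches path.

    Handlers are tried top to bottom, like the entries in app.yaml.
    """
    for pattern, script in handlers:
        if re.match("(?:%s)$" % pattern, path):
            return script
    return None


REMOTE_API = "$PYTHON_LIB/google/appengine/ext/remote_api/handler.py"

# The mistaken ordering from the example above.
catchall_first = [
    (r"/.*", "request.py"),
    (r"/remote_api", REMOTE_API),
]
# The fix: the catchall handler comes last.
catchall_last = [
    (r"/remote_api", REMOTE_API),
    (r"/.*", "request.py"),
]

print(route(catchall_first, "/remote_api"))  # request.py
print(route(catchall_last, "/remote_api"))   # the remote_api handler
print(route(catchall_last, "/index.html"))   # request.py
```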

The same is true for the Java runtime, with the additional constraint that all the static file handlers in appengine-web.xml are processed before any of the dynamic handlers in web.xml.

10. You don't need to construct GQL strings by hand

One anti-pattern that comes up a lot looks similar to this:

q = db.GqlQuery("SELECT * FROM People "
    "WHERE first_name = '" + first_name 
    + "' AND last_name = '" + last_name + "'")

As well as opening up your code to injection vulnerabilities, this practice introduces escaping issues (what if a user has an apostrophe in their name?) and potentially, encoding issues. Fortunately, GqlQuery has built-in support for parameter substitution, a common technique for avoiding the need to substitute in strings in the first place. Using parameter substitution, the above query can be rephrased like this:

q = db.GqlQuery("SELECT * FROM People "
    "WHERE first_name = :1 "
    "AND last_name = :2", first_name, last_name)

GqlQuery also supports using named instead of numbered parameters, with the values passed as keyword arguments:

q = db.GqlQuery("SELECT * FROM People "
    "WHERE first_name = :first_name "
    "AND last_name = :last_name", 
    first_name=first_name, last_name=last_name)
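To see the escaping problem concretely, the sketch below builds the query text by concatenation for a name containing an apostrophe. It uses plain strings only and never touches the datastore; the function name and the sample names are made up for the illustration.

```python
def build_by_concatenation(first_name, last_name):
    """The anti-pattern: splice user input straight into the query text."""
    return ("SELECT * FROM People "
            "WHERE first_name = '" + first_name
            + "' AND last_name = '" + last_name + "'")


PARAMETERIZED = ("SELECT * FROM People "
                 "WHERE first_name = :1 "
                 "AND last_name = :2")

query = build_by_concatenation("Pat", "O'Neil")
print(query)
# The stray apostrophe leaves an odd number of quote characters, so the
# string literals in the query text no longer pair up.
print(query.count("'") % 2 == 1)

# With parameters the query text never changes; values travel separately.
values = ("Pat", "O'Neil")
print("%s %r" % (PARAMETERIZED, values))
```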

Aside from cleaning up your code, this also allows for some neat optimizations. If you're going to execute the same query multiple times with different values, you can use GqlQuery.bind() to 'rebind' the values of the parameters for each query. This is faster than constructing a new query each time, because the query only has to be parsed once:

q = db.GqlQuery("SELECT * FROM People "
    "WHERE first_name = :first_name "
    "AND last_name = :last_name")
for first, last in people:
  q.bind(first_name=first, last_name=last)
  person = q.get()
  print person

Java is a trademark or registered trademark of Sun Microsystems, Inc. in the United States and other countries.
