Friday, October 22, 2010

Research Project: AppScale at University of California, Santa Barbara

The following post is a guest post by Chris Bunch, a Computer Science Ph.D. student at the University of California, Santa Barbara. He is one of the student leads on the AppScale project, an open source Google App Engine compatible hosting solution led by Professor Chandra Krintz. Chris has developed and maintained AppScale as a research project over the last two years with fellow student lead Navraj Chohan and others.

---------

Over here at the UCSB Racelab, we've complained endlessly about finding a web framework we actually could use. For a long time we thought we just wouldn't be able to find it - many were so-so or good but only after a substantial learning curve. So imagine our surprise back in April 2008 when we heard about what we thought would be just-another-web-framework provided by Google in the Python version of App Engine. But after giving it a try, we were smitten. We finally found a web framework that (1) we could actually use on non-trivial projects and (2) we could teach in nine-week classes without having students lose half the time with the idiosyncrasies of the programming language involved or the web framework itself. Furthermore, the minimalistic APIs make it simple to get work done: it did for us exactly what we needed and nothing else.

Yet as researchers and hackers-at-heart there was one thing that we really wanted to do with App Engine that we couldn't do: run it on a whole bunch of our machines and tinker with it. A similarly-minded hacker named Chris Anderson had released AppDrop, a modified version of the App Engine SDK that hooked up to PostgreSQL and ran in Amazon EC2, but only on a single machine. So after much discussion, we came up with the following short list of things we wanted to do with App Engine:

  • We wanted to run it on our own virtual machines or those running in Eucalyptus or Amazon EC2 in order to investigate how we can optimally harness cloud infrastructures in our cloud platform.
  • Tons of new datastores have emerged as part of the "NoSQL" movement, and we need a mechanism to evaluate their performance, alongside that of traditional databases such as MySQL, under controlled experiments. We also need a platform that supports adding new data storage mechanisms so that when developers tout the features of their new datastore, we can download it and evaluate it under the same conditions as other datastores.
  • One of the reasons we love Google App Engine is the simple set of APIs provided, but we also wanted to use that as a starting point where we could add new APIs and control the environment in which they run.
  • We love that Google App Engine "just works". You don't know where it's running and how it's running, but you can see that it is running, and we wanted to make sure that whatever we developed did the same. We wanted to develop something that automatically deployed your App Engine app and configured everything for you. Expert users should be able to have more control over the system, but the system should be able to handle your app from the moment you deploy it to the moment you tear it down.
  • It had to be open-source - just like how we wanted something to tinker with and run experiments on, we wanted it to be something that you could tinker with too. We wanted you to be able to add in support for a database you're interested in and see how it performs, and we wanted you to be able to add in APIs that you think would be interesting to have an easy-to-use web framework interact with.

So with that in mind, we created AppScale, an open-source cloud platform for Google App Engine applications. Here's how we did it:

We took the standard three-tier web deployment approach and clearly segmented each tier into a specific component in the system: an AppLoadBalancer routes users to their applications, an AppServer runs the user's App Engine app, and an AppDB handles database interactions. Each has a clearly defined role in the system and is controlled by an AppController, a daemon that runs on each machine, monitors each component, and controls the specific order in which services are started. The AppController also writes all the configuration files for each service and coordinates services with the other AppControllers in the deployment. For those interested, we detail the specifics of the original AppScale implementation in this paper.
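To make the "specific order in which services are started" idea concrete, here is a minimal Python sketch. It is not AppScale's code: the component names come from the description above, but the dependency table (database first, then app server, then load balancer) and the `start_order` and `start_all` helpers are assumptions made purely for illustration.

```python
# Hypothetical sketch of dependency-ordered startup. The DEPENDS_ON table
# and both helper functions are illustrative assumptions, not AppScale code.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each component maps to the set of components that must be running first.
DEPENDS_ON = {
    "AppDB": set(),
    "AppServer": {"AppDB"},
    "AppLoadBalancer": {"AppServer"},
}


def start_order(depends_on):
    """Return the components ordered so that dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())


def start_all(depends_on, start):
    """Call start(name) for every component in dependency order."""
    order = start_order(depends_on)
    for name in order:
        start(name)
    return order
```

Calling `start_all(DEPENDS_ON, print)` would print the component names in the order a controller following this table would launch them. A real controller would also have to monitor the services and coordinate with its peers on other machines, which this sketch leaves out.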

We also wanted to embody the principle of "standing on the shoulders of giants", and as such, we employ open-source software as often as possible, where appropriate. Our AppLoadBalancer employs the nginx web server as well as the haproxy load balancer to ensure high performance. Our Memcache API implementation uses memcached under the hood, while our MapReduce API uses Apache Hadoop, which we added to give App Engine users running over AppScale the ability to run Hadoop MapReduce jobs from within their web applications.

Because we were able to keep the database support abstracted away from the other components in the system, we were able to add support for nine different data storage solutions within AppScale: HBase, Hypertable, MySQL, Cassandra, Voldemort, MongoDB, MemcacheDB, Scalaris, and SimpleDB. Many of these databases have seen interest in recent years but vary greatly and have been hard to measure under comparable conditions. To give a few examples, they vary in the query languages they provide, their topologies (e.g., master / slave, peer-to-peer), data consistency policies, and end-user library interfaces. This has made it non-trivial for the community to objectively determine scenarios in which one database performs better or worse than another and investigate why, but under AppScale, deploying all these databases is done automatically with no interaction from the user. And because AppScale is open-source, if a developer doesn't like the particular interface we use for a database, they can improve on it and give back to the community. We've used AppScale internally to evaluate the performance of Google App Engine applications on these datastores, and we've also developed an App Engine app, Active Cloud DB, that exposes a RESTful API that developers can use to access these datastores from any programming language or web framework.
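One way to picture keeping database support abstracted away from the rest of the system is a small common interface that every database adapter implements. The Python sketch below is only an illustration under that assumption: `DatastoreBackend`, `InMemoryBackend`, `BACKENDS`, and `make_backend` are hypothetical names, and the in-memory class stands in for a real adapter that would talk to a database client library.

```python
# Hypothetical sketch of a pluggable datastore layer; none of these names
# are taken from AppScale itself.
from abc import ABC, abstractmethod


class DatastoreBackend(ABC):
    """Minimal interface that each database adapter would implement."""

    @abstractmethod
    def put(self, table, key, value):
        """Store value under key in table."""

    @abstractmethod
    def get(self, table, key):
        """Return the stored value, or None if the key is absent."""

    @abstractmethod
    def delete(self, table, key):
        """Remove key from table if it exists."""


class InMemoryBackend(DatastoreBackend):
    """Stand-in backend backed by dicts instead of a real database."""

    def __init__(self):
        self._tables = {}

    def put(self, table, key, value):
        self._tables.setdefault(table, {})[key] = value

    def get(self, table, key):
        return self._tables.get(table, {}).get(key)

    def delete(self, table, key):
        self._tables.get(table, {}).pop(key, None)


# Registry of backend names; supporting a new database means adding an entry.
BACKENDS = {"memory": InMemoryBackend}


def make_backend(name):
    """Instantiate the backend registered under name."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unsupported datastore: {name}") from None
```

With a layout like this, the code that serves application requests only ever calls `put`, `get`, and `delete`, so swapping one database for another under identical workloads comes down to choosing a different registry entry.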

Finally, the most important lesson we learned was the value of incremental development. Our core development team fluctuates between two and three developers, so from the first meeting we had, we knew that our very first release couldn't support every App Engine API nor could it run nine databases seamlessly. Therefore, we started off with support for the two BigTable clones, HBase and Hypertable, as well as support for just the Datastore API, the URL Fetch API, and the Users API within App Engine. From there, we learned what datastores people actually wanted to see support for as well as what APIs people wanted to use. We were also able to add new APIs for App Engine apps deployed to AppScale: an EC2 API for running virtual machines and a MapReduce API for running computation.

But developing AppScale was certainly not a cakewalk for us. Over the course of the last two years, five major issues (some technical and some not) have arisen within the project:

  1. Writing software that works without knowing ahead of time how many machines will be in the system proved initially to be difficult to grasp, but in many cases we were able to reduce the number of variations that could occur and use that to provide some predictability with respect to how we configure and deploy databases and applications.
  2. We couldn't assume that the AppScale administrator has access to DNS; without it, a number of APIs and features are extremely difficult to implement. Load balancing is much more difficult, and many APIs that are tied to host names must be tied to one machine in the system, or else they don't work properly. VLAN tagging shows some promise to alleviate these problems, but right now it is far from being deployable inexpensively and easily.
  3. The source code for the Java version of App Engine isn't publicly available, so we had to spend a lot of time decompiling the SDK, modifying it to use our database and our API implementations instead of the SDK implementations, and recompiling it. All of these were non-trivial and greatly added to the time it took for us to deploy a version of AppScale with Java App Engine support.
  4. Not all users want a pre-built virtual machine image, so ensuring that the AppScale environment builds correctly every time was a top priority. We had to limit ourselves to Ubuntu Jaunty for many releases, and only recently were we able to expand to include Karmic and Lucid, which still make up only a small fraction of the distributions available in the Linux world. Adding the ability to install AppScale via apt-get in these specific Linux distributions has also been a crucial step in making sure that users could easily and quickly install AppScale for use.
  5. Both undergraduates and graduate students here at UCSB have done projects involving AppScale, which means that the number and experience levels of developers working on AppScale are completely unpredictable at any given moment. Oftentimes the projects they work on are only tangentially related to features that users want, and the time scales on which they are available to work are vastly different from what most software engineers are used to.

All of these problems are greatly exacerbated by only having a two-to-three-person core developer team, but this also makes the AppScale project particularly interesting to work on. Despite having worked on AppScale for two years, there are still tons of interesting problems to work on and we still love the Python App Engine web framework as much as we did when we first picked it up. And of course, AppScale is open-source, under the New BSD License, so feel free to download it and tinker around like we have! Check out AppScale at:

http://appscale.cs.ucsb.edu

http://code.google.com/p/appscale

-- Chris

7 comments:

Fernando said...

Chris, that's a fascinating story. I love the concepts underlying the App Engine platform, and I think your project has a great merit. As with any "clone" projects, I suspect one of the issues is that App Engine is a moving target, constantly adding new features and tweaks.

Chris Bunch said...

Hi Fernando,

That's definitely a big issue - there's a constant struggle between updating to the newest version for both the Python and Java App Engines and adding in our own features for our own experimentation. But then again, that's why it's a fun project to work on :)

Heiko Roth (EGOTEC GmbH) said...

Great work.
Hold on.

Jens said...

This is a very interesting project!!! Please keep us up-to-date about progress.

I ran into a couple of issues with the GAE SDK that could have been fixed easily if their SDK would have been open source. This gives us a new level of contribution and perhaps the GAE team a push to make their stuff open source...

Anyway great project, love to see more
What I did not quite get from your post is this: On one hand you claim support for 8 different databases. On the other hand you mention on the project page "The current version of AppScale does not provide data persistence." Is there somewhere a place where I can find more information about the exact limitations today? I also wonder how you make these very different databases behave in the same way? Is your intent to implement the GAE low-level datastore API for all of them? Or will JPA/JDO be the common API?

Chris Bunch said...

Hi Jens,

With respect to data persistence, what we specifically mean is that we don't yet have a way to port your data in and out of AppScale - that is, like the Bulk Loader for App Engine. So if you terminate your AppScale instance, you lose all data that was running in it. Of course, if you always left the boxes running, it would be persistently written to the disks on those machines.

With respect to the other limitations in the system, they're pretty much identical to the App Engine SDK's limitations - our Google Code page cites additional limitations as needed (e.g., no Blobstore API for AppScale yet).

Finally, we also have a wiki page on the Google Code site named "Default Database Configuration" that outlines the basics of how we make the databases play nicely with App Engine and two peer-reviewed conference papers that detail it more fully for the adventurous reader. To make a long story short, we don't do anything specific at the App Engine layer - it forwards all its internal requests to a specific server within AppScale that translates these requests on a per-database basis.

Thanks for the great questions and the interest in AppScale!

Jens said...

Thanks for the clarification, Chris. I will take a look at the links you have provided. This is very interesting and promising!

Luke Vorster said...

This is probably the most inspiring research I have seen since the late 1990's !!!!!

Go! Go! Go!

I am now researching in the AI field (left the commercial industry shortly after 9/11), and have reached a roof limit of computational feasibility regarding techniques for transparently configuring parallel platforms for a general set of problems to be solved by a set of AI heuristics...

Grid technology falls short (to date), and so do most of the cloud platforms I have looked at.

The way you have leveraged the most scalable, reliable, web technology (GoogleApp) to date, and then put it on an open source cloud platform that will allow the GoogleApp to scale even further has kept me up for the previous three nights!!!

I can't get enough, and won't stop until I can build a mini appscale in my lab...

WELL DONE!!!