Thursday, October 27, 2011

ProdEagle - Analyzing your App Engine apps in real-time


Today’s post comes to us from Andrin von Rechenberg of MiuMeet who has developed an easy to use analysis framework, ProdEagle, for App Engine apps. ProdEagle enables you to easily count and visualize events to help better understand both performance and usage of your site.

ProdEagle allows you to monitor your system, lets you analyse in real-time what is going on and can alert you if something goes wrong.

The story

We are a small start-up that created an incredibly fast growing social dating platform called MiuMeet. We used App Engine from the start and within 6 months we had over 1 million registered users - the system scaled beautifully. One of the world’s leading online dating companies invested in our start-up and gave us a very good piece of advice: Before you build new features, you need to understand what’s actually going on in your system. Of course we thought we knew exactly what is going on in our system. We had the App Engine Dashboard and we had Google Analytics and we ran a couple of daily MapReduces to collect some statistics. But honestly, compared to today, we didn’t have any idea what was going on.

Understanding your system

From our MapReduces we knew that most users who accessed our dating platform used our Android app and only very few users used our iPhone app. We weren’t exactly sure what the reason for that was, but we wanted to improve the experience for our iPhone users. So we had to start measuring what was different for iPhone vs Android users.

But how do you do this? You start recording everything that happens in your system. And by “everything” we don’t mean the status code of an HTTP response but human understandable events, like “an Android user starts a conversation”, “the average length of a conversation”, “how often iPhone users log in” and so on. We realized that iPhone users are much less active and reply less often - not because they don’t want to, but because they don’t know immediately that they got a new message. Android users are informed by Push-Notifications, iPhone users via email. We did not expect that this would make such a big difference, but now it’s pretty obvious what the next feature we should build to improve the iPhone experience. Our conclusions are based on hard numbers and not guesses.

We used ProdEagle to collect and analyse the data that allowed us to realize this. ProdEagle can count, display and compare events in real-time and can even alert you if something is going wrong.

How to measure events

Let’s start with a simple Python example. We would like to record the devices from which messages are sent in our system. All we have to do is add the green lines to our sendMessage function:


import prodeagle

def sendMessage(from, to, text, device):
 mailbox = getMailbox(to)
 mailbox.addMessage(from, text)
 prodeagle.counter.incr("Message." + device)


This one line of code and a few clicks in the ProdEagle dashboard allows us to create the following graphs:



    *The numbers in these graphs are just examples.



Analysing your system in “real-time”

With ProdEagle you can count whatever you want. The advantage over traditional analytics systems is that you don’t have to wait for a day for the data to appear. Data appears within 1 minute in your dashboard. This means that you can monitor your system almost in real-time and set up alerts if something goes wrong.

In the following example we measure how long it takes to execute a datastore query:


def searchPeople(query):
 timestamp = time.time()
 query.execute()
 prodeagle.counter.incr("Search.People.Count")
 prodeagle.counter.incr("Search.People.Latency",
                        time.time() - timestamp)


In the ProdEagle Dashboard we can specify that “Search.People.Latency” should  be divided by “Search.People.Count” and plot it as a graph. Additionally we can set up an email alert that fires if the latency was more than 4000ms for 5 minutes:




If the latency is above all the red dots you get an email alert.

Being able to measure latency in real-time is particularly useful when you are trying to optimize your system and want to try out different strategies to answer queries.


The magic behind ProdEagle

Whenever you call prodeagle.counter.incr(counter_name)the ProdEagle library increments a counter in memcache -  this shouldn’t add any noticeable latency to your app, unless it is the first call made by a serving instance. Every few minutes, the ProdEagle service that we run collects all these memcache counters and persists them, creates the graphs and alerts you if something is wrong.

Obviously, the system isn’t robust if your memcache gets flushed. To address this, ProdEagle uses 1024 dummy counters to check when your memcache gets flushed and to approximate the inaccuracy of the data in your graphs. This is based on the assumption that elements with the same size in memcache are freed in a “least-recently-used” fashion and that the 1024 dummy counters hit all memcache shards. We have used ProdEagle for MiuMeet for a couple of months now and our approximated accuracy was never below 99.8%.


I like! How can I get it?

ProdEagle is currently completely free of charge. You can download the python client libraries and sign up for your dashboard on www.prodeagle.com.

3 comments:

Edward Hartwell Goose said...

The first version of ProdEagle for Java is also now available.

See here: http://code.google.com/p/prodeagle-java/

Blogger said...

"It's completely free for now... (we might have to add billing later)"

Any idea about the cost if billing is added?

Matthias Koch said...

first version for GO (Golang) is ready

https://github.com/fmt-Println-MKO/prodeagle-go

feedback / improvements are always welcome :)