Thursday, February 19, 2009

Back to the Future for Data Storage

Building a massive, distributed datastore which can service requests at an extremely high throughput is something that we've focused on at Google. We created something called Bigtable that underlies the datastore in App Engine. The design for Bigtable focused on scalability across a distributed system so it may operate a bit differently than databases you've worked with before, such as not supporting joins. This isn't an accident -- when you build a system that can scale to the size that Bigtable can there's no way to do a general purpose join on data sets that size and still have them be performant.

Google isn't alone in offering an non-Relational datastore to enable scaling. For example, Amazon has SimpleDB:

A traditional, clustered relational database requires a sizable upfront capital outlay, is complex to design, and often requires a DBA to maintain and administer. Amazon SimpleDB is dramatically simpler, requiring no schema, automatically indexing your data and providing a simple API for storage and access.

There are also a range of non-relational open source datastores now available such as CouchDB and Hypertable. Those are just two examples, there are many more.

While you might think this is all new, it's actually a bit of a return to the past. You see, there was a time when "RDBMS" wasn't always the answer regardless of what the question was. At the time Codd published his paper, "A Relational Model of Data for Large Shared Data Banks," there were many different approaches to datastores. It was only in the '80s that relational databases won the majority of the mindshare. Having settled on a single metaphor the industry has developed many tools and techniques to make developing on a relational database easier.

Unfortunately that majority mindshare is also a problem because while RDBMS' are useful in many situations, they are not useful in all situations. Their dominance in the mindshare means that useful alternatives aren't used, and huge amounts of time and money can be wasted trying to force non-relational problems into a relational model.

We are in the middle of a renaissance in data storage with the application of many new ideas and techniques; there's huge potential for breaking out of thinking about data storage in just one way. Michael Stonebraker pointed out in his paper, "One Size Fits All": An Idea Whose Time Has Come and Gone, that there are common datastore use cases, such as Data Warehousing and Stream Processing that are not well served by a general purpose RDBMS and that abandoning the general purpose RDBMS can give you a performance increase of one or two orders of magnitude.

It's an exciting time, and the takeaway here isn't to abandon the relational database, which is a very mature technology that works great in its domain, but instead to be willing to look outside the RDBMS box when looking for storage solutions.

No comments: