Tom White: Hadoop Query Languages

Friday, 20 June 2008

Hadoop Query Languages

If you want a high-level query language for drilling into your huge Hadoop dataset, you've got a few to choose from:

Pig, from Yahoo! and now incubating at Apache, has an imperative language called Pig Latin for performing operations on large data files.
Jaql, from IBM and soon to be open sourced, is a declarative query language for JSON data.
Hive, from Facebook and soon to become a Hadoop contrib module, is a data warehouse system with a declarative query language that is a hybrid of SQL and Hadoop streaming.
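
To make the imperative/declarative distinction a little more concrete, here is a minimal Python sketch. It is not Pig Latin, Jaql, or Hive syntax: it computes the same result twice, once as an explicit sequence of dataflow steps (roughly the imperative style) and once as a single SQL statement (roughly the declarative style). The sample records, the `visits` table, and the function names are all invented for illustration, and SQLite stands in for a real data warehouse.

```python
import sqlite3
from collections import Counter

# Hypothetical sample data: (visitor, url) page visits.
VISITS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/search"),
    ("carol", "/home"),
    ("alice", "/home"),
]


def imperative_counts(visits):
    """Spell out each step of the dataflow: filter, then group and count."""
    home_visits = [(visitor, url) for visitor, url in visits if url == "/home"]
    counts = Counter(visitor for visitor, _ in home_visits)
    return sorted(counts.items())


def declarative_counts(visits):
    """State the desired result in SQL and let the engine plan the steps."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (visitor TEXT, url TEXT)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", visits)
    rows = conn.execute(
        "SELECT visitor, COUNT(*) FROM visits "
        "WHERE url = '/home' GROUP BY visitor ORDER BY visitor"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(imperative_counts(VISITS))
    print(declarative_counts(VISITS))
```

Both functions return the same per-visitor counts. The difference is who decides the order of operations: the programmer in the first case, the query engine in the second.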

All three projects have different strengths, but there is plenty of scope for collaboration and cross-pollination, particularly in the query language. For example, at the Hadoop Summit in March, Joydeep Sen Sarma of Facebook said that they would be receptive to users who wanted to use Pig Latin or Jaql in Hive. And Kevin Beyer of IBM Research said that Pig and Jaql are converging, and they've had discussions with the Pig team about how to bring them even closer together.

Meanwhile, to learn more, I recommend Pig Latin: A Not-So-Foreign Language for Data Processing (by Chris Olston et al.), and the slides and videos from the Hadoop Summit.

(And I haven't even included Cascading, from Chris K. Wensel, which, while not a query language per se, is an abstraction built on MapReduce for building data processing flows in Java or Groovy using a plumbing metaphor with constructs such as taps, pipes, and flows. Well worth a look too.)

4 comments:

Theodore Omtzigt said...: Tom:

Have you looked at the SQL/DB research out there? I became aware of Berkeley's Telegraph project for federated data queries, and CWI's MonetDB project for column-oriented data. Both projects also reason about nebulous data sets created by sensor networks, and it just gives me a new sense of the wonderful world of information processing on a global scale.

Theo; 25 June 2008 at 01:44
Anonymous said...: Yo,

Hive hit trunk recently, might be worth updating with a link to the JIRA ticket?

Later,
Jeff; 26 August 2008 at 02:25
Tom White said...: Theo, Thanks for the pointers - I'll have a look at them.

Jeff, Hive isn't quite in Hadoop trunk yet. The source has been released though: I've been having a play with it and I like it. (The Jira ticket is linked from the entry BTW, but here it is again: HADOOP-3601).

Tom; 26 August 2008 at 09:35
Anonymous said...: Tom,

Business.com has developed CloudBase, a data warehouse system on top of Hadoop that supports ANSI SQL as its query language.

CloudBase-1.1 was released last week:

http://cloudbase.sourceforge.net

CloudBase creates a database system directly on flat files and converts input ANSI SQL expressions into map-reduce programs for processing those files. It has an optimized algorithm for handling joins, and table indexing is planned for the next release.
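
As a rough illustration of what turning SQL into map-reduce involves, a query such as `SELECT dept, SUM(salary) FROM emp GROUP BY dept` over a flat file can be split into a map step that emits the grouping key and a reduce step that aggregates each group. The sketch below is plain single-process Python, not CloudBase's actual implementation; the query, the record layout, and every function name are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical flat-file contents: one "name,dept,salary" record per line.
LINES = [
    "ann,eng,100",
    "bob,ops,80",
    "cat,eng,120",
    "dan,ops,70",
]


def map_phase(line):
    """Map: parse one record and emit (grouping key, value to aggregate)."""
    _name, dept, salary = line.split(",")
    yield dept, int(salary)


def shuffle(pairs):
    """Shuffle: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: apply the aggregate (here SUM) to one group."""
    return key, sum(values)


def run_query(lines):
    """Run the map, shuffle, and reduce steps in order and sort by key."""
    pairs = (pair for line in lines for pair in map_phase(line))
    groups = shuffle(pairs)
    return sorted(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_query(LINES))
```

On a real cluster the map and reduce functions would run in parallel across many machines, with the framework handling the shuffle; here all three steps run locally just to show how the query decomposes.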

CloudBase comes with a JDBC driver, so third-party BI tools and reporting frameworks can connect to it directly.

-Tarandeep; 29 December 2008 at 20:23