Tom White: January 2008

Wednesday 30 January 2008

Hadoop and Log File Analysis

I've always thought that Hadoop is a great fit for analyzing log files (I even wrote an article about it). The big win is that you can write ad hoc MapReduce queries against huge datasets and get results in minutes or hours. So I was interested to read Stu Hood's recent post about using Hadoop to analyze email log data:

Here at Mailtrust, Rackspace’s mail division, we are taking advantage of Hadoop to help us wrangle several hundred gigabytes of email log data that our mail servers generate each day. We’ve built a great tool for our support team that lets them search mail logs in order to troubleshoot problems for customers. Until recently, this log search and storage system was centered around a traditional relational database, which worked fine until the exponential growth in the volume of our dataset overcame what a single machine could cope with. The new logging backend we’ve developed based on Hadoop gives us virtually unlimited scalability.

The best bit was when they wrote a MapReduce query to find the geographic distribution of their users.

This data was so useful that we’ve scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow.
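To give a feel for what such a query looks like, here is a minimal sketch in plain Python of the map and reduce steps for counting distinct users per region. It is not Mailtrust's job, and it doesn't use Hadoop's API at all: the log line format (`timestamp user client_ip`), the `REGION_BY_PREFIX` lookup table and all the function names are assumptions made up for illustration. A real job would consult a proper GeoIP database and let Hadoop do the grouping between the two phases.

```python
from collections import defaultdict

# Hypothetical lookup table mapping IP prefixes to regions.
# These are documentation-only address ranges, not real data.
REGION_BY_PREFIX = {
    "192.0.2.": "US",
    "198.51.100.": "UK",
    "203.0.113.": "DE",
}


def region_for(ip):
    """Return the region for an IP address, or "unknown" if no prefix matches."""
    for prefix, region in REGION_BY_PREFIX.items():
        if ip.startswith(prefix):
            return region
    return "unknown"


def map_line(line):
    """Map step: emit a (region, user) pair for each well-formed log line.

    Assumed line format: "timestamp user client_ip".
    """
    fields = line.split()
    if len(fields) != 3:
        return  # skip malformed lines
    _, user, ip = fields
    yield region_for(ip), user


def reduce_region(region, users):
    """Reduce step: count the distinct users seen for one region."""
    return region, len(set(users))


def run_job(lines):
    """Run map, group by key, then reduce, all in memory.

    The grouping here stands in for the shuffle that a MapReduce
    framework would perform between the two phases.
    """
    grouped = defaultdict(list)
    for line in lines:
        for region, user in map_line(line):
            grouped[region].append(user)
    return dict(reduce_region(r, u) for r, u in sorted(grouped.items()))
```

Calling `run_job` on a list of log lines returns a dictionary from region to distinct user count. The point of the split is that `map_line` looks at one line at a time and `reduce_region` looks at one region at a time, which is what lets a framework spread the work over many machines.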

It's great when a technology has the ability to make such a positive contribution to your business. In Doug Cutting's words, it is "transformative".

Can we take this further? It seems to me that there is a gap in the market for an open source web traffic analysis tool. Think Google Analytics where you can write your own queries. I wonder who's going to build such a thing?

Wednesday 16 January 2008

Hadoop is now an Apache Top Level Project

Doug Cutting just reported on the Hadoop lists that the Apache board voted this morning (US time) to make Hadoop a TLP. Until now it has been a Lucene subproject, which made sense when Hadoop was broken out from the Nutch codebase two years ago. Since then Hadoop has grown dramatically. This change will make it possible for a number of associated projects - such as HBase and Pig - to become Hadoop subprojects in their own right. (There are more details in the original discussion.)

Thanks for looking after us, Lucene - it's been a great time so far and we'll keep in touch!

Sunday 13 January 2008

MapReduce, Map Reduce, Map/Reduce or Map-Reduce?

Although I've seen the other variants (and used some of them myself), Google call it "MapReduce", and since they invented it, that seems to me like the right thing to call it. The usage figures seem to back up this conclusion. "MapReduce" (no space) has 87,000 Google hits, while "Map Reduce" (with space) has only 50,200, and the latter includes the "Map/Reduce" and "Map-Reduce" variants, since Google (and search engines in general) ignore punctuation.
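To see why the punctuated variants collapse into the two-word form, here is a small Python sketch of a punctuation-ignoring tokeniser. It is only an illustration of the idea: the `tokens` function is my own invention, and it makes no claim about how Google or any other search engine actually tokenises queries.

```python
import re


def tokens(phrase):
    """Split a phrase into lowercase word tokens, treating every run of
    non-alphanumeric characters (spaces, slashes, hyphens) as a separator."""
    return re.sub(r"[^0-9A-Za-z]+", " ", phrase).lower().split()


# The three spaced or punctuated spellings all become the same two tokens,
# while the camel-case spelling stays a single token.
variants = ["Map Reduce", "Map/Reduce", "Map-Reduce", "MapReduce"]
for variant in variants:
    print(variant, "->", tokens(variant))
```

Under this scheme "Map Reduce", "Map/Reduce" and "Map-Reduce" are indistinguishable, so their hits get counted together, and only "MapReduce" stands on its own.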

In this age of case sensitivity and camel case one has to watch out for these things. In fact, I've only just realised that the Hadoop database is called "HBase", not "Hbase". The curse of wiki names. And the logo doesn't help either - it's all caps!

Monday 7 January 2008

Casual Large Scale Data Processing

I think Greg Linden hits the nail on the head when he says of MapReduce at Google:

What is so remarkable about this is how casual it makes large scale data processing. Anyone at Google can write a MapReduce program that uses hundreds or thousands of machines from their cluster. Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they can iterate on it and try something else if they didn't get what they wanted the first time.

I think this is a good way of looking at what Hadoop is trying to achieve - to make large scale data processing as natural as small scale data processing. Hadoop can provide infrastructure to do this, but there is also a need for open datasets that are MapReducable. Imagine if I could run a MapReduce program against Google's web crawl database, Amazon's sales data or Facebook's social graph.