Problems worthy of attack prove their worth by hitting back. —Piet Hein

Saturday, 22 March 2008

Learning MapReduce

Update: I've posted my answers to the exercises. Let me know if you find any mistakes. Also: Tamara Petroff has posted a write up of the session.

On Wednesday [19 March], I ran a session at SPA 2008 entitled "Understanding MapReduce with Hadoop". SPA is a very hands-on conference, with many sessions having a methodological slant, so I wanted to get people who had never encountered MapReduce before actually writing MapReduce programs. I only had 75 minutes, so I decided against getting people coding on their laptops. (In hindsight this was a good decision, as I went to several other sessions where we struggled to get the software installed.) Instead, we wrote MapReduce programs on paper, using a simplified notation.

It seemed to work. For the first half hour, I gave as minimal an introduction to MapReduce as I could, then the whole group spent the next half hour working in pairs to express the solutions to a number of exercises as MapReduce programs. We spent the last 15 minutes comparing notes and discussing some of the solutions to the problems.

There were six exercises, presented in rough order of difficulty, and I'm pleased to say that every pair managed to solve at least one. Here's some of the feedback I got:
  • Some struggled to know what input data formats to use. Perhaps I glossed over this too much - I didn't want people to worry about precisely how the data was encoded - but I could have emphasised more that you can have the data presented to your map function in any way that's convenient.
  • While most people understood the notation I used for writing the map and reduce functions, it did cause some confusion. For example, someone wanted to see the example code again so they could understand what was going on. And another person said it took a while to realise that they could do arbitrary processing as a part of the map and reduce functions. It would be interesting to do the session again but using Java notation.
  • It was quite common for people to try to do complex things in their map and reduce functions - they felt bad if they just used an identity function, because it was somehow a waste. And on a related note, chaining map reduce jobs together wasn't obvious to many. But once pointed out, folks had an "aha!" moment and were quick to exploit it.
  • The fact that you typically get multiple reduce outputs prompted questions from some - "but how do you combine them into a single answer?". Talking about chained MapReduce helped here again.
  • Everyone agreed that it wasn't much like functional programming.
You can find the slides on the Hadoop wiki. They include the six exercises, which I've reproduced below, in rough order of difficulty. (I'll post my answers next week.)
  1. Find the [number of] hits by 5 minute timeslot for a website given its access logs.
  2. Find the pages with over 1 million hits in day for a website given its access logs.
  3. Find the pages that link to each page in a collection of webpages.
  4. Calculate the proportion of lines that match a given regular expression for a collection of documents.
  5. Sort tabular data by a primary and secondary column.
  6. Find the most popular pages for a website given its access logs.
Is this a good list of exercises? Do you have any exercises that you've found useful for learning MapReduce?

Finally, thanks to Robert Chatley for being a guinea pig for the exercises, and for helping out on the day with participants' questions during the session.

2 comments:

Unknown said...

Hi Tom,

In the slides, you described
a Sort as:
Map: (k, v) -> [(k, v)]
Reduce: (k, [v1, v2, ...]) -> [(k, v1), (k, v2), ...]

Can you give more info. What is the data that it is sorting?
Perhaps if you can work through an example or give pseduocode for a Java implementation it would be great.

Thanks for all the info you have provided. I have your Hadoop book
and am working through it to learn Hadoop.

regards,
David
davidsuperca@gmail.com

Unknown said...

hi,

regarding my question about the sort.
I think I figured it out. I think by default hadoop sorts the keys by default. I think your code notation uses that fact to do the sorting.