Problems worthy of attack prove their worth by hitting back. —Piet Hein

Thursday 3 July 2008

Hadoop beats terabyte sort record

Hadoop has beaten the record for the terabyte sort benchmark, bringing it from 297 seconds to 209. Owen O'Malley wrote the MapReduce program (which by the way has a clever partitioner to ensure the reducer outputs are globally sorted and not just sorted per output partition, which is what the default sort does), and then ran it on 910 nodes on Yahoo!'s cluster. There are more details in Owen's blog post (and there's a link to the benchmark page which has a PDF explaining his program). You can also look at the code in trunk.

Well done Owen and well done Hadoop!

1 comment:

maverick said...

Hi, can Hadoop be used for sorting, merging and summarizing ascii files on Linux?

My organization is replatforming to linux and need sort utility for processing the input files before putting them into the batch stream. currently we use syncsort and was wondering if there is an open source utility.

Please advise