Problems worthy of attack prove their worth by hitting back. —Piet Hein

Tuesday 18 March 2008

"Disks have become tapes"

MapReduce is a programming model for processing vast amounts of data. One of the reasons it works so well is that it exploits a sweet spot of modern disk drive technology trends. In essence, MapReduce works by repeatedly sorting and merging data that is streamed to and from disk at the transfer rate of the disk. Contrast this with accessing data from a relational database, which operates at the seek rate of the disk (seeking is the process of moving the disk's head to a particular place on the disk to read or write data).

So why is this interesting? Well, look at the trends in seek time and transfer rate. Seek time has been improving at about 5% a year, whereas transfer rate has been improving at about 20% [1]. Seek time is improving more slowly than transfer rate, so it pays to use a model that operates at the transfer rate - which is what MapReduce does. I first saw this observation in Doug Cutting's talk, with Eric Baldeschwieler, at OSCON last year, where he worked through the numbers for updating a 1 terabyte database using the two paradigms: B-Tree (seek-limited) and Sort/Merge (transfer-limited). (See the slides and video for more detail.)
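
To get a feel for the gap, here is a back-of-envelope sketch in Python. The drive parameters (10 ms per seek, 100 MB/s sustained transfer), the record size and the fraction of records updated are illustrative assumptions, not the figures from the talk:

    # Updating records in a 1 TB dataset: seek to each record in place
    # (B-Tree style) versus rewriting the whole dataset as a streaming
    # sort/merge. All drive and workload numbers are assumptions.

    TB = 10**12
    SEEK_TIME = 0.010         # seconds per seek (assumed ~10 ms)
    TRANSFER_RATE = 100e6     # bytes/second sustained (assumed ~100 MB/s)
    RECORD_SIZE = 100         # bytes per record (assumed)
    UPDATE_FRACTION = 0.10    # fraction of records updated (assumed)

    records = TB // RECORD_SIZE
    updates = int(records * UPDATE_FRACTION)

    # Seek-limited: one seek (plus a tiny read/write) per updated record.
    btree_seconds = updates * SEEK_TIME

    # Transfer-limited: stream the whole dataset off disk and back on.
    sort_merge_seconds = 2 * TB / TRANSFER_RATE

    print(f"B-Tree updates : {btree_seconds / 86400:8.1f} days")
    print(f"Sort/merge pass: {sort_merge_seconds / 86400:8.1f} days")

Even with generous assumptions for the seek-limited case, the streaming rewrite wins by orders of magnitude, which is the whole argument in a few lines of arithmetic.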

The general point was well summed up by Jim Gray in an interview in ACM Queue from 2003:
... programmers have to start thinking of the disk as a sequential device rather than a random access device.
Or, more pithily: "Disks have become tapes." (Quoted by David DeWitt.)

But even the growth of transfer rate is dwarfed by another measure of disk drives - capacity, which is growing at about 50% a year. David DeWitt argues that since the transfer rate of drives is effectively falling relative to their capacity, we need database systems that work with this trend - such as column-store databases and wider use of compression (since compression effectively increases the transfer rate of a disk). Of existing databases he says:
Already we see transaction processing systems running on farms of mostly empty disk drives to obtain enough seeks/second to satisfy their transaction processing rates.
But this applies to transfer rate too (or if it doesn't yet, it will). Replace "seeks" with "transfers" and "transaction processing" with "MapReduce" and I think over time we'll start seeing Hadoop installations that choose to use large numbers of smaller capacity disks to maximize their processing rates.
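
To see where this is heading, run the two growth rates forward from a hypothetical starting point (the 500 GB capacity and 100 MB/s transfer rate below are assumptions for illustration):

    # If capacity grows ~50% a year but transfer rate only ~20% a year,
    # the time to stream a whole drive grows by about 1.5 / 1.2 = 1.25x
    # a year. Starting capacity and transfer rate are assumed values.

    capacity = 500e9        # bytes (assumed starting point)
    transfer_rate = 100e6   # bytes/second (assumed starting point)

    for year in range(0, 11, 2):
        cap = capacity * 1.5 ** year
        rate = transfer_rate * 1.2 ** year
        hours = cap / rate / 3600
        print(f"year {year:2d}: {cap / 1e12:5.1f} TB drive, "
              f"{hours:5.1f} hours to stream it all")

Under these assumed rates the time to scan a full drive roughly doubles every three years, which is why keeping processing rates up ends up meaning more spindles holding less data each.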

[1] See Trends in Disk Technology by Michael D. Dahlin for changes between 1987 and 1994. For the period since then these figures still hold, as is relatively easy to check using manufacturers' data sheets, although with seek time it's harder to tell since the definitions seem to change from year to year and from manufacturer to manufacturer. Still, 5% is generous.

17 comments:

rektide said...

IOPS have gone up two orders of magnitude in the past three years. Currently, transfer rates for SSDs are usually below those of top-grade hard drives, but that will surely change. HDs have been "just above" 100 MBps for quite some time without making dramatic shifts; they will certainly be overtaken.

Are hard drives out? Not at all; for now their cost and data density are unparalleled. But I think the MapReduce situation changes considerably when you can think about chunking half-terabyte datasets onto very-high-IOPS nearline storage. There's still a role for MapReduce to linearly chunk and stream data onto SSDs, but the chunk size MapReduce deals with will be able to be far bigger, giving new life to random access.

Which is a good thing, because as great as MapReduce is, there are a lot of places where random access is unavoidable. My concurrency professor liked to say that there were a lot of problems that were "embarrassingly parallel," and MapReduce is a wonderful way to chunk up those jobs and distribute them, but not all problems can be localized and subdivided that way.

Anonymous said...

You are simply not correct. I suggest that you read more on how disk-based sorting algorithms work. Each disk-based sorting algorithm has a sort phase (run formation) and a merge phase (merging the different runs). However, during the merge phase the algorithm does not know where the smallest value will reside, hence it must do a seek to read a small amount of I/O from each run file (this is the N-way merge). So unless you store each run file on a separate disk you will do seeks. The same algorithm is used by any external disk-based sort, regardless of its name (RDBMS or MapReduce).
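
For anyone following along, here is a minimal Python sketch of the two phases being described; the run size, the temporary files and the integer records are simplified for illustration:

    # External sort: run formation followed by an N-way merge.
    import heapq
    import tempfile

    def external_sort(records, run_size):
        # Phase 1: run formation - sort memory-sized chunks and spill
        # each one to its own temporary run file.
        runs = []
        for i in range(0, len(records), run_size):
            run = sorted(records[i:i + run_size])
            f = tempfile.TemporaryFile(mode="w+")
            f.writelines(f"{r}\n" for r in run)
            f.seek(0)
            runs.append(f)

        # Phase 2: N-way merge - repeatedly take the smallest record
        # from whichever run currently holds it. Real systems read each
        # run through a large buffer, so the disk pays a seek per buffer
        # refill per run rather than a seek per record.
        readers = [(int(line) for line in f) for f in runs]
        return list(heapq.merge(*readers))

    print(external_sort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0], run_size=3))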

Anonymous said...

Not to mention that in 3-5 years SSDs will be standard equipment. Their random and sequential read speeds are about equal.

Perhaps the idea of drives = tapes is applicable in the sense that drives, with their hugely increasing capacities, will turn into more of a backup medium.

Anonymous said...

eric14!

Anonymous said...

Uh, exactly what the fuck does all this MEAN?

Is it a translation of the original Sanskrit, or what?

Anonymous said...

Reddit comments.

Anonymous said...

@2 regarding external sort/merge: N-way merging (where N is typically < 10) doesn't incur a seek per record but per buffered read per spindle, while typical B-Tree nodes are much smaller than the read buffers of these streams. So B-Trees indeed incur many more seeks when reading/writing large tables.

Also, MapReduce typically runs on a DFS, which indeed means that the streams are likely coming from many spindles.

That's how Bigtable and similar new-generation DBs like Hypertable (and theoretically HBase) can achieve disk transfer rate even for random writes to petabyte-sized tables (see the sketch at the end of this comment).

Flash SSD is an interesting diversion for read-mostly apps, as frequent writes can wear it out quickly.

Hopefully MRAM will save the day, though it is about 10 years away from reaching feasible cost and density.
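
(A toy sketch of the general log-structured idea referred to above - not Bigtable's actual design; the class name and flush threshold are made up for illustration:)

    # Random key updates are buffered in memory and flushed as sorted,
    # sequentially written runs, so the disk only ever sees streaming
    # writes; a compaction step would merge runs in the background.

    class LogStructuredStore:
        def __init__(self, flush_threshold=4):
            self.memtable = {}           # in-memory buffer for recent writes
            self.runs = []               # each run: a sorted list of (key, value)
            self.flush_threshold = flush_threshold

        def put(self, key, value):
            self.memtable[key] = value   # random-access write, but only in RAM
            if len(self.memtable) >= self.flush_threshold:
                self.flush()

        def flush(self):
            # One sorted, sequential write per flush - transfer-rate bound.
            self.runs.append(sorted(self.memtable.items()))
            self.memtable.clear()

        def get(self, key):
            # Check the memtable first, then the runs, newest to oldest.
            if key in self.memtable:
                return self.memtable[key]
            for run in reversed(self.runs):
                for k, v in run:
                    if k == key:
                        return v
            return None

    store = LogStructuredStore()
    for i, key in enumerate("dacbfe"):
        store.put(key, i)
    print(store.get("c"), store.get("f"))   # prints: 2 4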

Anonymous said...

What about the deterioration of SSDs? As I understand it, there is a finite number of times the bits can be flipped, so SSDs are better suited to workloads that write less frequently and read more often.

Anonymous said...

To follow up on what I said earlier:

Yes, deterioration is a concern. That's why I said 3-5 years to become standard equipment, so those issues can be resolved or improved. Besides, it is the data, not the drives, that holds the value. Your data is backed up, right?

The other benefit that SSD devices offer is that their reads (transfer/seek) are consistent across the entire chip. Hard drives are not: transfer and seek times vary according to where on the platter you are reading from.

Interesting stuff though.

Anonymous said...

vicaya...

Disks are never accessed per record. The smallest unit of I/O for direct I/O is a page (e.g. 8K).
Also, the N in an N-way merge is usually much larger than 10:
in the sort phase, you make runs the size of your memory, which is, for example, 4 GB. So if you have a 1 TB file, you would have about 250 runs.
Also, I am not talking about B-Trees. B-Trees are good for few-record lookups but are bad for large scans (unless the B-Tree covers your search keys, in which case you can scan the B-Tree itself).

I do not understand how SSDs are more beneficial to MapReduce than to an RDBMS. The key here is to look at the I/O patterns of the algorithm and not at the specific hardware.

Anonymous said...

I don't believe that SSDs will catch up to disks anytime in the next 5 years. And if they do, then people will invent applications that need more storage, magnetic being cheaper than solid state. For example, there is no way that I can fit my music collection (> 256 GB) on any SSD, and it is still growing; I still have not started with video, and my digital pics are 10 megapixels now. Sure enough, the Seagate drives I use to store my MP3s feel slower with every new increase in capacity. Finally, I work in IT at a big firm. It used to be the case that a 2 GB file size limit was OK; our customers didn't seem to want to keep so much data around. Of course, that was then, this is now: 4 GB files are the norm (we built our own tiny DB access library for these using LFS under Linux).

Anonymous said...

SSDs do not compete with disk but live as part of the cache hierarchy (L1 -> L2 -> RAM -> SSD -> DISK -> TAPE). They actually solve a very real problem (at 10x the price). As an analogy: going from RAM to DISK feels like traveling 100 miles back and forth each time you need some book from your library, while accessing SSD should feel like traveling a mile.

SameerDS said...

The text is totally confused about the terms "foobar rate" and "foobar time". At one point the text says "transfer rate is increasing" and somewhere else it says "transfer time is increasing". It would seem to me that time and rate are inversely proportional to each other, and definitely not interchangeable like this.

Theodore Nordsieck said...

vicaya:

With current wear-leveling techniques, modern (high-end) flash drives have an MTBF roughly equivalent to that of hard drives. That will only improve with time.

Theodore Nordsieck said...

sameerds:

rate = size/time

file size is not a constant over disk generations

Julien Couvreur said...

Here's a very good article which summarizes the bandwidth and latency between the various components in modern computers (CPU, caches, RAM, bus, HDD, network):
http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait

That data is very much in line with the Hadoop design principles (stream from disk).

Julien Couvreur said...

Another thing about Flash.
Projections show that it will become cheaper than SATA by 2012.
Also, it offers very good bandwidth. That said, its latency is only a bit better than an HDD's (and much slower than RAM).

This means that sequential data access is still highly desirable, as you can pre-fetch and buffer access (use bandwidth without suffering from latency).
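
A toy model makes the prefetch-and-buffer point concrete; the latency and bandwidth figures below are assumptions rather than measurements of any particular device:

    # Effective throughput of a read is size / (latency + size / bandwidth),
    # so large, prefetched sequential reads amortize the latency and
    # approach the device's full bandwidth. Device numbers are assumed.

    LATENCY = 100e-6      # seconds per request (assumed)
    BANDWIDTH = 200e6     # bytes/second (assumed)

    for size in (4 * 1024, 64 * 1024, 1024 * 1024, 16 * 1024 * 1024):
        effective = size / (LATENCY + size / BANDWIDTH)
        print(f"{size // 1024:6d} KB reads -> {effective / 1e6:6.1f} MB/s effective")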