Problems worthy of attack prove their worth by hitting back. —Piet Hein

Saturday 28 July 2007

Hadoop at OSCON 2007

I wasn't there, but Hadoop had two airings at OSCON this week. Doug Cutting took part in the Open Source Radar Executive Briefing with Tim O'Reilly to talk about scaling.

He also gave a talk with Eric Baldeschwieler, entitled "Meet Hadoop", in which he gave a great exposition of the problem that Hadoop is designed to solve. In short: disk seeks are expensive, so databases built on sort-merge, which is limited by transfer speed rather than seek speed, scale better than traditional B-tree databases, which are limited by seek speed. There are more details and examples in the slides.
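To put rough numbers on the seek-versus-transfer argument, here is a back-of-envelope sketch. Every figure in it is an illustrative assumption rather than a number taken from the slides: a 10 ms average seek, 100 MB/s sustained transfer, a 1 TB dataset of 100-byte records, and an update touching 1% of the records. The class and method names are made up for the example. It compares paying one seek per updated record against streaming the whole dataset in and writing a merged copy back out.

```java
// Back-of-envelope comparison; every constant below is an assumed, illustrative figure.
final class SeekVsTransfer {
    static final double SEEK_SECONDS = 0.010;            // assumed average seek time: 10 ms
    static final double TRANSFER_BYTES_PER_SEC = 100e6;  // assumed sustained transfer: 100 MB/s

    /** Time to update records in place, paying one seek per updated record. */
    static double seekBoundSeconds(long updatedRecords) {
        return updatedRecords * SEEK_SECONDS;
    }

    /** Time to read the whole dataset once and write a merged copy once. */
    static double transferBoundSeconds(long datasetBytes) {
        return 2.0 * datasetBytes / TRANSFER_BYTES_PER_SEC;
    }

    public static void main(String[] args) {
        long datasetBytes = 1_000_000_000_000L;    // assumed dataset size: 1 TB
        long recordBytes = 100;                    // assumed record size
        long records = datasetBytes / recordBytes; // 10 billion records
        long updated = records / 100;              // assumed update: 1% of records

        System.out.printf("seek-bound update:     %.1f days%n",
                seekBoundSeconds(updated) / 86_400);
        System.out.printf("transfer-bound update: %.1f hours%n",
                transferBoundSeconds(datasetBytes) / 3_600);
    }
}
```

Under these assumptions the seek-bound update takes about a million seconds (well over a week), while rewriting the entire dataset takes about 20,000 seconds (a few hours). The exact figures matter less than the shape: the seek-bound cost grows with the number of records touched, while the transfer-bound cost grows only with the size of the data.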

Eric's half of the talk gave some interesting tidbits about how Hadoop is being used at Yahoo! For example, they are running Hadoop on about 10,000 machines, and the biggest cluster is 1600 machines! With these kinds of numbers, I can see how Nigel Daley came to coin Nigel's Law:

In a large enough cluster, there are NO corner cases

JUnit gets flexible and philosophical

My colleague Robert Chatley pointed me to an InfoQ announcement about the release of JUnit 4.4. It looks interesting. For a start, it includes Hamcrest matchers, which allow you to write flexible assertions using the assertThat construct. I've had some involvement with Hamcrest, and I really like using it since it allows me to write tests that I can read, so I'm really pleased that it's got into JUnit, as this can only increase its takeup. Well done Joe for having the idea in the first place!
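Here is a minimal sketch of what assertThat with matchers looks like, assuming JUnit 4.4 or a later 4.x release is on the classpath. The test class and the countWords helper are hypothetical, there only so the assertions have something to check. The matchers used (is, not, equalTo) come from org.hamcrest.CoreMatchers.

```java
import static org.hamcrest.CoreMatchers.equalTo;
import static org.hamcrest.CoreMatchers.is;
import static org.hamcrest.CoreMatchers.not;
import static org.junit.Assert.assertThat;

import org.junit.Test;

public class WordCountTest {
    // Hypothetical helper, present only so the test has something to check.
    static int countWords(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    @Test
    public void countsWhitespaceSeparatedWords() {
        // Each assertion reads as "assert that <actual> <matcher>".
        assertThat(countWords("meet  hadoop at oscon"), is(4));
        assertThat(countWords("   "), is(equalTo(0)));
        assertThat(countWords("one"), is(not(0)));
    }
}
```

The appeal is readability: the actual value comes first and the matcher describes the expectation, and when an assertion fails the matcher supplies a description of what was expected.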

The new release also includes theories, adopted from the Popper JUnit extension, which are tests that apply to a (potentially infinite) set of data points. This reminds me of Andreas Leitner's talk AutoTest: Push-button testing using contracts from the Google London Test Automation Conference (in 2006) where he talked about using Eiffel contracts to generate test cases to look for contract violations. David Saff is one of the creators of JUnit theories and he also talks about the relation to contracts in his paper (authored with Marat Boshernitsan), The Practice of Theories: Adding "For-all" Statements to "There-Exists" Tests. I'm looking forward to trying this out.
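As a rough sketch of what a theory looks like, again assuming JUnit 4.4 or a later 4.x release on the classpath: the class name and the data points below are made up for illustration. The theory states a property that should hold for every combination of data points that satisfies its assumption, here that integer quotient and remainder always rebuild the dividend when the divisor is non-zero.

```java
import static org.hamcrest.CoreMatchers.is;
import static org.hamcrest.CoreMatchers.not;
import static org.junit.Assert.assertThat;
import static org.junit.Assume.assumeThat;

import org.junit.experimental.theories.DataPoint;
import org.junit.experimental.theories.Theories;
import org.junit.experimental.theories.Theory;
import org.junit.runner.RunWith;

@RunWith(Theories.class)
public class DivisionTheoryTest {
    // Illustrative data points; the runner tries every pairing of them.
    @DataPoint public static final int ZERO = 0;
    @DataPoint public static final int THREE = 3;
    @DataPoint public static final int MINUS_SEVEN = -7;
    @DataPoint public static final int FORTY_TWO = 42;

    @Theory
    public void quotientAndRemainderRebuildTheDividend(int dividend, int divisor) {
        // Pairings that violate the assumption are skipped rather than failed.
        assumeThat(divisor, is(not(0)));

        int quotient = dividend / divisor;
        int remainder = dividend % divisor;
        assertThat(quotient * divisor + remainder, is(dividend));
    }
}
```

The assumption plays the role of a precondition and the assertion that of a postcondition, which is where the resemblance to contracts comes from.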

Thursday 19 July 2007

Articles and blogs

Coinciding nicely with Jakob Nielsen's admonition Write Articles, Not Blog Postings (via Steve Loughran), here's my latest article, Running Hadoop MapReduce on Amazon EC2 and Amazon S3, published on Amazon Web Services Developer Connection.

It's almost a year and a half since I wrote my last article, and one of the reasons is that I've blogged more. Not much, but more.

Jakob's got a point about being directed to short, old, irrelevant blog postings when you're looking for something. He's not saying don't blog, but rather try to write stuff that will be long-lived. Definitely something for tech writers to aim for - but not at the cost of not writing anything (so it's OK to sprinkle blogs with lighter weight stuff).

Monday 16 July 2007

Why are there no Amazon S3/EC2 competitors?

Amazon's storage and compute services (S3 and EC2 respectively) are widely seen as game changers. So, almost one year on from EC2's launch, why is it that there are no competitors in this space? One commenter on Artur Bergman's post entitled Amazon Web Services and the lack of a SLA made the good point that a "competitive utility computing market" would effectively solve any disaster recovery problems, and make such services even more popular.

Meanwhile, TechCrunch reports a rumour that Amazon will offer a MySQL web service by the end of the year.

Sunday 15 July 2007

Hadoop Development Steadily Rising

Judging by this graph showing posts to the dev list (on Gmane), the rate is currently about 50 posts a day. This has roughly doubled since the beginning of the year. Some of the increase is down to the momentum behind Hbase (which provides Bigtable-like capabilities on top of Hadoop), but I think it is also down to general growth - more people seem to be participating in development than ever before. This is great! Obviously the risk is that at such a rate of development Hadoop becomes unstable, but so far it looks like it's under control - the (informal) feedback I've had tells me the 0.13.0 release is one of the most stable we've done.