Problems worthy of attack prove their worth by hitting back. —Piet Hein

Saturday 4 June 2011

What's new in Apache Whirr 0.5.0-incubating

Apache Whirr 0.5.0-incubating is now available. Whirr is a library and command line interface for running distributed services like Apache Hadoop in the cloud. Note that Whirr is currently undergoing Incubation at the Apache Software Foundation, which means that, in particular, the project has yet to be
fully endorsed by the ASF. Please read the full disclaimer.

In this release the Whirr development team have added many new features while also making the core more solid. This post covers some of the more important changes. The full list can be found in the release notes.

Improving the new user experience

Making it simple to orchestrate multiple services on cloud instances is a challenge, and Whirr has sometimes been a little fiddly to get running. SSH settings, in particular, have been a common sticking point for new users. The new Whirr in 5 Minutes guide walks through the minimum set of commands you need to type to get a simple 3-node ZooKeeper cluster running in a few minutes. From there you can move on to the Quick Start Guide and the Configuration Guide.

The sample configurations in the recipes directory of the distribution contain useful settings for running the services on a variety of cloud providers. Users are always encouraged to share their working configurations with the community.
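
To give an idea of what a recipe contains, here is a minimal sketch for a ZooKeeper cluster on EC2. The values are illustrative, and the exact provider id depends on the jclouds version in use, so treat the bundled recipes as the definitive reference:

whirr.cluster-name=zookeeper
whirr.instance-templates=3 zookeeper
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}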

New services

Elastic Search and Voldemort have been added to the roster of services that come with Whirr. This brings the total to six, joining Apache Cassandra, Apache Hadoop, Apache HBase, and Apache ZooKeeper.

API improvements

Whirr is still a young project, so it is not surprising that its API is evolving rapidly. In WHIRR-245, the demarcation between the user API (for users who control Whirr clusters from Java) and the service API (for developers writing new Whirr services) was clarified. The user API can be found in the org.apache.whirr package, whereas the service API is in org.apache.whirr.service.

You can find out more about writing Whirr services in this presentation (PDF).

The firewall API that service writers use to open ports for services was simplified and made more powerful in WHIRR-275.

Overriding scripts

This feature was actually introduced in Whirr 0.4.0-incubating, but it's useful enough to mention here. In older versions of Whirr, if you wanted to modify the scripts that run on cloud instances - to tweak some settings, for instance - you had to upload your modified scripts (along with all the other scripts) to a publicly accessible web server (Amazon S3 was a common choice), then point Whirr at the new location. Not particularly difficult, but a big enough barrier to discourage users from trying it.

The new approach is to push scripts to nodes from the launching machine, so you can just edit them locally before launch. Full instructions are covered in the FAQ.
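
As a sketch of the workflow (the directory, script, and recipe names here are assumptions on my part - the FAQ has the definitive instructions): you edit the bundled script locally and launch as normal, and Whirr pushes your modified version to the nodes.

vi functions/configure_hadoop.sh       # hypothetical script name: tweak a setting in your local copy
bin/whirr launch-cluster --config recipes/hadoop-ec2.properties   # the edited script is pushed to the nodes at launch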

Running scripts on nodes

In 0.5.0 the scripts that run on cloud instances have been broken up into more fine-grained pieces, so many services now have individual start and stop scripts (WHIRR-266). Combined with the ability to run scripts on sets of nodes in the cluster, selected by ID or role, this gives users more control over the cluster once it has launched (WHIRR-173). Try running whirr run-script at the command line to use this feature; a sketch of its usage is shown below. There's a contrib script to run the Yahoo! Cloud Serving Benchmark (YCSB) against an HBase cluster, which takes advantage of the run-script command (WHIRR-287).
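
To make run-script concrete, running a local script on every node in a given role looks something like the following. The script name is made up and the option names are from memory, so check the command's usage output for the definitive flags:

# run a local script on all nodes in the zookeeper role
# (hypothetical script name; option names may differ slightly in your version)
bin/whirr run-script --script print-status.sh --roles zookeeper \
  --config recipes/zookeeper-ec2.properties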

Also useful is WHIRR-291, which allows you to launch "blank" nodes with no services running on them (in a "noop" role), and then, with whirr run-script, run arbitrary scripts on them to bring them into the state you want.
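
In configuration terms, a blank cluster is just an instance template that uses the noop role. Something along these lines (cluster, script, and file names are hypothetical; option names as in the sketch above) launches three empty nodes and then provisions them by hand:

# blanknodes.properties - three nodes with no services installed (illustrative)
whirr.cluster-name=blanknodes
whirr.instance-templates=3 noop

bin/whirr launch-cluster --config blanknodes.properties
bin/whirr run-script --script bootstrap-my-stack.sh --roles noop --config blanknodes.properties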

Custom service builds

Developers who work on services supported in Whirr will find the ability to push a custom build to a cluster very useful for testing (WHIRR-220). For example, if you are working on a ZooKeeper feature, you can build a ZooKeeper tarball with your new feature, then launch a cluster that uses this tarball by specifying whirr.zookeeper.tarball.url as a local file:// URL pointing to your tarball. Whirr will push the tarball to a temporary blob store container, then each node will download from there.
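
In configuration terms that amounts to a couple of properties, along these lines (the tarball path is made up for illustration):

whirr.instance-templates=3 zookeeper
whirr.zookeeper.tarball.url=file:///tmp/zookeeper-3.4.0-custom.tar.gz

The Hadoop service has an analogous tarball property, which is what the next paragraph uses with a remote URL instead of a local file.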

I used a variation of this feature to try out a nightly Hadoop 0.22 build on a small Whirr cluster. In this case the tarball URL is not a local file, so Whirr doesn't copy the tarball to a blob store since it is already accessible from the cloud.

Service improvements

Whirr is only able to exist because of the powerful abstraction that jclouds provides for interacting with cloud providers. A great example of this power is the jclouds API for discovering the hardware capabilities of an instance running on any provider. WHIRR-282 uses this API to find the number of cores on each node and dynamically configure the number of task slots in a Hadoop cluster. Previously, you had to set this manually for each cluster to take full advantage of larger image sizes.

This is just the beginning - there is more work to do to use memory information when setting configuration (WHIRR-229), and to take advantage of hardware capabilities in services other than Hadoop.

Cluster state storage

In previous releases of Whirr, information about launched instances was stored in a file on the machine that launched the cluster (~/.whirr/<cluster-name>/instances). With WHIRR-288, it's now possible to store this information in a blob store instead (such as Amazon S3, although any jclouds-supported blob store can be used), which is useful if you want to control clusters from multiple machines.
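
The gist, as I understand it, is a couple of new properties; the names below are an approximation, so check the configuration guide and WHIRR-288 for the definitive details:

# store cluster state in a blob store container instead of ~/.whirr (approximate property names)
whirr.state-store=blob
whirr.state-store-container=my-whirr-clusters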

Bring Your Own Nodes

Or just BYON, for short. Many users have requested the ability to deploy to privately owned hardware - and jclouds added this feature in 1.0-beta-9. Whirr now has preliminary support for BYON clusters. In a nutshell, you write a YAML file enumerating the nodes to deploy to - their addresses, access credentials, and so on - and Whirr will start services on them. The nodes just need to have a base OS like CentOS or Ubuntu installed. You can find an example BYON configuration in the recipes directory of the download.
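
The node listing follows the jclouds BYON format. A single entry looks roughly like the following - the field names and values are written from memory and are only illustrative, so treat the recipe shipped in the distribution as the authoritative example. On the Whirr side, the recipe sets whirr.provider=byon and points jclouds at this file.

# illustrative BYON node listing; see the bundled recipe for the exact field names
nodes:
  - id: node-1
    hostname: 192.168.1.101
    os_family: ubuntu
    os_version: 10.04
    os_arch: x86_64
    username: myuser
    credential_url: file:///home/myuser/.ssh/id_rsa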

BYON is also useful for testing locally by using VMware or VirtualBox to host target nodes.

A hummingbird
[The new Whirr logo: a hummingbird]

Last, but not least, Whirr finally has a logo! Many thanks to Alison Wong, who designed it and donated it to the ASF.

Credits

I would like to thank everyone who helped with the 0.5.0-incubating release. We have a growing community, and we welcome feedback and help from new users and developers. If you'd like to get involved you can start by downloading the new release and joining us on the mailing lists.

What's next?

It's difficult to make firm predictions about the contents of the next release since Whirr is an open source project with many open issues, but the general themes include:
  • Adding more services. In tandem, we want to make it easier to write new services by pushing common patterns into the core (WHIRR-326 is one example of this).
  • Improving existing services by making them more flexible, better configured, and easier to manage.
  • Adding more cloud providers. The latest release of jclouds supports 30 providers, and we need help testing more of them with Whirr.
  • Implementing services using other configuration management tools, rather than bash scripting. Andrei Savu is working on using Puppet to write new services (WHIRR-255).
  • Supporting elastic clusters, so new nodes can be added to running clusters (WHIRR-214).

Monday 9 May 2011

Do Donors Choose Local Schools?

DonorsChoose.org is a site where people donate money to school projects. For example, a teacher in Iowa might create a project request for some beanbags to create a reading area for her pupils. Then, via the website, donors can give as much or as little as they like to the project, and once the target is reached DonorsChoose purchase and deliver the beanbags to the school.

DonorsChoose are running a contest. They have opened up their data, and are challenging developers to "make discoveries and build apps that improve education in America".

I thought I'd do a little hack to answer the question "Do donors tend to choose their local schools?"

I wrote a short Python program to calculate the distance between each donor's address (where it was provided) and the address of the school for the project they were donating to. Then, using R, I plotted the following histogram:

[Histogram of the distances between donors and the schools they donated to]

It's striking that many donors are local. In fact, in my analysis, one in four donors live within four miles of the school they are donating to, and the median distance is 128 miles. However, there is a long tail reaching to over 5000 miles!

If we use a logarithmic scale for the y-axis (count), a couple of features jump out. The plot below is a scatter plot in which counts are bucketed by integer distance.

[Scatter plot of donation counts bucketed by integer distance, with a logarithmic y-axis]

There is a small peak at around 2500 miles, which is puzzling until you realize that this is the approximate distance between the East Coast and West Coast of the USA, where the majority of the population is located. I'm guessing that this bump corresponds to people who donate to schools of friends and relatives on the other coast.

The other noticeable feature is the significant drop-off after 2500 miles. The small number of donations beyond that distance are cases where the donor or the school is located in one of the non-contiguous states (Alaska and Hawaii), which have only a small fraction of the total population.

How I produced the images

I wrote a Python program to parse the CSV data from DonorsChoose. It reads two data files - the projects file and the donations file - and joins them on the project ID field, which gives access to the school's ZIP code (from the projects file) and the partial ZIP code of the donor (from the donations file). The donor's ZIP code is optional (it was present in only 46% of donations, so the results are restricted to that subset). Also, for privacy reasons, only the first 3 digits of the donor's ZIP code are provided by DonorsChoose, which makes the distance measurements less accurate, particularly for local donors.

When the partial ZIP code matched the first three digits of the school's ZIP code, I set the distance to zero, on the assumption that the donor lives close to the school. This assumption tends to overcount the zero-distance case and undercount small distances.

If the partial ZIP code did not match the school's ZIP code, I chose a full ZIP code with that prefix at random and calculated the distance between it and the school's ZIP code. For this calculation I used Kevin T. Ryan's Python code at ActiveState, which I modified slightly to support partial ZIP codes.

The program buckets integer distances and writes the counts to a file. I then used R to plot the distributions shown above.

I've put all my code into a GitHub repository.

This hack just scratches the surface of the dataset, and I look forward to seeing some of the cool things that others do in this contest. The closing date is June 30, 2011.

Saturday 16 April 2011

Whirr in 5 Minutes

A couple of days ago I wrote down a sequence of command lines to install Apache Whirr (an incubator project for running distributed systems on various cloud providers) and run a service from scratch. You just need Java, SSH, and some cloud credentials (Amazon EC2 in this case). I've reproduced the commands here:

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
curl -O http://www.apache.org/dist/incubator/whirr/whirr-0.4.0-incubating/whirr-0.4.0-incubating.tar.gz
tar zxf whirr-0.4.0-incubating.tar.gz; cd whirr-0.4.0-incubating
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
bin/whirr launch-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr

At this point you should have a 3-node ZooKeeper cluster running, which is easily checked with:

echo "ruok" | nc $(awk '{print $3}' ~/.whirr/zookeeper/instances | head -1) 2181; echo

You can shut down the cluster with the following command:

bin/whirr destroy-cluster --config recipes/zookeeper-ec2.properties

There are recipes for more services in the Whirr download package, and more detailed instructions in the Quick Start Guide.