Hadoop and Distributed Computing at Yahoo!

Trademark

Hadoop is a trademark of the Apache Software Foundation.

Apache Hadoop Wins Terabyte Sort Benchmark

July 2, 2008

One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, beating the previous record of 297 seconds in the annual general-purpose (Daytona) terabyte sort benchmark. The sort benchmark, which was created in 1998 by Jim Gray, specifies the input data (10 billion 100-byte records), which must be completely sorted and written to disk. This is the first time that either a Java or an open source program has won. Yahoo is both the largest user of Hadoop, with 13,000+ nodes running hundreds of thousands of jobs a month, and the largest contributor, although non-Yahoo usage and contributions are increasing rapidly.
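To put those numbers in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the figures quoted above; the variable names are illustrative, and the result is an aggregate rate over the whole run, not a measured per-node figure.

```python
# Figures quoted in the post.
RECORDS = 10_000_000_000   # 10 billion records
RECORD_BYTES = 100         # bytes per record
NEW_SECONDS = 209          # this Hadoop run
OLD_SECONDS = 297          # previous record

total_bytes = RECORDS * RECORD_BYTES               # 10**12 bytes, i.e. 1 TB (decimal)
throughput_gb_s = total_bytes / NEW_SECONDS / 1e9  # aggregate GB sorted per second
speedup = OLD_SECONDS / NEW_SECONDS                # ratio of old time to new time

print(f"input size: {total_bytes:,} bytes")
print(f"aggregate sort rate: {throughput_gb_s:.2f} GB/s")
print(f"speedup over previous record: {speedup:.2f}x")
```

That works out to roughly 4.78 GB sorted per second across the cluster, about 1.42 times faster than the previous record.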

The cluster statistics were:

The benchmark was run with Hadoop trunk (pre-0.18) with a couple of optimization patches to remove intermediate writes to disk. The sort used 1800 maps and 1800 reduces and allocated enough memory to buffers to hold the intermediate data in memory. All of the code for the benchmark has been checked in as a Hadoop example.
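The post does not say how the benchmark code divides keys among the 1800 reduces. One common way for a distributed sort to produce a totally ordered result is range partitioning: sample the input, pick boundary keys, and send each record to the reduce that owns its key range. The sketch below is a single-process Python illustration of that idea under that assumption. All function names are hypothetical, and it is not the checked-in Hadoop example.

```python
import bisect
import random


def choose_split_points(sample, num_reducers):
    """Pick num_reducers - 1 boundary keys from a sample of the input."""
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def partition(key, split_points):
    """Return the index of the reducer whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


def distributed_sort(records, num_reducers, sample_size=100, seed=0):
    """Simulate a range-partitioned sort in one process."""
    if not records:
        return []
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    splits = choose_split_points(sample, num_reducers)

    # "Map" side: route every record to a bucket by key range.
    buckets = [[] for _ in range(num_reducers)]
    for rec in records:
        buckets[partition(rec, splits)].append(rec)

    # "Reduce" side: each bucket is sorted independently. Because the
    # bucket ranges do not overlap, concatenating them in order gives
    # a fully sorted result.
    return [rec for bucket in buckets for rec in sorted(bucket)]


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.randrange(1_000_000) for _ in range(1000)]
    print(distributed_sort(data, num_reducers=8) == sorted(data))
```

The quality of the sample matters in practice: badly chosen boundaries leave some reduces with far more data than others, and the slowest one determines the finish time.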

Owen O'Malley
Yahoo! Grid Computing Team

Hadoop 0.17 Preview

April 28, 2008

Apache Hadoop 0.17 is due for release any day now. Feature freeze for the release was on April 4th. The Hadoop dev community is actively fixing blocking issues discovered by users who have tried it out. This is a release we’re very excited about, as it introduces many long-awaited performance fixes to the platform. We’ve observed on the order of 30%(!) improvement in the runtime of some of the Hadoop benchmarks. As always, user feedback is invaluable, and we urge folks to kick the tires on the release and help close it out. Here is a quick rundown of the important changes in the release.

HDFS

 

Map/Reduce

 

Sameer Paranjpye
Yahoo! Grid Computing Team

VIM Color Syntax Highlighting for Pig

April 25, 2008

I joined the Yahoo! Research Engineering group a few weeks ago, and I was blown away by the possibilities that Hadoop and Pig open up for me. Immediately, I wanted to hack up something good to say thank you to all the smart people who build and support such great software.

I am convinced that Pig deserves more respect from the major text editors, so I wrote a small vim script that adds syntax highlighting for Pig files.

[Image: pig in vim]

You can download it from the vim.org site.

To install, follow the instructions on the web page, and don't forget to vote! :-)

An Emacs version is coming soon (yes, I use both vim *and* emacs). It will be my project for the upcoming Yahoo! Hack Day.

Sergiy Matusevych
Yahoo! Research Engineer

Hadoop Summit Slides and Video Available

April 18, 2008

It's been a few weeks since the Hadoop Summit in Santa Clara, and we hope everyone had a good time and learned a lot. Feedback has been quite good so far, but don't be shy about sending us comments.

The Yahoo! Research team has assembled a single page containing links to all the presentation slides and video from both the Hadoop Summit and the Data Intensive Computing Symposium.

As a sample, here's the opening presentation that Doug and Eric gave:

Update: Videos are currently unavailable outside of Yahoo! We're working on the problem...

More Hadoop Summit Seats Available! New Venue too.

March 12, 2008

To say that we've been surprised by the interest in attending the Hadoop Summit would be an understatement. We already expanded the capacity once and that filled up in a matter of hours. And that pretty much maxed out the event budget and parking too.

So last week when our friends at Amazon Web Services got in touch to see if they could help, we started working on a plan to make the event even larger while still keeping it free. Before long, we'd hatched a plan that involved moving off-site to a nearby venue, more food, more T-shirts, and some minor schedule tweaking.

But most importantly, we have room for about 75 more people!

As of now, we've increased the capacity of the event on Upcoming.org. If you've been watching and waiting to get on the list of attendees, the time is now.

The new venue is the Network Meeting Center, located in Santa Clara, California.

Thanks to Amazon.com for buying everyone lunch during the summit. :-)

We'll be updating the agenda soon to include Jinesh from Amazon who will discuss GrepTheWeb - Hadoop on AWS. As you may know, Amazon has many customers running Hadoop on EC2.

If you cannot attend, we're still planning to record all the talks and put them on-line within a week after the summit date.

See you at the summit!

See Also: Hadoop Summit also scaled on-demand! on the Amazon Web Services blog.

Jeremy Zawodny
Yahoo! Developer Network

An Introduction to ZooKeeper Video

March 7, 2008

A few weeks ago, I had the chance to capture video of a presentation given by Benjamin Reed from Yahoo! Research. His presentation was an introduction to ZooKeeper, a highly available and reliable coordination system built by Yahoo! Research and released under the Apache License, Version 2.0.

While preparing to post the video, I asked Ben for a summary of the motivations for building ZooKeeper. Here's what he had to say:

In 2006 we were building distributed applications that needed a master, aka coordinator, aka controller to manage the sub processes of the applications. It was a scenario that we had encountered before and something that we saw repeated over and over again inside and outside of Yahoo!.
For example, we have an application that consists of a bunch of processes. Each process needs to be aware of the other processes in the system. The processes need to know how requests are partitioned among them. They need to be aware of configuration changes and failures. Generally an application-specific central control process manages these needs, but these control programs are specific to their applications and thus represent a recurring development cost for each distributed application. Because each control program is rewritten, it doesn't get the investment of development time needed to become truly robust, making it an unreliable single point of failure.
We developed ZooKeeper to be a generic coordination service that can be used in a variety of applications. The API consists of fewer than a dozen functions and mimics the familiar file system API. Because it is used by many applications, we can spend time making it robust and resilient to server failures. We also designed it to have good performance so that it can be used extensively by applications to do fine-grained coordination.
We have found ZooKeeper to be applicable to many distributed applications inside of Yahoo! and expect it to be applicable to many more outside of Yahoo! For that reason we released it as open source under the Apache license. If you are writing a distributed application, ZooKeeper can help.
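To make the "file system API" idea concrete, here is a toy, single-process Python sketch of a hierarchical namespace in which each path holds a small piece of data. It does not use the ZooKeeper client library. The class and method names are hypothetical, chosen only to echo file-system-style operations, and it leaves out everything that makes a real coordination service useful (replication, failure handling, notifications).

```python
class ToyCoordinationTree:
    """In-memory stand-in for a tree of small, path-addressed data nodes."""

    def __init__(self):
        self._nodes = {"/": b""}  # maps path -> data; the root always exists

    @staticmethod
    def _parent(path):
        head = path.rsplit("/", 1)[0]
        return head or "/"

    def create(self, path, data=b""):
        if path in self._nodes:
            raise KeyError(f"node already exists: {path}")
        if self._parent(path) not in self._nodes:
            raise KeyError(f"parent does not exist for: {path}")
        self._nodes[path] = data

    def delete(self, path):
        if self.get_children(path):
            raise KeyError(f"node has children: {path}")
        del self._nodes[path]

    def exists(self, path):
        return path in self._nodes

    def get_data(self, path):
        return self._nodes[path]

    def set_data(self, path, data):
        if path not in self._nodes:
            raise KeyError(f"no such node: {path}")
        self._nodes[path] = data

    def get_children(self, path):
        prefix = path.rstrip("/") + "/"
        names = []
        for p in self._nodes:
            rest = p[len(prefix):]
            # A direct child starts with the prefix and has no further "/".
            if p != path and p.startswith(prefix) and "/" not in rest:
                names.append(rest)
        return sorted(names)


if __name__ == "__main__":
    tree = ToyCoordinationTree()
    tree.create("/app")
    tree.create("/app/workers")
    tree.create("/app/workers/w1", b"host1:9000")
    tree.create("/app/workers/w2", b"host2:9000")
    print(tree.get_children("/app/workers"))  # each worker can see its peers
    tree.delete("/app/workers/w1")            # simulate a worker leaving
    print(tree.get_children("/app/workers"))
```

In this sketch, group membership is just a directory listing: processes register themselves as children of a shared path, and any process can list that path to learn who else is present.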

And here's the video...

download (m4v)

A PDF copy of the slides is available too.

While filming his one-hour presentation, I found myself really wishing that ZooKeeper had been available 6 or 7 years ago, when I was struggling with how to perform distributed processing of news feeds for Yahoo! Finance. ZooKeeper is clearly a more elegant solution than the hack we put together!

Ben will be speaking about ZooKeeper later this month at the Hadoop Summit.

More videos are available on YDN Theater.

Jeremy Zawodny
Yahoo! Developer Network

Comments (0)