
Hadoop reaches 1.0 and my understanding of how to use it is still in development.

Does anyone have a high-level resource on how MapReduce works for mediocre programmers like myself who are late to the game? I know she's not ready to have my babies, but surely I could get to know her a little, maybe just be friends? I grabbed a pre-made Hadoop virtual machine the other month and was so far over my head that I had to run away and regroup.

In general I have some very unoptimized problems that MapReduce probably isn't the right shoe for, but I'd love to explain to my boss why it's the wrong shoe. And learning about it might be a great start down that path.



A good introduction to MapReduce is probably CouchDB, where you use it for database views instead of SQL-style queries. The basic concepts are:

- The "Map" phase takes a key/value pair of input and produces as many other key/value pairs of output as it wants. This can be zero, it can be one, or it can be over 9000. Each Map over a piece of input data operates in isolation.

- The "Reduce" phase takes a bunch of values with the same (or similar, depending on how it's invoked) keys and reduces them down into one value.

A good example is, say you have a bunch of documents like this:

    {"type": "post",
     "text": "...",
     "tags": ["couchdb", "databases", "js"]}
And you want to find out all the tags, and how many posts have a given tag. First, you have a map phase:

    function (doc) {
      if (doc.type === "post") {
        doc.tags.forEach(function (tag) {
          emit(tag, 1);
        });
      }
    }
In this case, it filters out all the documents that aren't posts. It then emits a `(tag, 1)` pair for each tag on the post. You may end up with a pair set that looks like:

    ("c", 1)
    ("couchdb", 1)
    ("databases", 1)
    ("databases", 1)
    ("databases", 1)
    ("js", 1)
    ("js", 1)
    ("mongodb", 1)
    ("redis", 1)
Then, your reduce phase may look like:

    function (keys, values, rereduce) {
      return sum(values);
    }
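The `rereduce` argument is worth a note: the reduce function may first be run over slices of the raw pairs, then run again (with `rereduce` set to true) over the partial results. For counting, summing works correctly in both passes. Here's a plain-JS sketch of that two-pass behavior, not the real CouchDB view engine; `sum` is defined locally since it's a CouchDB built-in:

```javascript
// Plain-JS stand-in for CouchDB's built-in sum()
function sum(values) { return values.reduce(function (a, b) { return a + b; }, 0); }

function reduceFn(keys, values, rereduce) {
  // For counting, summing raw 1s and summing partial sums both work
  return sum(values);
}

// First pass: reduce two slices of ("databases", 1) style values...
var partial1 = reduceFn(null, [1, 1], false);
var partial2 = reduceFn(null, [1], false);

// ...rereduce pass: combine the partial results
var total = reduceFn(null, [partial1, partial2], true);
console.log(total); // 3
```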
Though the kinds of results you get out of it depend on how you invoke it. If you just reduce the whole dataset, for example, you get:

    (null, 9)
Because that's the sum of the values from all the pairs. On the other hand, running it in group mode will reduce each key separately, so you get this:

    ("c", 1)
    ("couchdb", 1)
    ("databases", 3)
    ("js", 2)
    ("mongodb", 1)
    ("redis", 1)
Since there were three pairs keyed "databases", the value for "databases" in the output is 3. You're not limited to summing - any operation that aggregates multiple values and can be grouped by key works as well.
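The whole flow above can be simulated in a few lines of plain JavaScript (an illustrative in-memory sketch, not CouchDB itself; `mapAll` and `groupReduce` are made-up names):

```javascript
// Sample documents, including one non-post that the map filters out
var docs = [
  {type: "post", tags: ["couchdb", "databases", "js"]},
  {type: "post", tags: ["databases", "js", "c"]},
  {type: "post", tags: ["databases", "mongodb", "redis"]},
  {type: "comment", tags: ["ignored"]}
];

// Run the map function over every doc; emit() just collects (key, value) pairs
function mapAll(docs, mapFn) {
  var pairs = [];
  docs.forEach(function (doc) {
    mapFn(doc, function (k, v) { pairs.push([k, v]); });
  });
  return pairs;
}

var pairs = mapAll(docs, function (doc, emit) {
  if (doc.type === "post") {
    doc.tags.forEach(function (tag) { emit(tag, 1); });
  }
});

// Group-mode reduce: sum the values for each distinct key
function groupReduce(pairs) {
  var out = {};
  pairs.forEach(function (pair) {
    out[pair[0]] = (out[pair[0]] || 0) + pair[1];
  });
  return out;
}

console.log(groupReduce(pairs));
// { couchdb: 1, databases: 3, js: 2, c: 1, mongodb: 1, redis: 1 }
```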

Like you said, there are problems that this doesn't work for. But for the problems it does work for, it's very computationally efficient and fun.


I have a question. I have read somewhere that MapReduce can leverage parallelism. If I map a function over an array, every element is mapped independently, so the calls can run in parallel because they have no dependency on each other. But how does reduce leverage parallelism? As far as I understand, the output of the reduce function depends on the previous value.


In principle, reductions can often be staged, since there are no ordering requirements. Imagine a tree of reductions. But you are correct: the reduce phase is what will limit parallelism. If you have a cheap map operation but a really expensive reduction, you may not see much scalability. (Where "scalable" means "performance improves as available hardware increases, because more of the parallelism inherent in the application can be exploited.")
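A tree of reductions can be sketched like this, assuming the combining operation is associative (like sum). Each round combines adjacent pairs of partial results, and everything within a round is independent, so it could run in parallel; `treeReduce` is an illustrative name, not a library function:

```javascript
// Staged ("tree") reduction: combine adjacent pairs until one value remains.
// A list of n values takes ceil(log2(n)) rounds instead of n - 1 sequential steps.
function treeReduce(values, combine) {
  var level = values.slice();
  while (level.length > 1) {
    var next = [];
    for (var i = 0; i < level.length; i += 2) {
      // Combine a pair; an unpaired last element carries over to the next round
      next.push(i + 1 < level.length ? combine(level[i], level[i + 1]) : level[i]);
    }
    level = next; // each round's combines are independent of each other
  }
  return level[0];
}

// The nine (tag, 1) pairs from the example upthread reduce to 9 in four rounds
console.log(treeReduce([1, 1, 1, 1, 1, 1, 1, 1, 1], function (a, b) { return a + b; })); // 9
```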


Hadoop is much more than MapReduce. It's a fault tolerant, highly scalable file system where you can store a shitload (technical term) of data. That alone is remarkable.

Once your data is there, then you get your map/reduce on. And the best way to get started there is to look into Pig or Hive (high level map/reduce abstractions). Either of those will take you a long way.


Don't write raw MapReduce. Check out my recent tutorials at http://datasyndrome.com and look into Apache Pig, which you can understand easily. It will generate the MapReduce for you. Try Amazon Elastic MapReduce with Pig on some logs.


I found "Hadoop: The Definitive Guide" to be excellent (hah, I just noticed there's a quote from me on Twitter on the book's homepage): http://www.hadoopbook.com/


The Hadoop and HBase books from O'Reilly Media are quite good. They gave me a good overview and got me up to speed enough that I was comfortable stepping into a new personal project using them.




