Holy Smokes, Hadoop works with S3 directly!

bin/hadoop fs -put /path/to/source s3://<S3ID>:<S3SECRET>@<BUCKET>/path/to/destination

This is so cool. I’m guessing that I could also use S3 as my input or output directory for Map/Reduce jobs.

For example:

/usr/local/hadoop/bin/hadoop jar \
  /usr/local/hadoop/contrib/streaming/hadoop-0.18.3-streaming.jar \
  -input s3://<S3ID>:<S3SECRET>@<BUCKET>/conf \
  -output s3://<S3ID>:<S3SECRET>@<BUCKET>/conf-wc_output \
  -mapper /usr/local/hadoop/scripts/wc_mapper.php \
  -reducer /usr/local/hadoop/scripts/wc_reducer.php

And, yes, it works. I can cat the results:

bin/hadoop fs -cat s3://<S3ID>:<S3SECRET>@<BUCKET>/conf-wc_output/part*
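
Typing the credentials into every URI gets old fast. Here is a minimal sketch of a shell helper that builds the same s3:// URI form used in the commands above. The function name s3_uri is hypothetical, and it assumes S3ID, S3SECRET and BUCKET are set as shell variables; none of this comes from Hadoop itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): build an s3:// URI in the
# s3://<S3ID>:<S3SECRET>@<BUCKET>/path form used in this post.
# Assumes S3ID, S3SECRET and BUCKET are already set in the shell.
s3_uri() {
  # $1 is the path inside the bucket, without a leading slash
  printf 's3://%s:%s@%s/%s\n' "$S3ID" "$S3SECRET" "$BUCKET" "$1"
}
```

With that defined, the cat above could be written as bin/hadoop fs -cat "$(s3_uri 'conf-wc_output/part*')". The quotes keep the local shell from touching the glob, so Hadoop gets to expand it.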

I did find a bug, though:

bin/hadoop fs -ls 's3://<S3ID>:<S3SECRET>@<BUCKET>/'
Found 3 items
drwxrwxrwx   - ls: -0s
Usage: java FsShell [-ls <path>]

… the “ls” command doesn’t seem to work right against S3 directories. It found 3 items, which is right, but it doesn’t list them correctly.

Anyway, my eyes are getting big. The first thing that pops into my mind is using EC2 to spin up Hadoop clusters to work on vast amounts of data stored on S3. Well, Amazon already had this idea: http://aws.amazon.com/elasticmapreduce/
