You may have seen the article Hadoop Example, AccessLogCountByHourOfDay. That is a distributed computing solution, using Hadoop. The purpose of this article is to dive into the theory behind it.

To understand the power of distributed computing, we need to step back and understand the problem. First we’ll look at a command-line Java program that will process each HTTP log file, one file at a time, one line at a time, until done. To speed up the job, we’ll then look at another approach: multi-threaded. We should be able to get the job done faster if we break the job up into a set of sub tasks and run them in parallel. Then we’ll come to Hadoop and distributed computing: the same concept of breaking the job up into a set of sub tasks, but rather than running on one server, we’ll run on multiple servers in parallel.

At first you’d think that Hadoop would be the fastest, but in our basic example, you’ll see that Hadoop isn’t significantly faster. Why? The Hadoop overhead of scheduling the job and tracking the tasks is slowing us down. In order to see the power of Hadoop, we need much larger data sets. Think about our single server approach for a minute. As we ramp up the size and/or number of files to process, there is going to be a point where the server will hit resource limitations (CPU, RAM, disk). If we have 4 threads making use of 4 cores of our CPU effectively, we may be able to do a job 4 times faster than single threaded. But if we have a terabyte of data to process and it takes, say, 100 seconds per GB, it’s going to take 100,000 seconds to finish (that’s more than 1 day). With Hadoop, we can scale out horizontally. What if we had a 1000-node Hadoop cluster? Suddenly the overhead of scheduling the job and tracking the tasks is minuscule in comparison to the whole job. The whole job may complete in 100 seconds or less! We went from over a day to less than 2 minutes. Wow.
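To make the arithmetic concrete, here is a rough Java sketch of the estimate. It assumes 1 TB is treated as 1000 GB and that the work splits perfectly evenly across nodes. The class name, method name, and the overheadSeconds parameter (a stand-in for the fixed cost of scheduling and tracking) are made up for illustration and are not measurements from a real cluster.

```java
// Back-of-the-envelope estimate using the figures from the paragraph above.
class ScaleOutEstimate {

    // Estimated wall-clock seconds when the data is split evenly across nodes.
    // overheadSeconds is a hypothetical fixed cost for scheduling and tracking.
    static double estimateSeconds(double totalGb, double secondsPerGb,
                                  int nodes, double overheadSeconds) {
        return (totalGb * secondsPerGb) / nodes + overheadSeconds;
    }

    public static void main(String[] args) {
        double totalGb = 1000;     // 1 TB, taken as 1000 GB
        double secondsPerGb = 100; // the rate assumed in the text

        // One server, no scheduling overhead.
        System.out.println(estimateSeconds(totalGb, secondsPerGb, 1, 0));    // prints 100000.0
        // 1000 nodes, overhead ignored.
        System.out.println(estimateSeconds(totalGb, secondsPerGb, 1000, 0)); // prints 100.0
    }
}
```

Plugging in a non-zero overhead shows why it dominates small jobs and nearly vanishes on large ones: a fixed cost is a big share of a job that only takes a few seconds, and a small share of one that takes 100,000.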

Please note: the single-threaded and multi-threaded examples in this article are not using the Map/Reduce algorithm. This is intentional. I’m trying to demonstrate the evolution of thought. When we think about how to solve the problem, the first thing that comes to mind is to walk through the files, one line at a time, and accumulate the result. Then, we realize we could split the job up into threads and gain some speed. The last evolution is the Map/Reduce algorithm across a distributed computing platform.
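As a rough illustration of the first two steps of that evolution, here is a minimal Java sketch. It is not the code from the full article. It assumes log lines carry a timestamp in the common Apache style, such as [10/Oct/2000:13:55:36 -0700], and it works on in-memory lists of lines rather than real files so that it stays self-contained. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class HourOfDayCount {

    // Extracts the hour (0-23) from a line with a timestamp such as
    // [10/Oct/2000:13:55:36 -0700]; returns -1 if no hour is found.
    static int hourOf(String line) {
        int bracket = line.indexOf('[');
        if (bracket < 0) return -1;
        int colon = line.indexOf(':', bracket);
        if (colon < 0 || colon + 3 > line.length()) return -1;
        try {
            int hour = Integer.parseInt(line.substring(colon + 1, colon + 3));
            return (hour >= 0 && hour <= 23) ? hour : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // Walks one "file" (a list of lines) and accumulates a count per hour.
    static long[] countFile(List<String> lines) {
        long[] counts = new long[24];
        for (String line : lines) {
            int hour = hourOf(line);
            if (hour >= 0) counts[hour]++;
        }
        return counts;
    }

    // Adds the per-hour counts in part into total.
    static void add(long[] total, long[] part) {
        for (int h = 0; h < 24; h++) total[h] += part[h];
    }

    // Step 1, single-threaded: one file at a time, one line at a time.
    static long[] countSequential(List<List<String>> files) {
        long[] total = new long[24];
        for (List<String> file : files) add(total, countFile(file));
        return total;
    }

    // Step 2, multi-threaded: one sub task per file, merged at the end.
    static long[] countParallel(List<List<String>> files, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<long[]>> parts = new ArrayList<>();
            for (List<String> file : files) {
                Callable<long[]> task = () -> countFile(file);
                parts.add(pool.submit(task));
            }
            long[] total = new long[24];
            for (Future<long[]> part : parts) add(total, part.get());
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample lines standing in for two log files.
        List<String> fileA = Arrays.asList(
            "127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] \"GET / HTTP/1.0\" 200 2326",
            "127.0.0.1 - - [10/Oct/2000:00:01:02 -0700] \"GET /a HTTP/1.0\" 200 10");
        List<String> fileB = Arrays.asList(
            "127.0.0.1 - - [11/Oct/2000:13:05:00 -0700] \"GET /b HTTP/1.0\" 404 0");
        long[] counts = countParallel(Arrays.asList(fileA, fileB), 2);
        for (int h = 0; h < 24; h++) {
            System.out.printf("%02d\t%d%n", h, counts[h]);
        }
    }
}
```

Each thread builds its own array of counts and the results are merged only at the end, so the threads never share mutable state. That per-task accumulate-then-merge shape is also a small step toward the Map/Reduce way of thinking.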

Let’s dive in….

Inspired by an article written by Tom White, AWS author and developer:
Running Hadoop MapReduce on Amazon EC2 and Amazon S3

Instead of counting by minute of the week, this one counts by hour of the day. I just find the most popular hour of the day more interesting than the most popular minute of the week. The output is:
00\t

23\t

The main reason for writing this, however, is to provide a working example that will compile. I found a number of problems in the original post.

Watching a set of 3 lectures on MapReduce, given at Google by Aaron Kimball, was inspiring to me. I feel like I have a much more solid grasp on MapReduce after watching these. I really liked how it started out with some basic functional programming and socket theory, then moved into how MapReduce builds on the basic principle of Map and Fold. Well worth the 3 hours!

http://www.youtube.com/watch?v=yjPBkvYh-ss

Thank you Google and Aaron Kimball!

Installed Hadoop on a VM, and needed to set the Java heap size lower than the default, -Xmx1000m, to get it to work. I set the HADOOP_HEAPSIZE var in the conf/hadoop-env.sh file to the lower value, but Hadoop continued to spit out this error:

# hadoop -help
Could not create the Java virtual machine.
Exception in thread "main" java.lang.NoClassDefFoundError: Could_not_reserve_enough_space_for_object_heap
Caused by: java.lang.ClassNotFoundException: Could_not_reserve_enough_space_for_object_heap
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
Could not find the main class: Could_not_reserve_enough_space_for_object_heap.  Program will exit.

It didn’t matter what I set HADOOP_HEAPSIZE to, the problem persisted. I never did find the answer online, so I figured I’d do the world a favor today and make a note about how to fix it. Maybe I’ll save someone else the 2 hours it took me to figure this out!

THE SOLUTION:
Read the rest of this entry »

bin/hadoop fs -put /path/to/source s3://<s3id>:<s3secret>@<bucket>/path/to/destination

This is so cool. I’m guessing that I could also use S3 as my input or output directory for Map/Reduce jobs.

I’ve started my journey with Hadoop, and the first thing I wanted to try was Streaming, so I could run the mapper and reducer methods with PHP programs.

The first thing I did was set up an alias:

alias stream='/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.18.3-streaming.jar'


I am doing a presentation on IPv6 at my company’s TechFest. This is a day-long event with keynote speakers and breakout sessions. The purpose of TechFest is to give the developers and engineers a break from their day-to-day activity and get a view of what’s going on around the company and in the industry.

In this article, I’m copy/pasting my slide deck and stripping out the company-specific information, making this a generic Introduction to IPv6.

The Agenda for Today:

What is IPv6? (~10 minutes)
DNS (~10 minutes)
Getting Started (~10 minutes)
Web Application Development (~10 minutes)
