Hadoop Streaming with PHP

I’ve started my journey with Hadoop, and the first thing I wanted to try was Hadoop Streaming, which lets me write the mapper and reducer as PHP programs that read from stdin and write to stdout.

The first thing I did was set up an alias:

alias stream='/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.18.3-streaming.jar'


Next, I created a scripts directory under my $HADOOP_HOME (/usr/local/hadoop) to hold the mapper and reducer.
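
Since the scripts start with a #! line and are passed to the job by path, they presumably need to be executable. A minimal sketch of that setup step in shell; `install_stream_script` is a hypothetical helper name of my own, not something Hadoop provides, and the default directory is simply the one used in this post:

```shell
# install_stream_script SRC [DEST_DIR]
# Hypothetical helper (not part of Hadoop): copies a mapper or reducer
# script into the scripts directory and marks it executable.
install_stream_script() {
  src=$1
  # Default to $HADOOP_HOME/scripts, falling back to the path from this post.
  dest_dir=${2:-${HADOOP_HOME:-/usr/local/hadoop}/scripts}
  mkdir -p "$dest_dir" || return 1
  cp "$src" "$dest_dir/" || return 1
  chmod +x "$dest_dir/$(basename "$src")"
}

# Example (assumes wc_mapper.php is in the current directory):
#   install_stream_script wc_mapper.php
```

Running it once per script (wc_mapper.php and wc_reducer.php) should leave both in place and executable.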

wc_mapper.php

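#!/usr/bin/php
<?php
  // PHP CLI can print notices to stdout, which Hadoop would read as mapper
  // output, so error reporting stays off.
  error_reporting(0);
  $in = fopen("php://stdin", "r");
  $results = array();
  // Compare against false so a line that PHP treats as falsy (such as a
  // final "0" with no newline) does not end the loop early.
  while ( ($line = fgets($in)) !== false )
  {
    // Split on non-word characters and drop empty pieces.
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word)
    {
      // Start each new word at zero instead of incrementing an undefined key.
      if (!isset($results[$word]))
        $results[$word] = 0;
      $results[$word] += 1;
    }
  }
  fclose($in);
  // Emit one "word<TAB>count" line per distinct word seen by this mapper.
  foreach ($results as $key => $value)
    print "$key\t$value\n";
?>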

wc_reducer.php

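#!/usr/bin/php
<?php
  // PHP CLI can print notices to stdout, which would end up in the job
  // output, so error reporting stays off.
  error_reporting(0);
  $in = fopen("php://stdin", "r");
  $results = array();
  while ( ($line = fgets($in)) !== false )
  {
    // Each input line is expected to look like "word<TAB>count".
    $parts = explode("\t", trim($line), 2);
    if (count($parts) < 2)
      continue; // skip blank or malformed lines
    list($key, $value) = $parts;
    if (!isset($results[$key]))
      $results[$key] = 0;
    $results[$key] += (int) $value;
  }
  fclose($in);
  // Sort by word so the output order is stable.
  ksort($results);
  foreach ($results as $key => $value)
    print "$key\t$value\n";
?>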

To execute:

stream -input conf -output output4 -mapper /usr/local/hadoop/scripts/wc_mapper.php -reducer /usr/local/hadoop/scripts/wc_reducer.php
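
Before submitting a job, it can help to try the two scripts on a single machine. The sketch below roughly imitates the streaming data flow (input lines go to the mapper, the mapper output is sorted by key, and the sorted lines go to the reducer). `local_stream` is a hypothetical helper of my own, not a Hadoop command, and `sample.txt` is a placeholder file name. It skips everything Hadoop adds on top, such as splitting the input across several mappers.

```shell
# local_stream MAPPER REDUCER [INPUT_FILE...]
# Hypothetical helper (not part of Hadoop): runs MAPPER and REDUCER as a
# plain pipeline. MAPPER and REDUCER are executable paths or shell function
# names; with no input files, the input is read from stdin.
local_stream() {
  mapper=$1
  reducer=$2
  shift 2
  # LC_ALL=C gives a byte-order sort, so identical keys end up adjacent.
  cat "$@" | "$mapper" | LC_ALL=C sort | "$reducer"
}

# Example (assumes the scripts from this post and a local sample.txt):
#   local_stream /usr/local/hadoop/scripts/wc_mapper.php \
#                /usr/local/hadoop/scripts/wc_reducer.php sample.txt
```

If the local run prints sensible word<TAB>count lines, the same two paths can be handed to -mapper and -reducer as in the command above.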

I’ll come back later and document this properly. I just wanted to get the initial version recorded.
