Hadoop MapReduce – Word Count

Build & Run Example Code


 1. Install Maven
– install the Maven build tool from the package repository using apt-get

sudo apt-get install maven

 2. Get the example source code using wget

wget https://s3.amazonaws.com/hadoopkr/source.tar.gz
tar xzf source.tar.gz   # extract before building

 3. Build the source with mvn (run from the directory containing pom.xml)

cd /home/<user>/source/<where pom.xml>
mvn package   # 'package' (rather than 'compile' alone) also builds the jar used in step 5

 4. Upload a local file to HDFS

hadoop fs -copyFromLocal README.txt /

 5. Execute on Hadoop

hadoop jar <jar file name> wordcount /README.txt /output_wordcount

The word counts are written to HDFS under /output_wordcount (as part-r-NNNNN files).

About the MapReduce Code


 1. Hadoop In/Out

k1,v1 => mapper => k2,v2 => reducer => k3,v3

Mapper<LongWritable, Text, Text, LongWritable>
Reducer<Text, LongWritable, Text, LongWritable>
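
Reading the generics against the flow above, the type parameters line up like this (a reading aid only):

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
//    k1 = LongWritable (byte offset), v1 = Text (line of input)
//    k2 = Text (word),                v2 = LongWritable (count of 1)
// Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
//    k2 = Text (word), v2 = LongWritable (counts) => k3 = Text (word), v3 = LongWritable (sum)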

 (A) Plain Text => Key/Value

If the input is already key/value data, this step is not needed; a good example of key/value data is CSV (comma-separated values).
When the input is plain text, Hadoop attaches a key automatically (the key itself carries no meaning):
key = byte offset of the line within the file / value = the text of the line

[Plain Text]
Say the example below is the plain-text input:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;

[Key/Value]
Converting the plain text to key/value form gives a result like the one below
key: byte offset of the line / value: the line's text contents

0, import org.apache.hadoop.fs.Path;
50, import org.apache.hadoop.conf.*;
90, import org.apache.hadoop.io.*;

 (B) Mapper:  Key(offset)/Value(line)  =>   Key(word)/Value(count)

The mapper is called once per line of input (its input is the byte offset and the line's text; its output is key = word, value = count).
In the mapper we specify the delimiters to split on (comma, space, tab, plus any characters we want stripped; whitespace is the default delimiter set).

[set delimeters]

new StringTokenizer(line, "\t\r\n\f |,.()<>");

[set word to lower case]

word.set(tokenizer.nextToken().toLowerCase());
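
As a quick standalone check (a sketch for illustration, not part of the example project), this is what that delimiter set does to one line of the sample input; note that ';' is not in the set, so it stays attached to the last token:

[tokenizer demo]

import java.util.StringTokenizer;

public class TokenizerDemo {  // hypothetical helper class, illustration only
    public static void main(String[] args) {
        StringTokenizer t = new StringTokenizer("import org.apache.hadoop.fs.Path;", "\t\r\n\f |,.()<>");
        while (t.hasMoreTokens()) {
            System.out.println(t.nextToken().toLowerCase());
        }
        // prints: import  org  apache  hadoop  fs  path;
    }
}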

[mapper class]

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

 public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final static LongWritable one = new LongWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // split the line on the delimiter set and emit (word, 1) for each token
        StringTokenizer tokenizer = new StringTokenizer(line, "\t\r\n\f |,.()<>");
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken().toLowerCase());
            context.write(word, one);
        }
    }
 }
 // MyReducer (below) is also a static nested class of WordCount

The output will look like the example below:

[IN – data example]

0, import org.apache.hadoop.fs.Path;
50, import org.apache.hadoop.conf.*;
90, import org.apache.hadoop.io.*;

[OUT – data example]

import, 1
org, 1
apache, 1
hadoop, 1

 (C) Reducer:  Key(word)/Value(list of counts)  =>   Key(word)/Value(sum)

Between the map and reduce phases, Hadoop shuffles and sorts the mapper output, grouping all values that share a key.
The reducer then sums the counts for each word; it is called once per distinct word.

[IN – data example]

import, [1,1,1,1,1]
org, [1,1,1]
apache, [1,1]
hadoop, [1,1,1,1]

[OUT – data example]

import, 5
org, 3
apache, 2
hadoop, 4

[reducer code]

public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
   private LongWritable sumWritable = new LongWritable();

   @Override
   public void reduce(Text key, Iterable<LongWritable> values, Context context)
     throws IOException, InterruptedException {
       // add up every count emitted for this word and write (word, total)
       long sum = 0;
       for (LongWritable val : values) {
           sum += val.get();
       }
       sumWritable.set(sum);
       context.write(key, sumWritable);
   }
}
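
The tutorial never shows the driver that wires MyMapper and MyReducer into a job. Below is a minimal sketch of one (assumed, not taken from the source tarball), written as the main() of WordCount and assuming the input and output paths arrive as program arguments, matching the hadoop jar command in step 5:

[driver – minimal sketch]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount");
    job.setJarByClass(WordCount.class);           // locate the jar that contains this class
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);            // k3 = word
    job.setOutputValueClass(LongWritable.class);  // v3 = total count
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /README.txt
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output_wordcount (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
 }
} // closes class WordCount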

