Build & Run Example Code
1. Install Maven
– install the Maven build tool with apt-get
sudo apt-get install maven
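You can confirm the installation afterwards; mvn -version prints the Maven and Java versions in use.
mvn -version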
2. Get the test source code with wget
wget https://s3.amazonaws.com/hadoopkr/source.tar.gz
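The archive then needs to be extracted; this assumes it unpacks into a source/ directory, as the build step below suggests.
tar -xzf source.tar.gz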
3. Build the source with mvn
cd /home/<user>/source/<where pom.xml>
mvn compile
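Note that mvn compile only compiles the classes. To produce the jar file used in step 5, you would normally run the package goal instead (assuming the project's pom.xml is set up to build a jar):
mvn package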
4. Upload a local file to HDFS
hadoop fs -copyFromLocal README.txt /
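You can verify the upload by listing the HDFS root directory:
hadoop fs -ls /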
5. Run the job on Hadoop
hadoop jar <jar file name> wordcount /README.txt /output_wordcount
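Once the job finishes, the word counts can be read directly from HDFS. Reducer output is normally written as part-r-NNNNN files under the output directory, so with a single reducer:
hadoop fs -cat /output_wordcount/part-r-00000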
About the MapReduce Code
1. Hadoop In/Out
(k1, v1) => mapper => (k2, v2) => reducer => (k3, v3)
Mapper<LongWritable, Text, Text, LongWritable>
Reducer<Text, LongWritable, Text, LongWritable>
(A) Plain Text => Key/Value
If the input is already in key/value form, this step is not needed.
A good example of key/value data is CSV (comma-separated values).
When the input is plain text, a key has to be attached to each line; Hadoop does this automatically, using the byte offset at which the line starts (the key itself carries no meaning for the word count).
key: byte offset of the line / value: the line's text content
[Plain Text]
Let's say the example below is the plain-text input:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
[Key/Value]
Converting the plain text to key/value form gives the result below
key: byte offset of the line / value: the line's text content
0, import org.apache.hadoop.fs.Path;
50, import org.apache.hadoop.conf.*;
90, import org.apache.hadoop.io.*;
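This conversion is handled by Hadoop's TextInputFormat, the default input format: each record handed to the mapper has the line's starting byte offset as a LongWritable key and the line itself as a Text value. If you wanted to set it explicitly, the job driver fragment would look like the line below (job being an org.apache.hadoop.mapreduce.Job; TextInputFormat comes from org.apache.hadoop.mapreduce.lib.input).
[input format]
job.setInputFormatClass(TextInputFormat.class);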
(B) Mapper: Key(offset)/Value(line) => Key(word)/Value(count)
The mapper is called once per line of input (the input key is the byte offset and the value is the line's text); the output is key(word), value(count).
In the mapper we specify the delimiters to split on (comma, space, tab, and any other characters to strip out; whitespace is the default delimiter of StringTokenizer).
[set delimiters]
new StringTokenizer(line, "\t\r\n\f |,.()<>");
[set word to lower case]
word.set(tokenizer.nextToken().toLowerCase());
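To see what these two fragments produce together, here is a minimal standalone check, independent of Hadoop (the sample line is taken from the input example above):
import java.util.StringTokenizer;

public class TokenizerDemo {
    public static void main(String[] args) {
        String line = "import org.apache.hadoop.fs.Path;";
        // same delimiter set as the mapper: whitespace plus | , . ( ) < >
        StringTokenizer tokenizer = new StringTokenizer(line, "\t\r\n\f |,.()<>");
        while (tokenizer.hasMoreTokens()) {
            System.out.println(tokenizer.nextToken().toLowerCase());
        }
        // prints: import, org, apache, hadoop, fs, path;
        // (the trailing ';' survives because it is not in the delimiter set)
    }
}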
[mapper class]
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final static LongWritable one = new LongWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line, "\t\r\n\f |,.()<>");
            while (tokenizer.hasMoreTokens()) {
                // emit (word, 1) for every token in the line
                word.set(tokenizer.nextToken().toLowerCase());
                context.write(word, one);
            }
        }
    }
    // (class WordCount continues with the reducer below)
The mapper output will look like the example below.
[IN – data example]
0, import org.apache.hadoop.fs.Path;
50, import org.apache.hadoop.conf.*;
90, import org.apache.hadoop.io.*;
[OUT – data example]
import, 1
org, 1
apache, 1
hadoop, 1
(C) Reducer: Key(word)/Value(list of counts) => Key(word)/Value(sum)
Between map and reduce, Hadoop's shuffle/sort phase groups all values that share the same key, so the reducer receives each word together with the list of its counts. What the reducer does is sum that list; it is called once per distinct word.
[IN – data example]
import, [1,1,1,1,1]
org, [1,1,1]
apache, [1,1]
hadoop, [1,1,1,1]
[OUT – data example]
import, 5
org, 3
apache, 2
hadoop, 4
[reducer code]
    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        private LongWritable sumWritable = new LongWritable();

        public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            // sum every count emitted for this word
            long sum = 0;
            for (LongWritable val : values) {
                sum += val.get();
            }
            sumWritable.set(sum);
            context.write(key, sumWritable);
        }
    }
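The snippets above never show how the job itself is assembled. A minimal driver sketch that would complete the WordCount class is given below; it matches the mapper/reducer types above, but the actual driver in the downloaded source may be wired differently (step 5's wordcount argument suggests it goes through a program-name dispatcher), so treat this as an assumption rather than the project's real code.
[driver sketch – assumed, not taken from the downloaded source]
    // additional imports needed at the top of WordCount.java:
    //   org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path,
    //   org.apache.hadoop.mapreduce.Job,
    //   org.apache.hadoop.mapreduce.lib.input.FileInputFormat,
    //   org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);           // k3: the word
        job.setOutputValueClass(LongWritable.class); // v3: the total count
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /README.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output_wordcount
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
} // closes class WordCount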