Hadoop – Hbase Cluster with Docker on AWS

Hadoop – Hbase Cluster with Docker on AWS

 1. Problems of Hadoop1
– data streaming
– map process and reduce process are seperated
– job tracker manage all jobs alone (too busy)
. cannot manage resource(cpu, memory) effectively
– SPOF weakness (name node dies whole system dies)

2. Solution of Hadoop2 
– job tracker is too busy => Yarn
– namenode availability => Zookeeper

3. Hbase – Thrift – Happybase
– HDFS & Mapreduce those are not real time based service, they are batch service
– Hbase support real time big data service with Hadoop
– Thrift and Happybase help you to use it from other system (like rdb)

4. Docker
– Docker is OS free container based system
– It allow user to use 95% of original H/W capacity
– Support docker build & image, faster to install Hadoop systems

5. Problems to solve
– AWS : need to use same security groups
need to open ports to communicate (ex : 22, 50070, etc)
– Docker : need to use ssh 22 port
docker on CentOS allows only 10G

6. AWS setting
[Link : using EC2 service]

– make security group
– add instance with same security group
– add inbound rules
– open all ICMP rules

50070 , 6060, 6061, 8032, 50090, 50010, 50075, 50020, 50030, 50060, 9090, 9091, 22

– change ssh port of AWS
change AWS ssh port from 22 to something else so that Docker can use port 22 with -p 22:22 . but by doing this you should specify the port to something you change every time you try to access AWS with ssh

vi /etc/ssh/sshd_config
# find port:22 and change 
port : 3022
sudo service sshd restart

7. Docker setting
[Link : install Docker]

[Link : use Docker]

(1) Docker Build (option 1)
– downloads : [dockerfiles]
– unzip :  unzip Dockerfiles
– change name : cp Dockerfiles-Hbase Dockerfile
– copy conf : cp hadoop-2.7.2 /* .
– build : # docker build –tag=tmddno1/datastore:v1 ./

(2) Download Docker Image (option 2)

docker pull tmddno1/datastore:v1

 (3) Create Daemon Container

docker run --net=host -d <imageid>

(4) Exec Container with bash

docker exec -it <container id> bash

 (5) change sshd_config 

vi /etc/ssh/sshd_config
# change bellow

PasswordAuthentication yes
PermitRootLogin yes

/etc/init.d/ssh restart

 8. SSH Setting

[AWS pem file share]

– WinSCP Download [Download]
– upload your aws pem file on master node with using WinSCP

[AWS pem file add on ssh-agent]

eval 'ssh-agent -s'
eval $(ssh-agent) 
chmod 644 authorized_keys
chmod 400 <pem_keyname>
ssh-add <pem_keyname>

[ /etc/hosts] hadoop-master hadoop-slave-1 hadoop-slave-2

[ssh – rsa key share]

$ ssh-keygen -t rsa 
$ ssh-copy-id -i ~/.ssh/id_rsa.pub root@master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave1 
$ ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave2 
$ chmod 0600~/.ssh/authorized_keys $ exit

9. Run Hadoop


vi /hadoop-2.7.2/etc/hadoop/slaves
# set hosts defined on /etc/hosts

[modify core-site.xml]



hadoop namenode -format
start-all.sh(start-dfs.sh + start_yarn.sh)
stop-all.sh(stop-dfs.sh + stop_yarn.sh)

10. ZooKeeper



[make dummy my id file]

cd /root/zookeeper
vi 1  <== on the master
vi 2  <== on the slave 2

[start zookeepr on every server]

/zookeeper/bin/zkServer.sh start

11. Hbase Setting


export HBASE_MANAGES_ZK=false


  <value>master, slave2, slave3</value>


# hbase/conf/regionservers


[start hbase]


12. Hbase Thrift

- hbase start : start-hbase.sh
- thrift start : hbase thrift start -p <port> --infoport <port>

13. Check Site server running correctly

Yarn : http://localhost:8088
Hadoop : http://localhost:50070
Hbase : http://localhost:9095

14. Hbase Shell Test

hbase shell

15. Install happy base (on the client server)
– I will explain about happybase later 

sudo yum install gcc
pip install happybase


Hadoop MapReduce – word count (improve)

About Map Reduce Code

 1.Ordering with Map Reduce

  (A) Binary Search

we are going to make a map reduce program which return N numbers of   keywords from the top rank (ordered by number of appears)
Hadoop support beautiful sorting Library which is called PriorityQueue and by calling peek you can get keyword on the last of the pool.bellow code remove items from the queue from the last until the size of queue meets the required top number that user request

public static void insert(PriorityQueue queue, String item, Long lValue, int topN) {
  ItemFreq head = (ItemFreq)queue.peek();

  // 큐의 원소수가 topN보다 작거나 지금 들어온 빈도수가 큐내의 최소 빈도수보다 크면
  if (queue.size() < topN || head.getFreq() < lValue) {
    ItemFreq itemFreq = new ItemFreq(item, lValue);
    // 일단 큐에 추가하고 
    // 큐의 원소수가 topN보다 크면 가장 작은 원소를 제거합니다.
    // if (queue.size() > topN && head != null && head.getFreq() < lValue) {
    if (queue.size() > topN) {


Hadoop Map Reduce – word count

Build & Run Example Code

 1. download maven
– download maven build tool from site using apt-get

sudo apt-get install maven

 2. get test source code using wget

wget https://s3.amazonaws.com/hadoopkr/source.tar.gz

 3. build source with mvn

cd /home/<user>/source/<where pom.xml>
mvn compile

5. upload local file to hadoop

hadoop fs -copyFromLocal README.txt /

 6. execute on hadoop

hadoop jar <jar file name> wordcount /README.txt /output_wordcount

About Map Reduce Code

 1. Hadoop In/Out

k1,v1 => mapper => k2,v2 => reducer =>k3,v3

Mapper<LongWritable, Text, Text, LongWritable>
Reducer<Text, LongWritable, Text, LongWritable>

 (A) Plain Text => Key Value

if the input content is key/value already this process is not needed
good example of key/value type data is CSV(comma seperated value)
when the input value is plain text we need to attach key(which is meaningless)
char count / line by line content

[Plain Text]
Let’s say bellow example is a plain text

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;

Covert the plain text to key/value type result will be like bellow
key : character count / value : line by line text contents

0, import org.apache.hadoop.fs.Path;
50, import org.apache.hadoop.conf.*;
90, import org.apache.hadoop.io.*;

 (B) Mapper:  Key(count)/Value(content)  =>   Key(word)/Value(count)

Mapper called as numbers of lines of input content (input is char count/ line and output is line text) , output will be key(word), value(count).
on the mapper we need to specify the delimeter(comma, space, tab which on to use, set things you wanna remove, space is default delimeter)

[set delimeters]

new StringTokenizer(line, "\t\r\n\f |,.()<>");

[set word to lower case]


[mapper class]

public class WordCount {

 public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final static LongWritable one = new LongWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line, "\t\r\n\f |,.()<>");
        while (tokenizer.hasMoreTokens()) {
            context.write(word, one);

output will be like bellow (for example)

[IN – data example]

0, import org.apache.hadoop.fs.Path;
50, import org.apache.hadoop.conf.*;
90, import org.apache.hadoop.io.*;

[OUT – data example]

import , 1
org , 1
apache, 1
hadoop, 1

 (B) Mapper:  Key(word)/Value(iter count)  =>   Key(word)/Value(count)

what reduce do is sum the count, times of reducer called is number of words

[IN – data example]

import , [1,1,1,1,1]
org , [1,1,1]
apache, [1,1]
hadoop, [1,1,1,1]

[OUT – data example]

import , 5
org , 3
apache, 2
hadoop, 4

[reducer code]

public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
   private LongWritable sumWritable = new LongWritable();

   public void reduce(Text key, Iterable<LongWritable> values, Context context)
     throws IOException, InterruptedException {
       long sum = 0;
       for (LongWritable val : values) {
           sum += val.get();
       context.write(key, sumWritable);




Install Hadoop on Docker

  • Get Ubuntu Docker

    – docker pull ubuntu

  • Start Container
docker run -i -p 22 -p 8000:80 -m /data:/data -t <ubuntu> /bin/bash
  • Install Jdk
    sudo add-apt-repository ppa:openjdk-r/ppa  
    sudo apt-get update   
    sudo apt-get install openjdk-7-jre
  • .bashrc
    export JAVA_HOME=/usr/lib/jvm/...
    export CLASSPATH=$JAVA_HOME/lib/*:.
    export PATH=$PATH:$JAVA_HOME/bin
  • HADOOP 1.2.1 install

    download hadoop and unpack

    root@4aa2cda88fcc:/home/kim# wget http://apache.mirror.cdnetworks.com/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
    root@4aa2cda88fcc:/home/kim# mv ./hadoop-1.2.1.tar.gz /home/user
    root@4aa2cda88fcc:/home/kim# tar xvzf hadoop-1.2.1.tar.gz
  • SET Configuration

    set configuration on ~/.bashrc

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
export CLASSPATH=$JAVA_HOME/lib/*:.
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/home/kim/hadoop-1.2.1
  • set HADOOP conf
vi  /home/hadoop-1.2.1/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
vi mapred-site.xml

vi hdfs-site.xml

vi core-site.xml

# apt-get install openssh-server
# ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# vi /etc/ssh/sshd_config 
# ==> PermitRootLogin yes
service ssh restart
# ssh localhost
  • Format namenode
hadoop namenode -format
  •  start & stop Shell script
start-all.sh  (start-dfs.sh + start-mapred.sh) 

# check java apllication is running 

  • search hadoop files
hadoop fs -ls /
  • upload local file to hadoop
hadoop fs -copyFromLocal README.txt /
  • Execute word count
hadoop jar hadoop-examples-1.2.1.jar wordcount /README.txt /output_wordcount

working on find way to use “docker exec -ti containername sh"