Hadoop – Hbase Cluster with Docker on AWS

1. Problems of Hadoop 1
– only batch processing, no real-time data streaming
– the map phase and the reduce phase are strictly separated
– the JobTracker manages all jobs alone (it becomes too busy)
  . it cannot manage resources (CPU, memory) effectively
– SPOF weakness (if the NameNode dies, the whole system dies)

2. Solutions in Hadoop 2
– the overloaded JobTracker is replaced by YARN
– NameNode availability is handled with ZooKeeper

3. HBase – Thrift – Happybase
– HDFS and MapReduce are not real-time services; they are batch services
– HBase provides a real-time big data service on top of Hadoop
– Thrift and Happybase let you use it from other systems (much like a client for an RDB)

4. Docker
– Docker is a container-based system, largely independent of the underlying OS
– It lets you use roughly 95% of the original hardware capacity
– With docker build and prebuilt images, installing Hadoop systems is much faster

5. Problems to solve
– AWS : the nodes need to be in the same security group,
        and the ports used for communication must be opened (e.g. 22, 50070, etc.)
– Docker : sshd in the container needs port 22,
           and Docker on CentOS limits a container to 10 GB by default

6. AWS setting
[Link : using EC2 service]

– make a security group
– add instances using the same security group
– add inbound rules
– open all ICMP rules

Open the following TCP ports (a scripted alternative with boto3 is sketched at the end of this section): 50070, 6060, 6061, 8032, 50090, 50010, 50075, 50020, 50030, 50060, 9090, 9091, 22

– change the AWS ssh port
Change the AWS host's ssh port from 22 to something else so that Docker can use port 22 with -p 22:22. Note that after this change you must specify the new port every time you ssh into the AWS host.

vi /etc/ssh/sshd_config
----------------------------------------
# find "Port 22" and change it
Port 3022
----------------------------------------
sudo service sshd restart
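
As mentioned above, the same security-group setup can be scripted instead of clicked through the console. The sketch below uses boto3 and is only a rough outline under assumed names (group name, description) and a wide-open 0.0.0.0/0 CIDR that you would normally tighten:

# sketch: create a security group and open the Hadoop/HBase ports with boto3
import boto3

PORTS = [22, 9090, 9091, 6060, 6061, 8032, 50010, 50020, 50030,
         50060, 50070, 50075, 50090]

ec2 = boto3.client("ec2")

# create the group (name/description are placeholders)
sg = ec2.create_security_group(GroupName="hadoop-hbase-cluster",
                               Description="Hadoop/HBase cluster ports")
sg_id = sg["GroupId"]

# open each TCP port (0.0.0.0/0 is for testing only; restrict it in practice)
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{"IpProtocol": "tcp",
                    "FromPort": p,
                    "ToPort": p,
                    "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}
                   for p in PORTS])

# allow all ICMP, as in the console steps above
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{"IpProtocol": "icmp", "FromPort": -1, "ToPort": -1,
                    "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}])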

7. Docker setting
[Link : install Docker]

[Link : use Docker]

(1) Docker Build (option 1)
– download : [dockerfiles]
– unzip : unzip Dockerfiles
– rename : cp Dockerfiles-Hbase Dockerfile
– copy conf : cp hadoop-2.7.2/* .
– build : # docker build --tag=tmddno1/datastore:v1 ./

(2) Download Docker Image (option 2)

docker pull tmddno1/datastore:v1

 (3) Create Daemon Container

docker run --net=host -d <imageid>

(4) Exec Container with bash

docker exec -it <container id> bash

 (5) change sshd_config 

vi /etc/ssh/sshd_config
-----------------------------------
# change the lines below

PasswordAuthentication yes
PermitRootLogin yes
----------------------------------

/etc/init.d/ssh restart

 8. SSH Setting

[AWS pem file share]

– WinSCP Download [Download]
– upload your AWS pem file to the master node using WinSCP

[AWS pem file add on ssh-agent]

eval "$(ssh-agent -s)"
chmod 644 authorized_keys
chmod 400 <pem_keyname>
ssh-add <pem_keyname>

[ /etc/hosts]

192.168.1.109 hadoop-master 
192.168.1.145 hadoop-slave-1
192.168.56.1 hadoop-slave-2

[ssh – rsa key share]

$ ssh-keygen -t rsa 
$ ssh-copy-id -i ~/.ssh/id_rsa.pub root@master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave1 
$ ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave2 
$ chmod 0600 ~/.ssh/authorized_keys
$ exit

9. Run Hadoop

[Slaves]

vi /hadoop-2.7.2/etc/hadoop/slaves
---------------------------------------
# set hosts defined on /etc/hosts
hadoop-slave-1
hadoop-slave-2 

[modify core-site.xml]

<configuration>
  <property>
   <name>fs.default.name</name>
   <value>hdfs://hadoop-master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/root/tmp</value>
  </property>
</configuration>

 [start-all.sh]

hadoop namenode -format
start-all.sh  (= start-dfs.sh + start-yarn.sh)
stop-all.sh   (= stop-dfs.sh + stop-yarn.sh)

10. ZooKeeper

[zookeeper/conf/zoo.cfg]

server.1=master:2888:3888
server.2=slave2:2888:3888

[create the myid file]

cd /root/zookeeper
echo 1 > myid   # on the master
echo 2 > myid   # on slave 2

[start zookeeper on every server]

/zookeeper/bin/zkServer.sh start
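
To confirm that each server actually joined the ensemble, you can run zkServer.sh status, or send ZooKeeper's four-letter "ruok" command to the client port, as in the small Python sketch below (it assumes the default client port 2181 and the host names used in zoo.cfg above):

# sketch: check each ZooKeeper server with the "ruok" four-letter command
import socket

for host in ["master", "slave2"]:
    s = socket.create_connection((host, 2181), timeout=5)  # default client port
    s.sendall(b"ruok")
    print(host, s.recv(16).decode())   # a healthy server answers "imok"
    s.close()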

11. Hbase Setting

[hbase-env.sh]

export HBASE_MANAGES_ZK=false

[hbase-site.xml]

<configuration>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://master:9000/hbase</value>
</property>
<property>
  <name>hbase.master</name>
  <value>master:6000</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/root/zookeeper</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>master,slave2,slave3</value>
</property>
</configuration>

[regionservers]

# hbase/conf/regionservers

master
slave1
slave2
slave3

[start hbase]

start-hbase.sh

12. Hbase Thrift

- hbase start : start-hbase.sh
- thrift start : hbase thrift start -p <port> --infoport <port>
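
Once the Thrift server is up, a minimal happybase sketch such as the one below can be used to check that it is reachable (hadoop-master and 9090 are placeholders for your Thrift host and the port you passed with -p):

# sketch: verify the HBase Thrift server is reachable with happybase
import happybase

# host/port are assumptions: use the host running the Thrift server
# and the port given to `hbase thrift start -p <port>`
connection = happybase.Connection(host="hadoop-master", port=9090)
print(connection.tables())   # list existing HBase tables
connection.close()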

13. Check that the web UIs are running correctly

Yarn : http://localhost:8088
Hadoop : http://localhost:50070
Hbase : http://localhost:9095
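
If you prefer to script this check instead of opening a browser, a small Python sketch along these lines can probe the three UIs (the URLs are the same ones listed above; adjust the ports if you changed them):

# sketch: quick reachability check for the web UIs listed above
import urllib.request

for name, url in [("YARN", "http://localhost:8088"),
                  ("Hadoop", "http://localhost:50070"),
                  ("HBase", "http://localhost:9095")]:
    try:
        code = urllib.request.urlopen(url, timeout=5).getcode()
        print(name, url, "->", code)
    except Exception as exc:
        print(name, url, "-> FAILED:", exc)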

14. Hbase Shell Test

hbase shell

15. Install happybase (on the client server)
– happybase itself will be explained in a later post

sudo yum install gcc
pip install happybase
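
Ahead of that write-up, here is a minimal happybase sketch of what the client side looks like; the table name and column family below are made up for the example, and the Thrift server from step 12 must be running:

# sketch: basic happybase usage against the Thrift server started in step 12
import happybase

connection = happybase.Connection(host="hadoop-master", port=9090)

# create a table with one column family (skip if it already exists)
if b"test_table" not in connection.tables():
    connection.create_table("test_table", {"cf": dict()})

table = connection.table("test_table")
table.put(b"row1", {b"cf:greeting": b"hello hbase"})   # write a cell
print(table.row(b"row1"))                              # read it back

for key, data in table.scan():                         # full table scan
    print(key, data)

connection.close()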

 

Hadoop MapReduce – word count (improved)

About Map Reduce Code


 1. Ordering with MapReduce

  (A) Binary Search

We are going to write a MapReduce program that returns the top N keywords, ranked by number of appearances.
We use a PriorityQueue (java.util) for this: it keeps its smallest element at the head, and calling peek() returns that element. The code below adds an item to the queue and then removes the head whenever the queue grows beyond the requested top-N size, so only the N most frequent keywords survive.

public static void insert(PriorityQueue queue, String item, Long lValue, int topN) {
  ItemFreq head = (ItemFreq)queue.peek();

  // if the queue has fewer than topN elements, or this frequency is larger than the smallest frequency in the queue
  if (queue.size() < topN || head.getFreq() < lValue) {
    ItemFreq itemFreq = new ItemFreq(item, lValue);
    // add it to the queue first
    queue.add(itemFreq);
    // if the queue now holds more than topN elements, remove the smallest one
    // if (queue.size() > topN && head != null && head.getFreq() < lValue) {
    if (queue.size() > topN) {
        queue.remove();
    }
  }
}
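
This is not part of the Hadoop job; it is only an illustration of the same top-N logic in Python using heapq (a min-heap), assuming an input of (item, frequency) pairs like the example data later in this post:

# illustration only: the same top-N logic using Python's heapq (a min-heap)
import heapq

def insert(queue, item, freq, top_n):
    # if the heap is not full yet, or this frequency beats the current minimum
    if len(queue) < top_n or queue[0][0] < freq:
        heapq.heappush(queue, (freq, item))   # add first
        if len(queue) > top_n:                # then drop the smallest
            heapq.heappop(queue)

queue = []
for item, freq in [("import", 5), ("org", 3), ("apache", 2), ("hadoop", 4)]:
    insert(queue, item, freq, top_n=3)

# items remaining in the heap are the top-3 by frequency
print(sorted(queue, reverse=True))   # [(5, 'import'), (4, 'hadoop'), (3, 'org')]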

(B)

Hadoop Map Reduce – word count

Build & Run Example Code


 1. Download Maven
– download the Maven build tool using apt-get

sudo apt-get install maven

 2. get test source code using wget

wget https://s3.amazonaws.com/hadoopkr/source.tar.gz

 3. build source with mvn

cd /home/<user>/source/<where pom.xml>
mvn compile

5. upload local file to hadoop

hadoop fs -copyFromLocal README.txt /

 6. execute on hadoop

hadoop jar <jar file name> wordcount /README.txt /output_wordcount

About Map Reduce Code


 1. Hadoop In/Out

k1,v1 => mapper => k2,v2 => reducer =>k3,v3

Mapper<LongWritable, Text, Text, LongWritable>
Reducer<Text, LongWritable, Text, LongWritable>
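
To make this k1,v1 => k2,v2 => k3,v3 flow concrete, here is a tiny stand-alone Python simulation of the same pipeline (nothing Hadoop-specific; it just mimics the map, group-by-key, and reduce steps):

# illustration: simulate the mapper -> shuffle -> reducer flow for word count
from collections import defaultdict

lines = {0: "import org apache", 50: "import org", 90: "import apache hadoop"}

# map: (offset, line) -> [(word, 1), ...]
mapped = [(word.lower(), 1) for _, line in lines.items() for word in line.split()]

# shuffle: group values by key, e.g. "import" -> [1, 1, 1]
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# reduce: (word, [1, 1, ...]) -> (word, sum)
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)   # {'import': 3, 'org': 2, 'apache': 2, 'hadoop': 1}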

 (A) Plain Text => Key Value

If the input is already in key/value form, this step is not needed.
A good example of key/value-type data is CSV (comma-separated values).
When the input is plain text, we need to attach a key (which is essentially meaningless):
the key becomes the character offset and the value the line of text.

[Plain Text]
Let's say the snippet below is our plain-text input:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;

[Key/Value]
Converting the plain text to key/value form gives the result below.
key : character offset / value : the line of text

0, import org.apache.hadoop.fs.Path;
50, import org.apache.hadoop.conf.*;
90, import org.apache.hadoop.io.*;

 (B) Mapper:  Key(count)/Value(content)  =>   Key(word)/Value(count)

The mapper is called once per line of input (the input key is the character offset and the value is the line of text); the output is key(word), value(count).
In the mapper we specify the delimiters (comma, space, tab, and anything else we want to strip; space is the default delimiter).

[set delimeters]

new StringTokenizer(line, "\t\r\n\f |,.()<>");

[set word to lower case]

word.set(tokenizer.nextToken().toLowerCase());

[mapper class]

public class WordCount {

 public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final static LongWritable one = new LongWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line, "\t\r\n\f |,.()<>");
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken().toLowerCase());
            context.write(word, one);
        }
    }
 }
}

The output will look like the example below:

[IN – data example]

0, import org.apache.hadoop.fs.Path;
50, import org.apache.hadoop.conf.*;
90, import org.apache.hadoop.io.*;

[OUT – data example]

import , 1
org , 1
apache, 1
hadoop, 1

 (C) Reducer:  Key(word)/Value(list of counts)  =>   Key(word)/Value(total count)

What the reducer does is sum the counts; the reducer is called once per distinct word.

[IN – data example]

import , [1,1,1,1,1]
org , [1,1,1]
apache, [1,1]
hadoop, [1,1,1,1]

[OUT – data example]

import , 5
org , 3
apache, 2
hadoop, 4

[reducer code]

public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
   private LongWritable sumWritable = new LongWritable();

   public void reduce(Text key, Iterable<LongWritable> values, Context context)
     throws IOException, InterruptedException {
       long sum = 0;
       for (LongWritable val : values) {
           sum += val.get();
       }
       sumWritable.set(sum);
       context.write(key, sumWritable);
   }
}

 

 

 

Install Hadoop on Docker

  • Get Ubuntu Docker

    – docker pull ubuntu

  • Start Container
docker run -i -p 22 -p 8000:80 -v /data:/data -t <ubuntu> /bin/bash
  • Install Jdk
    sudo add-apt-repository ppa:openjdk-r/ppa  
    sudo apt-get update   
    sudo apt-get install openjdk-7-jre
  • .bashrc
    export JAVA_HOME=/usr/lib/jvm/...
    export CLASSPATH=$JAVA_HOME/lib/*:.
    export PATH=$PATH:$JAVA_HOME/bin
    
  • HADOOP 1.2.1 install

    download hadoop and unpack

    root@4aa2cda88fcc:/home/kim# wget http://apache.mirror.cdnetworks.com/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
    root@4aa2cda88fcc:/home/kim# mv ./hadoop-1.2.1.tar.gz /home/user
    root@4aa2cda88fcc:/home/kim# tar xvzf hadoop-1.2.1.tar.gz
  • SET Configuration

    set configuration on ~/.bashrc

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
export CLASSPATH=$JAVA_HOME/lib/*:.
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/home/kim/hadoop-1.2.1
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
  • set HADOOP conf
vi /home/kim/hadoop-1.2.1/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
vi mapred-site.xml

<configuration>
 <property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
 </property>
</configuration>
vi hdfs-site.xml


<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
 </property>
</configuration>
vi core-site.xml


<configuration>
 <property>
  <name>fs.default.name</name>
  <value>hdfs://4aa2cda88fcc:9000</value>
 </property>
 <property>
  <name>hadoop.tmp.dir</name>
  <value>/home/kim/temp</value>
 </property>
</configuration>
  • SET SSH
# apt-get install openssh-server
# ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# vi /etc/ssh/sshd_config 
# ==> PermitRootLogin yes
service ssh restart
# ssh localhost
  • Format namenode
hadoop namenode -format
  •  start & stop Shell script
start-all.sh  (start-dfs.sh + start-mapred.sh) 

# check that the java application is running
jps 

stop-all.sh
  • search hadoop files
hadoop fs -ls /
  • upload local file to hadoop
hadoop fs -copyFromLocal README.txt /
  • Execute word count
hadoop jar hadoop-examples-1.2.1.jar wordcount /README.txt /output_wordcount

Still working on finding a way to use "docker exec -ti containername sh".

EC2-Ubuntu-vsftpd(FTP)

1. Installation steps
 1. install : sudo apt-get install vsftpd
 2. EC2 inbound : 1024 ~ 1048 , 21 , 20
 3. config : vi /etc/vsftpd/vsftpd.conf 
 4. User Ban List : /etc/vsftpd/user_list
 5. Start : sudo service vsftpd start
2. vsftpd.conf changes
#vsftpd.conf
anonymous_enable=NO
pasv_enable=YES
pasv_min_port=1024
pasv_max_port=1048
pasv_address=<public IP> (not the DNS name)
#chroot_local_user=YES
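
Once vsftpd is running, a quick way to confirm that passive mode works through the EC2 security group is a short ftplib sketch like the one below (the host and login credentials are placeholders for your own instance and FTP user):

# sketch: verify the vsftpd passive-mode setup from a client machine
from ftplib import FTP

HOST = "your.ec2.public.ip"          # placeholder: the EC2 public IP
ftp = FTP()
ftp.connect(HOST, 21, timeout=10)    # control connection on port 21
ftp.login("ubuntu", "password")      # placeholder credentials
ftp.set_pasv(True)                   # passive mode uses ports 1024-1048
print(ftp.nlst())                    # list the home directory
ftp.quit()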

 

AWS – EC2 – Ubuntu – XRDP environment setup

If you run a server at home you go through an ISP such as SK or KT, so you cannot get a static IP, you can be blocked by a proxy server, and the electricity bill comes out far higher than you would expect. That is why I ended up seriously looking at AWS. In this post I will install Ubuntu using the EC2 service on AWS and then, since out of the box you only get terminal access, install XRDP so that we can connect with a remote desktop.
For RedHat, see http://devopscube.com/how-to-setup-gui-for-amazon-ec2-rhel-7-instance/

1. Search for AWS on Google, open the page below, create an account, and log in.

2. The list of services appears; be sure to change the region in the top right corner to Seoul, then open the EC2 service.

3. Click Launch Instance.

4. Choose whatever platform you want; here we pick Ubuntu.

5. Choose the number of CPUs, the amount of RAM, and so on (t2.micro is free for one year).

6. Network settings: go with the defaults.

7. Disk size: up to 30 GB can be configured.

8. Leave everything else at the defaults – Review and Launch.

9. Create a key pair (this is required; enter any name you like).

10. You can see that the instance has been created.

11. Download PuTTY and PuTTYgen for connecting.

12. Run PuTTYgen and convert the pem file created earlier to a ppk file:
– Load -> select the PEM file -> Save private key -> save it under the same name.

※ If you want a static IP:
VPC > Elastic IPs > Allocate New Address > Action > Associate Address > select the Instance; that instance will then always be reachable at the Elastic IP.
Note that an Elastic IP left allocated without an associated instance incurs charges.

13. Connect to the instance using PuTTY.
– Enter ubuntu as the user name.
– The connection address is <user name>@<public DNS>.

14. Load the pem file you made earlier under SSH -> Auth.

15. Once step (14) is done, connect and you will get a session like the one below.

16. Now let's set things up so that we can connect remotely from Windows via XRDP.

17. Update apt-get with the commands below:
sudo apt-get update
sudo apt-get upgrade

18. Edit the file below:
sudo vim /etc/ssh/sshd_config
Change PasswordAuthentication to yes.

19. Apply the change immediately:
sudo /etc/init.d/ssh restart

20. Set a password for the ubuntu account created earlier:
sudo -i         # temporarily gain superuser privileges
passwd ubuntu

21. Switch back to the ubuntu account:
su ubuntu
cd

22. Install the Ubuntu desktop:
export DEBIAN_FRONTEND=noninteractive
sudo -E apt-get update
sudo -E apt-get install -y ubuntu-desktop

23. Install XRDP and xfce4:
sudo apt-get install xfce4 xrdp
sudo apt-get install xfce4 xfce4-goodies

24. Make xfce4 the default session manager:
echo xfce4-session > ~/.xsession

25. Copy it to /etc/skel so new user accounts pick it up as well (repeat this whenever you add an account):
sudo cp /home/ubuntu/.xsession /etc/skel

26. Edit xrdp.ini:
sudo vim /etc/xrdp/xrdp.ini

27. In the xrdp1 section of xrdp.ini, change
port=-1
– to –
port=ask-1

28. Restart the service:

sudo service xrdp restart

29. The default RDP port is 3389; it must be allowed separately in the inbound rules.
※ If you want to change the port:
sudo vi /etc/xrdp/xrdp.ini
[globals]
port=3389    # change this value
sudo service xrdp restart

※ Open the firewall:
sudo firewall-cmd --permanent --add-port=3389/tcp
sudo firewall-cmd --reload

30. Try a remote desktop connection.

This completes the basic environment. Once I have things organized, I will update this post with installing utility services such as a Git server and WordPress on top of it, and with installing Spark, Hadoop, Eclipse, Tomcat, DeepLearning4Java, etc. to run real services.

Multi Router DMZ setting (use your computer as a server)

  1. General router info
    – SK Broadband (Mercury) router (MMC)
    . IP : http://192.168.25.1
    . ID : admin
    . PASS : last six digits of the MAC address + _admin
    – IPTime router (DVW)
    . IP : http://192.168.0.1
    . ID/PASS : all yours
  2. Find your server PC's IP
    – Terminal >> type "ip addr show"
  3. Log in to the DVW router
    – find the DMZ setting
    . set the IP from step (2) as the DMZ host
    – find out the WAN IP the DVW router is reachable at
  4. Log in to the MMC router
    – find the DMZ setting
    . set the IP from step (3) as the DMZ host

Done.

Installing Spark/R on Ubuntu

A. Install Spark
(1) Go to http://spark.apache.org/downloads.html
(2) If you do not already have a Hadoop cluster, choose a package pre-built for Hadoop
(3) Download Spark
(4) Unpack it: tar -zxvf spark-1.6.1-bin-hadoop2.6.tgz

B. Run Spark

[Command-line mode]
(1) spark-1.6.1-bin-hadoop2.6/bin$ ./pyspark
(2) Spark monitoring
16/06/01 22:03:46 INFO SparkUI: Started SparkUI at http://192.168.0.3:4040
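
As a quick sanity check inside the pyspark shell (where sc is already defined), a short RDD word count like the sketch below can be run; README.md is just an example input, any local text file will do:

# run inside the pyspark shell, where `sc` (SparkContext) already exists
rdd = sc.textFile("README.md")                      # any local text file
counts = (rdd.flatMap(lambda line: line.split())    # split lines into words
             .map(lambda word: (word, 1))           # emit (word, 1)
             .reduceByKey(lambda a, b: a + b))      # sum counts per word
print(counts.take(5))                               # show a few results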

[Master Node]
/sbin/
start-master.sh

※ The default port for the web page below is 8080 (if it is already in use, it increments by 1).

[Slave Node]
root@kim:/home/kim/spark/spark-1.6.1-bin-hadoop2.6/bin# ./spark-class org.apache.spark.deploy.worker.Worker spark://kim:7077

[Check on the master node that the slave is recognized]

C. Install R
(1) Set the root password: sudo passwd root
(2) Log in as the superuser: su
(3) See https://www.rstudio.com/products/rstudio/download-server-2/
$ sudo apt-get install r-base
$ sudo apt-get install gdebi-core
$ wget https://download2.rstudio.org/rstudio-server-0.99.902-amd64.deb
$ sudo gdebi rstudio-server-0.99.902-amd64.deb

D. Run R
(1) http://<IP>:8787
(2) Log in with your Linux account

E. Running R against the Spark cluster

(1) Set SPARK_HOME

root@kim:/home/kim/spark/spark-1.6.1-bin-hadoop2.6# export SPARK_HOME=/home/kim/spark/spark-1.6.1-bin-hadoop2.6
root@kim:/home/kim/spark/spark-1.6.1-bin-hadoop2.6# echo "$SPARK_HOME"
/home/kim/spark/spark-1.6.1-bin-hadoop2.6

(2) Load the Spark library in R

if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = "/home/kim/spark/spark-1.6.1-bin-hadoop2.6")
}

Sys.getenv("SPARK_HOME")

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
##sc <- sparkR.init(master="spark://192.168.0.3:7077")

Attaching package: 'SparkR'

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var

The following objects are masked from ‘package:base’:

    colnames, colnames<-, intersect, rank, rbind, sample, subset, summary, table, transform

(3) Create a local context

sc <- sparkR.init(master="local[*]",appName='test', sparkEnvir=list(spark.executor.memory='2g'))

(4) Create a remote context (see the full test code below)

 

[Full test code]

if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
Sys.setenv(SPARK_HOME = "/home/kim/spark/spark-1.6.1-bin-hadoop2.6")
}

Sys.getenv("SPARK_HOME")

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

sc <- sparkR.init(master="spark://kim:7077", appName='test', sparkEnvir=list(spark.executor.memory='500m'),
sparkPackages="com.databricks:spark-csv_2.11:1.0.3")

sqlContext <- sparkRSQL.init(sc)

df <- createDataFrame(sc, faithful)
head(df)

people <- read.df(sqlContext, "/home/kim/spark/spark-1.6.1-bin-hadoop2.6/examples/src/main/resources/people.json", "json")
head(people)

sparkR.stop()

--------- Result ---------

  age    name
1  NA Michael
2  30    Andy
3  19  Justin

————————–