Hadoop YARN MR (MapReduce) streaming using shell scripts

Hello friends,
Let's see how to run a simple MapReduce program in a Linux environment.
It's a word count program in which both the mapper and the reducer are plain shell scripts.



1. Create a file words.txt with a few words, as shown below.

words.txt
--------------------------------
cow india japan
america japan
hindu muslim christian
india cow
america america america
china
india
china pakistan

2. Copy words.txt to HDFS (give an appropriate path).
hadoop fs -copyFromLocal words.txt /user/cloudera/words.txt
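
The copy step above can be wrapped so that the parent directory is created first and the upload is confirmed afterwards. This is only a sketch: it assumes the Hadoop 2.x shell options `-mkdir -p` and `-put -f`, and the function name `upload_input` is hypothetical, not something from the original setup.

```shell
# upload_input LOCAL_FILE HDFS_PATH  (hypothetical helper)
# Creates the parent directory in HDFS, copies the file, then lists it.
upload_input() {
    local src=$1 dest=$2
    hadoop fs -mkdir -p "$(dirname "$dest")" &&   # -p: no error if it already exists
    hadoop fs -put -f "$src" "$dest" &&           # -f: overwrite an existing copy
    hadoop fs -ls "$dest"                         # confirm the file is there
}

# Usage: upload_input words.txt /user/cloudera/words.txt
```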

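3. Create the mapper script wc_mapper.sh. It prints "<word> 1" for every word it reads.
wc_mapper.sh
--------------------------
#!/bin/bash
# Mapper: print "<word> 1" for every whitespace-separated word on stdin.
set -f                      # no filename expansion when $line is split below
while read -r line
do
  for word in $line         # unquoted on purpose: split the line into words
  do
    echo "$word 1"
  done
done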
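
The mapper can be tried without Hadoop by piping a couple of lines into it. A small sketch, assuming wc_mapper.sh is in the current directory; `try_mapper` is a hypothetical helper name. The script is run through bash, so the executable bit does not matter here.

```shell
# try_mapper [SCRIPT]  (hypothetical helper)
# Feeds two sample lines to the mapper script and prints what it emits.
try_mapper() {
    printf 'cow india japan\namerica japan\n' | bash "${1:-wc_mapper.sh}"
}

# Usage: try_mapper        # prints one "<word> 1" line per input word
```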


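4. Create the reducer script wc_reducer.sh. It counts each run of equal words and prints "<word><TAB><count>".
wc_reducer.sh
------------------------
#!/bin/bash
# Reducer: input arrives sorted by key, so equal words are adjacent.
cnt=0
old=''
start=0                     # becomes 1 once the first line has been read
while read -r line
do
  new=${line%% *}           # first space-separated field: the word
  if [ "$new" != "$old" ]; then
    # Word changed: print the count of the previous word, if there was one
    [ "$start" -ne 0 ] && printf '%s\t%s\n' "$old" "$cnt"
    old=$new
    cnt=1
    start=1
  else
    cnt=$(( cnt + 1 ))
  fi
done
# Print the last word; skip this when the input was empty
if [ "$start" -ne 0 ]; then
  printf '%s\t%s\n' "$old" "$cnt"
fi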
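
Between the map and reduce phases Hadoop sorts the mapper output by key, which is why the reducer only has to compare each word with the previous one. The same flow can be imitated locally with `sort`. A rough sketch, assuming both scripts and words.txt are in the current directory; `simulate_streaming` is a hypothetical name, and a single local sort only approximates what the cluster does.

```shell
# simulate_streaming INPUT MAPPER REDUCER  (hypothetical helper)
# Imitates map -> sort -> reduce on one machine.
simulate_streaming() {
    local input=$1 mapper=$2 reducer=$3
    bash "$mapper" < "$input" |
        LC_ALL=C sort |          # byte-order sort puts equal words next to each other
        bash "$reducer"
}

# Usage: simulate_streaming words.txt wc_mapper.sh wc_reducer.sh
```

Running this before submitting the job is a cheap way to catch script errors that would otherwise only show up in the task logs.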
  
5. Invoke MapReduce using the following command (give the proper paths).

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar \
  -input /user/cloudera/words.txt \
  -output /user/cloudera/op_wc \
  -mapper wc_mapper.sh \
  -reducer wc_reducer.sh \
  -file wc_mapper.sh \
  -file wc_reducer.sh
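
Hadoop refuses to start a job whose output directory already exists, so running the command a second time fails until /user/cloudera/op_wc is removed. A hedged sketch of a rerunnable wrapper, assuming the same jar path as above and the Hadoop 2.x `-rm -r` option; `run_wc_job` is a hypothetical name.

```shell
# Jar path copied from the command above; adjust it for your installation.
STREAMING_JAR=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar

# run_wc_job INPUT OUTPUT  (hypothetical helper)
run_wc_job() {
    local input=$1 output=$2
    # Remove the output of a previous run, if any
    if hadoop fs -test -d "$output"; then
        hadoop fs -rm -r "$output" || return 1
    fi
    hadoop jar "$STREAMING_JAR" \
        -input "$input" -output "$output" \
        -mapper wc_mapper.sh -reducer wc_reducer.sh \
        -file wc_mapper.sh -file wc_reducer.sh
}

# Usage: run_wc_job /user/cloudera/words.txt /user/cloudera/op_wc
```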

6. Check the files created in HDFS.
$ hadoop fs -ls -R  /user/cloudera/op_wc
-rw-r--r--   1 cloudera cloudera          0 2015-02-18 03:27 /user/cloudera/op_wc/_SUCCESS
-rw-r--r--   1 cloudera cloudera         61 2015-02-18 03:27 /user/cloudera/op_wc/part-00000


$ hadoop fs -cat  /user/cloudera/op_wc/part-00000
america    4
christian    1
cow    2
hindu    1
india    2
japan    2
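muslim    1

(This listing matches a run on only the first five lines of words.txt. With the full file from step 1, india counts 3, and china 2 and pakistan 1 also appear.)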
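
The empty _SUCCESS file in the directory listing of step 6 is the marker Hadoop writes when a job finishes successfully, so a script can test for it before reading the part files. A small sketch; `show_wc_result` is a hypothetical name.

```shell
# show_wc_result OUTPUT_DIR  (hypothetical helper)
# Prints the reducer output only if the job left its _SUCCESS marker.
show_wc_result() {
    local output=$1
    if hadoop fs -test -e "$output/_SUCCESS"; then
        hadoop fs -cat "$output/part-*"      # quoted so HDFS, not the local shell, expands the glob
    else
        echo "no _SUCCESS marker in $output" >&2
        return 1
    fi
}

# Usage: show_wc_result /user/cloudera/op_wc
```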

------------------------------------

Done! Enjoy. :)

If you run into trouble, ping me at dhanooj.world@gmail.com





