
Hadoop YARN MR (MapReduce) streaming using shell script - part 2


Friends,
This is a streaming MapReduce job (written as shell scripts) that reads any text input and computes the average length of all words that start with each character.
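In streaming mode the mapper and the reducer are plain executables: the mapper reads raw text lines from stdin and writes one "character length" pair per word to stdout, the framework sorts those pairs between the two phases, and the reducer reads the sorted pairs from stdin and prints one "character<TAB>average" line per character. Roughly, for the single line "No now" (a sketch, not actual job output):

"No now"  --mapper-->  N 2   --sort/shuffle-->  N 2   --reducer-->  N    2.00
                       n 3                      n 3                 n    3.00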

---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ cat avg_ln_mpr.sh
#!/bin/bash
# Mapper: for every word read from stdin, emit "<first character> <word length>"
while read line
do
  for word in $line
  do
    c=`expr substr "$word" 1 1`    # first character of the word
    l=`expr length "$word"`        # length of the word
    echo "$c $l"
  done
done
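
A quick local check of what the mapper emits (one "character length" pair per word) should look like this:

$ echo "No now is" | bash avg_ln_mpr.sh
N 2
n 3
i 2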
---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ cat avg_ln_rdr.sh
#!/bin/bash
# Reducer: input lines are "<character> <length>" pairs, already sorted by the
# streaming framework. Emit "<character>\t<average length>" for each character.

old=''      # key (character) of the previous record
new=''
val=''
cnt=1       # number of words counted for the current key
sum=0       # total length of those words
avg=0
start=0     # stays 0 until the first record has been read

while read line
do
  new=`echo $line | cut -d' ' -f1`   # first character (key)
  val=`echo $line | cut -d' ' -f2`   # word length (value)

  if [ "$old" != "$new" ]; then
    # key changed: print the finished group, then start a new one
    [ $start -ne 0 ] && echo -e "$old\t$avg"
    start=1
    old=$new
    cnt=1
    sum=$val
    avg=`echo "scale=2; $sum/$cnt" | bc`
  else
    cnt=$(($cnt+1))
    sum=$(($sum+$val))
    avg=`echo "scale=2; $sum/$cnt" | bc`
  fi
done
# flush the last group
echo -e "$old\t$avg"
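
A quick local check of the reducer (its input must already be sorted by key, as it is after the shuffle) should look like this:

$ printf 'N 2\nN 3\n' | bash avg_ln_rdr.sh
N    2.50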

---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ cat sample.txt
No now is definitely not the time
No now is definitely not the time
N
NNNNN
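
Before submitting to the cluster, the whole pipeline can be simulated locally; LC_ALL=C sort stands in for the shuffle (Hadoop also sorts text keys in byte order), and the result should match the job output further below:

$ cat sample.txt | bash avg_ln_mpr.sh | LC_ALL=C sort | bash avg_ln_rdr.sh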
---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ hadoop fs -put sample.txt  /user/training/sample2.txt
---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar -input /user/training/sample2.txt -output /user/training/testlog/output4 -file avg_ln_rdr.sh -file avg_ln_mpr.sh -reducer avg_ln_rdr.sh -mapper avg_ln_mpr.sh

15/02/26 02:53:32 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [avg_ln_rdr.sh, avg_ln_mpr.sh] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.3.0-cdh5.1.0.jar] /tmp/streamjob3697911235081163262.jar tmpDir=null
15/02/26 02:53:36 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/02/26 02:53:37 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/02/26 02:53:38 INFO mapred.FileInputFormat: Total input paths to process : 1
15/02/26 02:53:39 INFO mapreduce.JobSubmitter: number of splits:2
15/02/26 02:53:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1424947526534_0001
15/02/26 02:53:40 INFO impl.YarnClientImpl: Submitted application application_1424947526534_0001
15/02/26 02:53:40 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1424947526534_0001/
15/02/26 02:53:40 INFO mapreduce.Job: Running job: job_1424947526534_0001
15/02/26 02:54:03 INFO mapreduce.Job: Job job_1424947526534_0001 running in uber mode : false
15/02/26 02:54:03 INFO mapreduce.Job:  map 0% reduce 0%
15/02/26 02:54:29 INFO mapreduce.Job:  map 50% reduce 0%
15/02/26 02:54:31 INFO mapreduce.Job:  map 100% reduce 0%
15/02/26 02:54:52 INFO mapreduce.Job:  map 100% reduce 100%
15/02/26 02:54:52 INFO mapreduce.Job: Job job_1424947526534_0001 completed successfully
15/02/26 02:54:52 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=120
        FILE: Number of bytes written=286100
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=332
        HDFS: Number of bytes written=36
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=48797
        Total time spent by all reduces in occupied slots (ms)=18844
        Total time spent by all map tasks (ms)=48797
        Total time spent by all reduce tasks (ms)=18844
        Total vcore-seconds taken by all map tasks=48797
        Total vcore-seconds taken by all reduce tasks=18844
        Total megabyte-seconds taken by all map tasks=49968128
        Total megabyte-seconds taken by all reduce tasks=19296256
    Map-Reduce Framework
        Map input records=4
        Map output records=16
        Map output bytes=82
        Map output materialized bytes=126
        Input split bytes=218
        Combine input records=0
        Combine output records=0
        Reduce input groups=8
        Reduce shuffle bytes=126
        Reduce input records=16
        Reduce output records=5
        Spilled Records=32
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=743
        CPU time spent (ms)=2750
        Physical memory (bytes) snapshot=556838912
        Virtual memory (bytes) snapshot=2545086464
        Total committed heap usage (bytes)=378208256
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=114
    File Output Format Counters
        Bytes Written=36
15/02/26 02:54:52 INFO streaming.StreamJob: Output directory: /user/training/testlog/output4
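
The counters line up with the input: Map input records=4 (the four lines of sample2.txt), Map output records=16 (one "character length" pair per word) and Reduce output records=5 (one line per distinct first character). Reduce input groups is 8 rather than 5 because streaming treats everything before the first tab as the key; the mapper output has no tab, so each distinct "character length" line (N 1, N 2, N 5, d 10, i 2, n 3, t 3, t 4) counts as its own group, which still keeps identical first characters adjacent for the reducer.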
---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ hadoop fs -cat  /user/training/testlog/output4/part-00000
N    2.50
d    10.00
i    2.00
n    3.00
t    3.50
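
The averages check out against sample.txt: words starting with 't' are the, the, time, time, so (3+3+4+4)/4 = 3.50, and words starting with 'N' are No, No, N, NNNNN, so (2+2+1+5)/4 = 2.50.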

Done. :)
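
Note: as the warning in the job log says, the -file option is deprecated. The same job can be submitted with the generic -files option (generic options go before the streaming-specific ones); output5 here is just a fresh output directory, since output4 already exists:

$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar -files avg_ln_mpr.sh,avg_ln_rdr.sh -input /user/training/sample2.txt -output /user/training/testlog/output5 -mapper avg_ln_mpr.sh -reducer avg_ln_rdr.sh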
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 dhanooj.world@gmail.com
