
Hadoop YARN MR (MapReduce) streaming using shell script - part 2


Friends,
This is a streaming MapReduce job (written as shell scripts) that reads any text input and computes the average length of all words that start with each character.
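In streaming mode the mapper and the reducer are plain executables: the mapper reads raw text lines from stdin and writes one "character length" pair per word to stdout, the framework sorts those pairs between the two phases, and the reducer reads the sorted pairs from stdin and prints one "character<TAB>average" line per character. Roughly, for the single line "No now" (a sketch, not actual job output):

"No now"  --mapper-->  N 2   --sort/shuffle-->  N 2   --reducer-->  N    2.00
                       n 3                      n 3                 n    3.00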

---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ cat avg_ln_mpr.sh
#!/bin/bash
# Mapper: for every word read from stdin, emit "<first character> <word length>"
while read line
do
  for word in $line
  do
    c=`expr substr "$word" 1 1`    # first character of the word
    l=`expr length "$word"`        # length of the word
    echo "$c $l"
  done
done
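
A quick local check of what the mapper emits (one "character length" pair per word) should look like this:

$ echo "No now is" | bash avg_ln_mpr.sh
N 2
n 3
i 2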
---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ cat avg_ln_rdr.sh
#!/bin/bash
# Reducer: input lines are "<character> <length>" pairs, already sorted by the
# streaming framework. Emit "<character>\t<average length>" for each character.

old=''      # key (character) of the previous record
new=''
val=''
cnt=1       # number of words counted for the current key
sum=0       # total length of those words
avg=0
start=0     # stays 0 until the first record has been read

while read line
do
  new=`echo $line | cut -d' ' -f1`   # first character (key)
  val=`echo $line | cut -d' ' -f2`   # word length (value)

  if [ "$old" != "$new" ]; then
    # key changed: print the finished group, then start a new one
    [ $start -ne 0 ] && echo -e "$old\t$avg"
    start=1
    old=$new
    cnt=1
    sum=$val
    avg=`echo "scale=2; $sum/$cnt" | bc`
  else
    cnt=$(($cnt+1))
    sum=$(($sum+$val))
    avg=`echo "scale=2; $sum/$cnt" | bc`
  fi
done
# flush the last group
echo -e "$old\t$avg"
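
A quick local check of the reducer (its input must already be sorted by key, as it is after the shuffle) should look like this:

$ printf 'N 2\nN 3\n' | bash avg_ln_rdr.sh
N    2.50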

---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ cat sample.txt
No now is definitely not the time
No now is definitely not the time
N
NNNNN
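
Before submitting to the cluster, the whole pipeline can be simulated locally; LC_ALL=C sort stands in for the shuffle (Hadoop also sorts text keys in byte order), and the result should match the job output further below:

$ cat sample.txt | bash avg_ln_mpr.sh | LC_ALL=C sort | bash avg_ln_rdr.sh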
---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ hadoop fs -put sample.txt  /user/training/sample2.txt
---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar -input /user/training/sample2.txt -output /user/training/testlog/output4 -file avg_ln_rdr.sh -file avg_ln_mpr.sh -reducer avg_ln_rdr.sh -mapper avg_ln_mpr.sh

15/02/26 02:53:32 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [avg_ln_rdr.sh, avg_ln_mpr.sh] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.3.0-cdh5.1.0.jar] /tmp/streamjob3697911235081163262.jar tmpDir=null
15/02/26 02:53:36 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/02/26 02:53:37 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/02/26 02:53:38 INFO mapred.FileInputFormat: Total input paths to process : 1
15/02/26 02:53:39 INFO mapreduce.JobSubmitter: number of splits:2
15/02/26 02:53:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1424947526534_0001
15/02/26 02:53:40 INFO impl.YarnClientImpl: Submitted application application_1424947526534_0001
15/02/26 02:53:40 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1424947526534_0001/
15/02/26 02:53:40 INFO mapreduce.Job: Running job: job_1424947526534_0001
15/02/26 02:54:03 INFO mapreduce.Job: Job job_1424947526534_0001 running in uber mode : false
15/02/26 02:54:03 INFO mapreduce.Job:  map 0% reduce 0%
15/02/26 02:54:29 INFO mapreduce.Job:  map 50% reduce 0%
15/02/26 02:54:31 INFO mapreduce.Job:  map 100% reduce 0%
15/02/26 02:54:52 INFO mapreduce.Job:  map 100% reduce 100%
15/02/26 02:54:52 INFO mapreduce.Job: Job job_1424947526534_0001 completed successfully
15/02/26 02:54:52 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=120
        FILE: Number of bytes written=286100
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=332
        HDFS: Number of bytes written=36
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=48797
        Total time spent by all reduces in occupied slots (ms)=18844
        Total time spent by all map tasks (ms)=48797
        Total time spent by all reduce tasks (ms)=18844
        Total vcore-seconds taken by all map tasks=48797
        Total vcore-seconds taken by all reduce tasks=18844
        Total megabyte-seconds taken by all map tasks=49968128
        Total megabyte-seconds taken by all reduce tasks=19296256
    Map-Reduce Framework
        Map input records=4
        Map output records=16
        Map output bytes=82
        Map output materialized bytes=126
        Input split bytes=218
        Combine input records=0
        Combine output records=0
        Reduce input groups=8
        Reduce shuffle bytes=126
        Reduce input records=16
        Reduce output records=5
        Spilled Records=32
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=743
        CPU time spent (ms)=2750
        Physical memory (bytes) snapshot=556838912
        Virtual memory (bytes) snapshot=2545086464
        Total committed heap usage (bytes)=378208256
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=114
    File Output Format Counters
        Bytes Written=36
15/02/26 02:54:52 INFO streaming.StreamJob: Output directory: /user/training/testlog/output4
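
The counters line up with the input: Map input records=4 (the four lines of sample2.txt), Map output records=16 (one "character length" pair per word) and Reduce output records=5 (one line per distinct first character). Reduce input groups is 8 rather than 5 because streaming treats everything before the first tab as the key; the mapper output has no tab, so each distinct "character length" line (N 1, N 2, N 5, d 10, i 2, n 3, t 3, t 4) counts as its own group, which still keeps identical first characters adjacent for the reducer.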
---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ hadoop fs -cat  /user/training/testlog/output4/part-00000
N    2.50
d    10.00
i    2.00
n    3.00
t    3.50
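
The averages check out against sample.txt: words starting with 't' are the, the, time, time, so (3+3+4+4)/4 = 3.50, and words starting with 'N' are No, No, N, NNNNN, so (2+2+1+5)/4 = 2.50.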

Done. :)
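
Note: as the warning in the job log says, the -file option is deprecated. The same job can be submitted with the generic -files option (generic options go before the streaming-specific ones); output5 here is just a fresh output directory, since output4 already exists:

$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar -files avg_ln_mpr.sh,avg_ln_rdr.sh -input /user/training/sample2.txt -output /user/training/testlog/output5 -mapper avg_ln_mpr.sh -reducer avg_ln_rdr.sh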
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 dhanooj.world@gmail.com
