Hadoop YARN MapReduce streaming using shell scripts, part 2


Friends,
This is a streaming MapReduce job, written as two shell scripts, that reads any text input and computes the average length of all words starting with each character.

---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ cat avg_ln_mpr.sh
#!/bin/bash
# Mapper: for every word on stdin, emit "<first character> <word length>".
while read line
do
  for word in $line
  do
    c=`expr substr "$word" 1 1`   # first character of the word
    l=`expr length "$word"`       # length of the word
    echo $c $l
  done
done
---------------------------------------------------------------------------------------------------------------------------------------------------------------
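A quick local sanity check of the mapper (assuming you have made the script executable with chmod +x avg_ln_mpr.sh):

$ echo "No now is" | ./avg_ln_mpr.sh
N 2
n 3
i 2
---------------------------------------------------------------------------------------------------------------------------------------------------------------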
$ cat avg_ln_rdr.sh
#!/bin/bash
# Reducer: lines arrive sorted by key, so equal keys are contiguous.
# Keep a running sum/count per key and emit "<key>\t<average>" each
# time the key changes, plus once more for the final group at EOF.
old=''
new=''
val=''
cnt=1
sum=0
avg=0
start=0
while read line
do
  new=`echo $line|cut -d' ' -f1`    # key: first character
  val=`echo $line|cut -d' ' -f2`    # value: word length
  if [ "$old" != "$new" ]; then
    # key changed: flush the previous group (skipped on the first line)
    [ $start -ne 0 ] && echo -e "$old\t$avg"
    start=1
    old=$new
    cnt=1
    sum=$val
    avg=`echo "scale=2; $sum/$cnt"|bc`
  else
    cnt=$(($cnt+1))
    sum=$(($sum+$val))
    avg=`echo "scale=2; $sum/$cnt"|bc`
  fi
done
# flush the last group
echo -e "$old\t$avg"

---------------------------------------------------------------------------------------------------------------------------------------------------------------
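The reducer relies on its input arriving sorted by key, which Hadoop's shuffle guarantees, so lines sharing a first character are contiguous. You can exercise it locally on one hand-sorted group, e.g. four "N"-keyed pairs with lengths 1, 2, 2 and 5 (again assuming chmod +x):

$ printf 'N 1\nN 2\nN 2\nN 5\n' | ./avg_ln_rdr.sh
N    2.50
---------------------------------------------------------------------------------------------------------------------------------------------------------------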
$ cat sample.txt
No now is definitely not the time
No now is definitely not the time
N
NNNNN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
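Before submitting to the cluster, the whole job can be simulated locally: Hadoop streaming is just stdin/stdout with a sort between the two stages, and LC_ALL=C gives sort the same byte-wise ordering the framework uses, so this should reproduce the cluster output shown further down:

$ cat sample.txt | ./avg_ln_mpr.sh | LC_ALL=C sort | ./avg_ln_rdr.sh
N    2.50
d    10.00
i    2.00
n    3.00
t    3.50
---------------------------------------------------------------------------------------------------------------------------------------------------------------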
$ hadoop fs -put sample.txt  /user/training/sample2.txt
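A quick check that the file landed where expected:
$ hadoop fs -cat /user/training/sample2.txt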
---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar -input /user/training/sample2.txt -output /user/training/testlog/output4 -file avg_ln_rdr.sh -file avg_ln_mpr.sh -reducer avg_ln_rdr.sh -mapper avg_ln_mpr.sh

15/02/26 02:53:32 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [avg_ln_rdr.sh, avg_ln_mpr.sh] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.3.0-cdh5.1.0.jar] /tmp/streamjob3697911235081163262.jar tmpDir=null
15/02/26 02:53:36 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/02/26 02:53:37 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/02/26 02:53:38 INFO mapred.FileInputFormat: Total input paths to process : 1
15/02/26 02:53:39 INFO mapreduce.JobSubmitter: number of splits:2
15/02/26 02:53:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1424947526534_0001
15/02/26 02:53:40 INFO impl.YarnClientImpl: Submitted application application_1424947526534_0001
15/02/26 02:53:40 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1424947526534_0001/
15/02/26 02:53:40 INFO mapreduce.Job: Running job: job_1424947526534_0001
15/02/26 02:54:03 INFO mapreduce.Job: Job job_1424947526534_0001 running in uber mode : false
15/02/26 02:54:03 INFO mapreduce.Job:  map 0% reduce 0%
15/02/26 02:54:29 INFO mapreduce.Job:  map 50% reduce 0%
15/02/26 02:54:31 INFO mapreduce.Job:  map 100% reduce 0%
15/02/26 02:54:52 INFO mapreduce.Job:  map 100% reduce 100%
15/02/26 02:54:52 INFO mapreduce.Job: Job job_1424947526534_0001 completed successfully
15/02/26 02:54:52 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=120
        FILE: Number of bytes written=286100
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=332
        HDFS: Number of bytes written=36
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=48797
        Total time spent by all reduces in occupied slots (ms)=18844
        Total time spent by all map tasks (ms)=48797
        Total time spent by all reduce tasks (ms)=18844
        Total vcore-seconds taken by all map tasks=48797
        Total vcore-seconds taken by all reduce tasks=18844
        Total megabyte-seconds taken by all map tasks=49968128
        Total megabyte-seconds taken by all reduce tasks=19296256
    Map-Reduce Framework
        Map input records=4
        Map output records=16
        Map output bytes=82
        Map output materialized bytes=126
        Input split bytes=218
        Combine input records=0
        Combine output records=0
        Reduce input groups=8
        Reduce shuffle bytes=126
        Reduce input records=16
        Reduce output records=5
        Spilled Records=32
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=743
        CPU time spent (ms)=2750
        Physical memory (bytes) snapshot=556838912
        Virtual memory (bytes) snapshot=2545086464
        Total committed heap usage (bytes)=378208256
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=114
    File Output Format Counters
        Bytes Written=36
15/02/26 02:54:52 INFO streaming.StreamJob: Output directory: /user/training/testlog/output4
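Note the deprecation warning at the top of the log: the -file option has been replaced by the generic -files option, which takes a comma-separated list and must appear before the streaming-specific options. A hypothetical resubmission (with a fresh output path, since the job will not overwrite an existing output directory) would look like:

$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar -files avg_ln_mpr.sh,avg_ln_rdr.sh -input /user/training/sample2.txt -output /user/training/testlog/output5 -mapper avg_ln_mpr.sh -reducer avg_ln_rdr.sh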
---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ hadoop fs -cat  /user/training/testlog/output4/part-00000
N    2.50
d    10.00
i    2.00
n    3.00
t    3.50
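The "N" row is easy to verify by hand: the words starting with N are No, No, N and NNNNN, with lengths 2, 2, 1 and 5, and bc agrees with the reducer:

$ echo "scale=2; (2+2+1+5)/4" | bc
2.50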

Done. :)
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 dhanooj.world@gmail.com
