Friends,
This is a Hadoop Streaming MapReduce job, written as two shell scripts, that reads any text input and computes, for each starting character, the average length of the words that begin with it. With streaming, the mapper and reducer are ordinary executables that read records on stdin and write key/value lines on stdout; the framework sorts the mapper's output by key before handing it to the reducer.
---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ cat avg_ln_mpr.sh
#!/bin/bash
# Mapper: for every word on stdin, emit "<first character> <word length>".
while read line
do
    for word in $line
    do
        c=`expr substr "$word" 1 1`    # first character of the word
        l=`expr length "$word"`        # length of the word
        echo $c $l
    done
done
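If you want to see what the mapper emits before going anywhere near the cluster, just pipe a line through it (invoked with bash here so the execute bit doesn't matter; the output below is simply what the script's logic produces, not captured from the cluster):

$ echo "No now is" | bash avg_ln_mpr.sh
N 2
n 3
i 2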
---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ cat avg_ln_rdr.sh
#!/bin/bash
# Reducer: input arrives sorted by key, so all "<char> <length>" records for
# one character are contiguous. Keep a running count and sum per key, and
# print "<char><TAB><average>" each time the key changes.
old=''
new=''
val=''
cnt=1
sum=0
avg=0
start=0
while read line
do
    new=`echo $line|cut -d' ' -f1`
    val=`echo $line|cut -d' ' -f2`
    if [ "$old" != "$new" ]; then
        # Key changed: flush the previous key's average
        # (skipped on the very first record, when there is nothing to flush).
        [ $start -ne 0 ] && echo -e "$old\t$avg"
        start=1
        old=$new
        cnt=1
        sum=$val
        avg=`echo "scale=2; $sum/$cnt"|bc`
    else
        cnt=$(($cnt+1))
        sum=$(($sum+$val))
        avg=`echo "scale=2; $sum/$cnt"|bc`
    fi
done
# Flush the last key.
echo -e "$old\t$avg"
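The reducer can be exercised by hand too, as long as you feed it key-sorted input the way Hadoop would. Again, the output shown is just what the script computes for this input:

$ printf 'N 2\nN 1\nn 3\n' | bash avg_ln_rdr.sh
N	1.50
n	3.00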
---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ cat sample.txt
No now is definitely not the time
No now is definitely not the time
N
NNNNN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
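Before submitting anything, the whole job can be simulated locally: pipe the sample through the mapper, sort (LC_ALL=C mimics Hadoop's byte-order sort, which is why 'N' lands before the lowercase keys), then through the reducer. The result below is what the scripts produce for sample.txt, and it should match the cluster output later:

$ cat sample.txt | bash avg_ln_mpr.sh | LC_ALL=C sort | bash avg_ln_rdr.sh
N	2.50
d	10.00
i	2.00
n	3.00
t	3.50
---------------------------------------------------------------------------------------------------------------------------------------------------------------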
$ hadoop fs -put sample.txt /user/training/sample2.txt
---------------------------------------------------------------------------------------------------------------------------------------------------------------
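One small thing worth checking first: depending on your setup, the scripts may need to be executable before they are shipped to the cluster with -file:

$ chmod +x avg_ln_mpr.sh avg_ln_rdr.sh
---------------------------------------------------------------------------------------------------------------------------------------------------------------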
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar -input /user/training/sample2.txt -output /user/training/testlog/output4 -file avg_ln_rdr.sh -file avg_ln_mpr.sh -reducer avg_ln_rdr.sh -mapper avg_ln_mpr.sh
15/02/26 02:53:32 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [avg_ln_rdr.sh, avg_ln_mpr.sh] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.3.0-cdh5.1.0.jar] /tmp/streamjob3697911235081163262.jar tmpDir=null
15/02/26 02:53:36 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/02/26 02:53:37 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/02/26 02:53:38 INFO mapred.FileInputFormat: Total input paths to process : 1
15/02/26 02:53:39 INFO mapreduce.JobSubmitter: number of splits:2
15/02/26 02:53:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1424947526534_0001
15/02/26 02:53:40 INFO impl.YarnClientImpl: Submitted application application_1424947526534_0001
15/02/26 02:53:40 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1424947526534_0001/
15/02/26 02:53:40 INFO mapreduce.Job: Running job: job_1424947526534_0001
15/02/26 02:54:03 INFO mapreduce.Job: Job job_1424947526534_0001 running in uber mode : false
15/02/26 02:54:03 INFO mapreduce.Job: map 0% reduce 0%
15/02/26 02:54:29 INFO mapreduce.Job: map 50% reduce 0%
15/02/26 02:54:31 INFO mapreduce.Job: map 100% reduce 0%
15/02/26 02:54:52 INFO mapreduce.Job: map 100% reduce 100%
15/02/26 02:54:52 INFO mapreduce.Job: Job job_1424947526534_0001 completed successfully
15/02/26 02:54:52 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=120
        FILE: Number of bytes written=286100
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=332
        HDFS: Number of bytes written=36
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=48797
        Total time spent by all reduces in occupied slots (ms)=18844
        Total time spent by all map tasks (ms)=48797
        Total time spent by all reduce tasks (ms)=18844
        Total vcore-seconds taken by all map tasks=48797
        Total vcore-seconds taken by all reduce tasks=18844
        Total megabyte-seconds taken by all map tasks=49968128
        Total megabyte-seconds taken by all reduce tasks=19296256
    Map-Reduce Framework
        Map input records=4
        Map output records=16
        Map output bytes=82
        Map output materialized bytes=126
        Input split bytes=218
        Combine input records=0
        Combine output records=0
        Reduce input groups=8
        Reduce shuffle bytes=126
        Reduce input records=16
        Reduce output records=5
        Spilled Records=32
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=743
        CPU time spent (ms)=2750
        Physical memory (bytes) snapshot=556838912
        Virtual memory (bytes) snapshot=2545086464
        Total committed heap usage (bytes)=378208256
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=114
    File Output Format Counters
        Bytes Written=36
15/02/26 02:54:52 INFO streaming.StreamJob: Output directory: /user/training/testlog/output4
---------------------------------------------------------------------------------------------------------------------------------------------------------------
$ hadoop fs -cat /user/training/testlog/output4/part-00000
N 2.50
d 10.00
i 2.00
n 3.00
t 3.50
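A quick check against the input confirms the numbers: the words starting with N are No, No, N, and NNNNN, so the average is (2+2+1+5)/4 = 2.50, and "definitely" is the only d word, giving 10.00.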
Done. :)
---------------------------------------------------------------------------------------------------------------------------------------------------------------
dhanooj.world@gmail.com