Based on the MapReduce model, we write a WordCount program.
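Before wiring anything into Hadoop, the map → shuffle/sort → reduce flow can be sketched in plain Python; the function names here (`mapper`, `reducer`, `simulate_wordcount`) are illustrative, not part of the tutorial's scripts:

```python
import re

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    for word in re.findall(r'\w+', line.lower()):
        yield (word, 1)

def reducer(pairs):
    # Reduce phase: pairs arrive sorted by key, so equal words form
    # contiguous runs; accumulate a running total and flush it when
    # the key changes, as the streaming reducer script does with stdin.
    cur_word, total = None, 0
    for word, val in pairs:
        if cur_word is not None and word != cur_word:
            yield (cur_word, total)
            total = 0
        cur_word = word
        total += val
    if cur_word is not None:
        yield (cur_word, total)

def simulate_wordcount(lines):
    mapped = [kv for line in lines for kv in mapper(line)]
    shuffled = sorted(mapped)  # the "shuffle/sort" phase
    return dict(reducer(shuffled))

# "spacedong" appears twice in the test sentence, every other word once.
print(simulate_wordcount(["hello my name is spacedong",
                          "welcome to the spacedong thank you"]))
```

The sorted-runs trick is the whole point: because Hadoop delivers keys to the reducer in sorted order, the reducer never needs to hold more than one word's count in memory.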
1. Map code: map_t.py
import sys
import re

p = re.compile(r'\w+')
for line in sys.stdin:
    ss = line.strip().split(' ')
    for s in ss:
        words = p.findall(s)
        if len(words) < 1:
            continue
        s_low = words[0].lower()
        # Emit one "word,1" pair per token for the reducer to aggregate.
        print(s_low + ',' + '1')

2. Reduce code: red_t.py
import sys

cur_word = None
total = 0
for line in sys.stdin:
    word, val = line.strip().split(',')
    if cur_word is None:
        cur_word = word
    if cur_word != word:
        # Key changed: the sorted input guarantees we have already seen
        # every occurrence of cur_word, so flush its count and reset.
        print('%s\t%s' % (cur_word, total))
        cur_word = word
        total = 0
    total += int(val)
# Flush the count for the final word.
if cur_word is not None:
    print('%s\t%s' % (cur_word, total))

3. Create the test file
hello my name is spacedong welcome to the spacedong thank you

4. Test with a local shell pipeline
5. Upload the test file to HDFS
6. run.sh code
HADOOP_CMD="/home/hadoop/hadoop-2.6.5/bin/hadoop"
STREAM_JAR_PATH="/home/hadoop/hadoop-2.6.5/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar"
INPUT_FILE_PATH_1="/data/The_Man_of_Property.txt"
OUTPUT_PATH="/output/wc"

# Remove any previous output; a streaming job fails if the output path exists.
$HADOOP_CMD fs -rm -r -skipTrash $OUTPUT_PATH

$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH_1 \
    -output $OUTPUT_PATH \
    -mapper "python map_t.py" \
    -reducer "python red_t.py" \
    -file ./map_t.py \
    -file ./red_t.py

7. Run run.sh
8. Download wc from HDFS
9. View wc
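Steps 3 through 9 can be sanity-checked as one sequence. The local part below swaps `map_t.py` and `red_t.py` for standard Unix tools (`tr`/`sort`/`uniq -c`) as a rough stand-in for the map → shuffle → reduce pipeline; the Hadoop commands are shown commented out, since they need a live cluster, and their paths are assumptions:

```shell
# Step 3: create the test file.
printf 'hello my name is spacedong welcome to the spacedong thank you\n' > test.txt

# Step 4 (local analog): split into words (map), sort (shuffle),
# then count runs of equal lines (reduce).
tr ' ' '\n' < test.txt | sort | uniq -c | sort -rn

# Steps 5 and 7-9 require a running cluster; paths here are assumptions:
# hadoop fs -put test.txt /data/       # 5. upload the test file to HDFS
# sh run.sh                            # 7. run the streaming job
# hadoop fs -get /output/wc ./wc       # 8. download wc from HDFS
# cat ./wc/part-00000                  # 9. view the word counts
```

Running the mapper and reducer through a local pipe before submitting the job is the cheapest way to catch script bugs, since Hadoop Streaming surfaces them only as opaque job failures.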