(Using the author's setup as an example:)
- JDK 1.8 (on Windows)
- Maven 3.3.9 (on Windows)
- IDEA 2017+ (on Windows)
- WinSCP (on Windows)
- Hadoop 2.7.3 (on Linux)
1. File -> New -> Project. 2. The project opens in a new window (note: change IDEA's default Maven path to your own Maven installation). Edit the pom.xml file and add the following:
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.1.0</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <!-- the class containing main(); change to your own -->
                                <mainClass>com.MyWordCount.WordCountMain</mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <source>7</source>
                <target>7</target>
            </configuration>
        </plugin>
    </plugins>
</build>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.3</version>
    </dependency>
</dependencies>

Pay special attention here: com.MyWordCount is the package name and WordCountMain is the main class name. When you create your own Java classes in the following steps, change these to match what you actually create.
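For reference, a sketch of the resulting standard Maven source layout (assuming the artifactId is MyWordCount, as the jar name in the later run step suggests):

MyWordCount/
├── pom.xml
└── src/main/java/com/MyWordCount/
    ├── WordCountMain.java
    ├── WordCountMapper.java
    └── WordCountReducer.java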
1. Create the package. 2. Create WordCountMapper.java with the following code:
package com.MyWordCount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// generic parameters: k1 v1 k2 v2
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key1, Text value1, Context context)
            throws IOException, InterruptedException {
        // input line, e.g. "I like MapReduce"
        String data = value1.toString();

        // tokenize: split the line on spaces
        String[] words = data.split(" ");

        // emit <k2, v2>: each word with a count of 1
        for (String w : words) {
            context.write(new Text(w), new IntWritable(1));
        }
    }
}

2. Create WordCountReducer.java with the following code:
package com.MyWordCount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// generic parameters: k3 v3 k4 v4
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text k3, Iterable<IntWritable> v3, Context context)
            throws IOException, InterruptedException {
        // sum the counts in v3
        int total = 0;
        for (IntWritable v : v3) {
            total += v.get();
        }

        // emit <k4 = word, v4 = frequency>
        context.write(k3, new IntWritable(total));
    }
}

3. Create WordCountMain.java with the following code:
package com.MyWordCount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountMain {

    public static void main(String[] args) throws Exception {
        // 1. create a job and its entry point
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WordCountMain.class); // the class containing main()

        // 2. set the job's mapper and its output types <k2, v2>
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);          // type of k2
        job.setMapOutputValueClass(IntWritable.class); // type of v2

        // 3. set the job's reducer and its output types <k4, v4>
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);          // type of k4
        job.setOutputValueClass(IntWritable.class); // type of v4

        // 4. set the job's input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 5. run the job
        job.waitForCompletion(true);
    }
}

1. Run the Maven package command from the IDEA terminal (see the sketch below). 2. The build completes successfully. 3. Refresh the target directory to find the generated jar.
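A minimal sketch of the packaging step, using the standard Maven lifecycle command:

# clean previous build output and build the shaded jar
mvn clean package

After a successful build, the shaded jar (MyWordCount-1.0-SNAPSHOT.jar, named after the project's artifactId and version) appears under target/.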
1. Upload the jar to the Linux machine with a file-transfer tool (WinSCP is recommended). A command-line alternative is sketched below.
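For example, the jar can also be copied with scp, assuming a hypothetical user hadoop and host hadoop01 (replace both with your own):

# copy the shaded jar from the project's target/ directory to the remote home directory
scp target/MyWordCount-1.0-SNAPSHOT.jar hadoop@hadoop01:~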
1. First check that the jar was uploaded successfully. 2. In the Linux home directory (~), create a file named ceshi.txt and enter some test content. 3. Upload the file to the root directory of HDFS (a sketch of steps 2 and 3 follows). Then check whether the upload succeeded:
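A minimal sketch of steps 2 and 3; the sample sentence is only illustrative, use any test content you like:

# create the test file in the home directory
echo "I like MapReduce" > ~/ceshi.txt
# copy it to the HDFS root directory
hdfs dfs -put ~/ceshi.txt /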
hdfs dfs -ls /

4. Run the MapReduce job. Enter:
hadoop jar MyWordCount-1.0-SNAPSHOT.jar /ceshi.txt /ceshioutput

Because the shade plugin wrote the mainClass into the jar's manifest, no main class needs to be named on the command line. 5. After the job finishes successfully, view the output it produced. Command:
hdfs dfs -cat /ceshioutput/part-r-00000

The output gives, for each distinct word in the text, the number of times it occurs (a sample is sketched below).
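For instance, if ceshi.txt contained only the illustrative line "I like MapReduce" from the sketch above, the output would be one tab-separated word/count pair per line, sorted by key:

I	1
MapReduce	1
like	1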
The WordCount requirement can be summarized as follows: count the characters, words, and lines of programming-language source files; write the statistics in a specified format to a default output file; support other extended features; and process multiple files quickly.