(Using the author's setup as an example:)
- JDK 1.8 (on Windows)
- Maven 3.3.9 (on Windows)
- IDEA 2017+ (on Windows)
- WinSCP (on Windows)
- Hadoop 2.7.3 (on Linux)
1. File -> New -> Project. 2. The project opens in a new window (note: change IDEA's default Maven path to your own Maven installation). Edit the pom.xml file and add the following:
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.1.0</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <!-- the class containing main(); change to your own -->
                                <mainClass>com.MyWordCount.WordCountMain</mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <source>7</source>
                <target>7</target>
            </configuration>
        </plugin>
    </plugins>
</build>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.3</version>
    </dependency>
</dependencies>

Pay special attention here: com.MyWordCount is the package name and WordCountMain is the main class name. When you create your own Java classes in the following steps, change these to match what you actually create.
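For reference, a sketch of the resulting standard Maven source layout (assuming the artifactId is MyWordCount, as the jar name in the later run step suggests):

MyWordCount/
├── pom.xml
└── src/main/java/com/MyWordCount/
    ├── WordCountMain.java
    ├── WordCountMapper.java
    └── WordCountReducer.java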
1. Create the package. 2. Create WordCountMapper.java with the following code:
package com.MyWordCount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// generic parameters: k1 v1 k2 v2
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key1, Text value1, Context context)
            throws IOException, InterruptedException {
        // input line, e.g. "I like MapReduce"
        String data = value1.toString();

        // tokenize: split the line on spaces
        String[] words = data.split(" ");

        // emit <k2, v2>: each word with a count of 1
        for (String w : words) {
            context.write(new Text(w), new IntWritable(1));
        }
    }
}

2. Create WordCountReducer.java with the following code:
package com.MyWordCount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// generic parameters: k3 v3 k4 v4
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text k3, Iterable<IntWritable> v3, Context context)
            throws IOException, InterruptedException {
        // sum the counts in v3
        int total = 0;
        for (IntWritable v : v3) {
            total += v.get();
        }

        // emit <k4 = word, v4 = frequency>
        context.write(k3, new IntWritable(total));
    }
}

3. Create WordCountMain.java with the following code:
package com.MyWordCount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountMain {

    public static void main(String[] args) throws Exception {
        // 1. create a job and its entry point
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WordCountMain.class); // the class containing main()

        // 2. set the job's mapper and its output types <k2, v2>
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);          // type of k2
        job.setMapOutputValueClass(IntWritable.class); // type of v2

        // 3. set the job's reducer and its output types <k4, v4>
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);          // type of k4
        job.setOutputValueClass(IntWritable.class); // type of v4

        // 4. set the job's input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 5. run the job
        job.waitForCompletion(true);
    }
}

1. Run the Maven package command from the IDEA terminal (see the sketch below). 2. The build completes successfully. 3. Refresh the target directory to find the generated jar.
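A minimal sketch of the packaging step, using the standard Maven lifecycle command:

# clean previous build output and build the shaded jar
mvn clean package

After a successful build, the shaded jar (MyWordCount-1.0-SNAPSHOT.jar, named after the project's artifactId and version) appears under target/.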
1. Upload the jar to the Linux machine with a file-transfer tool (WinSCP is recommended). A command-line alternative is sketched below.
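For example, the jar can also be copied with scp, assuming a hypothetical user hadoop and host hadoop01 (replace both with your own):

# copy the shaded jar from the project's target/ directory to the remote home directory
scp target/MyWordCount-1.0-SNAPSHOT.jar hadoop@hadoop01:~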
1. First check that the jar was uploaded successfully. 2. In the Linux home directory (~), create a file named ceshi.txt and enter some test content. 3. Upload the file to the root directory of HDFS (a sketch of steps 2 and 3 follows). Then check whether the upload succeeded:
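A minimal sketch of steps 2 and 3; the sample sentence is only illustrative, use any test content you like:

# create the test file in the home directory
echo "I like MapReduce" > ~/ceshi.txt
# copy it to the HDFS root directory
hdfs dfs -put ~/ceshi.txt /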
hdfs dfs -ls /

4. Run the MapReduce job. Enter:
hadoop jar MyWordCount-1.0-SNAPSHOT.jar /ceshi.txt /ceshioutput

Because the shade plugin wrote the mainClass into the jar's manifest, no main class needs to be named on the command line. 5. After the job finishes successfully, view the output it produced. Command:
hdfs dfs -cat /ceshioutput/part-r-00000

The output gives, for each distinct word in the text, the number of times it occurs (a sample is sketched below).
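For instance, if ceshi.txt contained only the illustrative line "I like MapReduce" from the sketch above, the output would be one tab-separated word/count pair per line, sorted by key:

I	1
MapReduce	1
like	1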
The WordCount requirement can be summarized as follows: count the characters, words, and lines of programming-language source files; write the statistics in a specified format to a default output file; support other extended features; and process multiple files quickly.