Running the WordCount Program in IDEA

1. Preface

Goal
Use IDEA to submit a job to a Hadoop pseudo-distributed cluster running in a virtual machine, and run the official MapReduce example WordCount V1.0.

Environment

Windows 10
IDEA 2020.2.2
CentOS 7.6
Hadoop 2.9.2
Maven 3.6.3
JDK 1.8

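2. Preparing the Hadoop Environment in IDEA

Install the plugin
JetBrains provides a plugin for connecting to Hadoop clusters, which makes it convenient to browse HDFS from inside IDEA.
Go to File -> Settings -> Plugins, search for Big Data Tools, and click Install.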


Connect to HDFS


One option is to select the Hadoop installation path, but that refers to a Hadoop installed on the local machine (Windows 10).


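The second option is to connect to a remote Hadoop. Make sure the Hadoop cluster is already running before you test the connection. The second option is the one used here.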


Add dependencies
Make sure to choose versions that match your environment.

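<dependencies>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.8.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.9.2</version>
    </dependency>
</dependencies>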

3. Running the WordCount Program in IDEA

Prepare the input and output folders

$ hadoop fs -cat /demo/wordcount/input/file01
Hello World Bye World
$ hadoop fs -cat /demo/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
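To know what the job should produce for the two input files above, the counting logic can be reproduced locally. The following is only a plain-Java sketch with no Hadoop dependency: the class name LocalWordCount and its count method are hypothetical names introduced for illustration, and the code merely imitates the map step (split each line on whitespace) and the reduce step (sum the occurrences per word) rather than running the real job.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper class (not part of Hadoop): imitates WordCount locally.
final class LocalWordCount {

    // Splits every line on whitespace and sums one per occurrence of each word.
    static Map<String, Integer> count(List<String> lines) {
        // TreeMap keeps the words sorted, so the result is printed in key order.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                counts.merge(tokens.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Hello World Bye World",        // contents of file01
                "Hello Hadoop Goodbye Hadoop"); // contents of file02
        // Prints one "word<TAB>count" pair per line.
        count(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

For this input the sketch yields Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2, which is a handy reference when checking the output folder after the real job finishes.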

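Run the program
Start from the WordCount V1.0 source code. If you do not want to delete the output folder manually before every run, add the following snippet: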

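PS: if the output folder already exists on HDFS, you can delete it manually first (the delete issued from the program may fail because of folder permissions). The snippet is only there so that the job can be rerun from IDEA.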

// Check whether the output folder exists; if it does, delete it
Path path = new Path(args[1]);
FileSystem fileSystem = path.getFileSystem(conf);
if (fileSystem.exists(path)) {
    // true = delete recursively
    fileSystem.delete(path, true);
}

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);

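Set the program arguments, i.e. args[0] (the input path) and args[1] (the output path), in the Program arguments field of the IDEA Run/Debug configuration.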
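Because the snippet reads args[0] and args[1] directly, a run started without program arguments fails with an ArrayIndexOutOfBoundsException. One possible guard is sketched below. WordCountArgs and parse are hypothetical names, not part of Hadoop or of the official example, and the output path in the usage comment is an assumed value.

```java
// Hypothetical helper: validates the two program arguments before the job is configured.
final class WordCountArgs {
    final String inputPath;   // taken from args[0]
    final String outputPath;  // taken from args[1]

    private WordCountArgs(String inputPath, String outputPath) {
        this.inputPath = inputPath;
        this.outputPath = outputPath;
    }

    // Returns the parsed arguments, or throws if exactly two were not supplied.
    static WordCountArgs parse(String[] args) {
        if (args == null || args.length != 2) {
            throw new IllegalArgumentException(
                    "Usage: WordCount <input path> <output path>");
        }
        return new WordCountArgs(args[0], args[1]);
    }

    public static void main(String[] args) {
        // Example (assumed paths): /demo/wordcount/input /demo/wordcount/output
        WordCountArgs parsed = parse(args);
        System.out.println("input  = " + parsed.inputPath);
        System.out.println("output = " + parsed.outputPath);
    }
}
```

In the real driver, such a check would run at the top of main, before new Path(args[1]) is evaluated.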