This article describes how to build a Hadoop cluster on Ubuntu 18.04 running in VMware virtual machines, covering both a pseudo-distributed and a fully distributed setup.
The software versions are as follows:
Ubuntu-18.04.2-desktop-amd64
Hadoop (3.2.1 in this article)
OpenJDK (Java 8)

1. Refer to the official documentation to decide which Hadoop and JDK versions to install.
2. Create a user named hadoop with password hadoop (not mandatory); do not use the root user when building the cluster.
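A minimal sketch of these preparation steps, assuming OpenJDK 8 from the Ubuntu repositories and the hadoop-3.2.1 binary release already downloaded into the current directory (the archive name and target paths are assumptions based on the versions used later in this article):

sudo adduser hadoop                              # create the hadoop user; you will be prompted for the password
sudo usermod -aG sudo hadoop                     # optional: allow the hadoop user to use sudo
sudo apt-get install openjdk-8-jdk -y            # provides /usr/lib/jvm/java-8-openjdk-amd64
sudo tar -xzf hadoop-3.2.1.tar.gz -C /usr/local  # unpack the Hadoop release
sudo mv /usr/local/hadoop-3.2.1 /usr/local/hadoop
sudo chown -R hadoop:hadoop /usr/local/hadoop    # give the hadoop user ownership of the installation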
Configure the environment variables.

cd /usr/local/hadoop/etc/hadoop
vi hadoop-env.sh

Add the following at the end of the file.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

Continue configuring environment variables.

sudo vi /etc/bash.bashrc

Add the following at the end of the file.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Save and exit, then run the following to make the environment variables take effect.

source /etc/bash.bashrc

Afterwards, open a new terminal and run hadoop version; if it prints output like the following, the setup succeeded.
Hadoop 3.2.1
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
Compiled by rohithsharmaks on 2019-09-10T15:56Z
Compiled with protoc 2.5.0
From source with checksum 776eaf9eee9c0ffc370bcbc1888737
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.1.jar

(Run the following in a new terminal.)
cd /usr/local/hadoop
mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'
cat output/*

If the output shows "1 dfsadmin", the standalone MapReduce run is complete.
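MapReduce will not overwrite an existing output directory, so remove it before re-running the example:

rm -r output   # run from /usr/local/hadoop; the job recreates the directory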
Distributed Hadoop relies on SSH (the control scripts log in to each node over SSH to start and stop the daemons), so SSH needs to be installed.
sudo apt-get install openssh-server -y
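Optionally confirm that the SSH daemon is running; on Ubuntu the service is named ssh:

sudo systemctl status ssh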
Modify the two configuration files core-site.xml and hdfs-site.xml.

cd /usr/local/hadoop/etc/hadoop

core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop/tmp</value>
    <description>Change the temporary directory so that the system does not automatically clear the Hadoop data kept under the default temporary directory</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Number of block replicas; keeping multiple replicas on a single machine is pointless</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    <description>Local directory where the NameNode stores the fsimage files</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    <description>Local directory where the DataNode stores HDFS blocks</description>
  </property>
</configuration>

Connect to the local machine without a password.
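If ssh localhost still prompts for a password, generate and authorize a key pair first (the same three commands are used again in the fully distributed setup below):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa          # RSA key pair with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   # authorize the key for logging in to this machine
chmod 0600 ~/.ssh/authorized_keys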
ssh localhost

Visit http://localhost:9870/; if the page loads normally, the pseudo-distributed setup succeeded.
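If the page cannot be reached, HDFS has probably not been formatted and started yet; a minimal sketch of the usual steps (the NameNode is formatted only on the first run):

cd /usr/local/hadoop
bin/hdfs namenode -format   # first run only; this erases existing HDFS metadata
sbin/start-dfs.sh           # start the NameNode, SecondaryNameNode and DataNode
jps                         # should list NameNode, DataNode and SecondaryNameNode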
Shut down the pseudo-distributed cluster.

/usr/local/hadoop/sbin/stop-all.sh
rm -rf /usr/local/hadoop/tmp   # remove the files to clear the cached data

1. Clone several virtual machines (at least two nodes, a Master and a Slave).
2. Check the IP address of every machine (ifconfig).
2.1 Assume the Master's IP is 192.168.141.133.
2.2 Assume Slave1's IP is 192.168.141.134.
2.3 Assume Slave2's IP is 192.168.141.135.
2.4 Assume Slave3's IP is 192.168.141.136.
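Name resolution is needed on every node, not only on the Master: the slaves must be able to resolve master (the host used in fs.defaultFS). The mappings added to /etc/hosts in the Master step below are therefore usually appended on each machine as well:

192.168.141.133 master
192.168.141.134 slave1
192.168.141.135 slave2
192.168.141.136 slave3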
On the Master

Modify the /etc/hosts file and append the following at the end.
192.168.141.133 master
192.168.141.134 slave1
192.168.141.135 slave2
192.168.141.136 slave3

su hadoop
cd /usr/local/hadoop/
bin/hdfs namenode -format

If passwordless SSH was already configured before cloning, the following three commands are not needed.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Distribute the public key to the slaves; you may be asked for a password.
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@slave1
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@slave2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@slave3

On each Slave
su hadoop
cd /usr/local/hadoop
rm -rf tmp

Create the SSH keys.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Modify the configuration files on the Master.
core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop/tmp</value>
    <description>Change the temporary directory so that the system does not automatically clear the Hadoop data kept under the default temporary directory</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Number of block replicas</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    <final>true</final>
    <description>Local directory where the NameNode stores the fsimage files</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    <description>Local directory where the DataNode stores HDFS blocks</description>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>slave1:9001</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>true</value>
    <description>Whether datanodes should use datanode hostnames when connecting to other datanodes for data transfer.</description>
  </property>
</configuration>

mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:19888</value>
  </property>
</configuration>

yarn-site.xml
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:18040</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:18030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:18088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:18025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>master:18141</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

Write the workers file.
cd /usr/local/hadoop/etc/hadoop
echo slave1 > workers
echo slave2 >> workers
echo slave3 >> workers

Distribute the configuration directory to the slaves.
cd /usr/local/hadoop/etc
scp -r hadoop/ hadoop@slave1:/usr/local/hadoop/etc
scp -r hadoop/ hadoop@slave2:/usr/local/hadoop/etc
scp -r hadoop/ hadoop@slave3:/usr/local/hadoop/etc

Start Hadoop on the master.
start-all.sh

Check the NameNode information in its web UI (http://master:9870/, the same port as in the pseudo-distributed setup). If the DataNodes page shows the expected number of nodes, the configuration succeeded.
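A few additional checks from the Master; jps ships with the JDK and dfsadmin is a standard HDFS command, and the expected process lists follow from the configuration above:

jps                     # on the master: NameNode and ResourceManager; on each slave: DataNode and NodeManager (SecondaryNameNode runs on slave1)
hdfs dfsadmin -report   # prints the live DataNodes registered with the NameNode

The YARN ResourceManager web UI should be reachable at http://master:18088/, as set by yarn.resourcemanager.webapp.address above.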