Setting Up a Hadoop Cluster


This article describes how to build a Hadoop cluster with Ubuntu 18.04 running in VMware virtual machines, covering both pseudo-distributed and fully distributed setups.

The software versions are as follows:

Ubuntu-18.04.2-desktop-amd64
Hadoop 3.2.1
OpenJDK 8

1. Refer to the official documentation to decide which versions of Hadoop and the JDK to install.

2. Create a user named hadoop with the password hadoop (not mandatory), but do not use the root user when building the cluster; a minimal sketch of creating this user is shown below.
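
A minimal sketch of creating this user on Ubuntu, assuming the commands are run as root (the password prompt is where hadoop would be entered):

adduser hadoop            # creates the user and the home directory /home/hadoop, prompts for the password
usermod -aG sudo hadoop   # allow the hadoop user to use sudo
su - hadoop               # switch to the new user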

Setting Up the Hadoop and Java Environment

sudo -i   # install as root
cd /home/hadoop
wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar -zxf hadoop-3.2.1.tar.gz
mv ./hadoop-3.2.1 hadoop
mv ./hadoop /usr/local/
cd /usr/local
chown -R hadoop:hadoop ./hadoop   # update the owner of the directory
cd hadoop/
apt-get install -y openjdk-8-jdk openjdk-8-jre   # if this step fails, run apt-get update first

Configure the environment variables.

cd /usr/local/hadoop/etc/hadoop
vi hadoop-env.sh

Add the following at the end of the file.

JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

Continue configuring the environment variables.

vi /etc/bash.bashrc

Add the following at the end of the file.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Save and exit, then run the following command to make the environment variables take effect.

source /etc/bash.bashrc

When done, open a new terminal and run hadoop version; output like the following indicates success.

Hadoop 3.2.1
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
Compiled by rohithsharmaks on 2019-09-10T15:56Z
Compiled with protoc 2.5.0
From source with checksum 776eaf9eee9c0ffc370bcbc1888737
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.1.jar

Running in Standalone Mode

(Run the following in a new terminal.)

cd /usr/local/hadoop
mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'
cat output/*

If the output shows 1 dfsadmin, the standalone MapReduce run has completed successfully.

Installing SSH

Distributed Hadoop uses SSH for communication, so SSH must be installed.

sudo apt-get install openssh-server -y

 

Configuring Pseudo-Distributed Mode

Modify the two configuration files core-site.xml and hdfs-site.xml.

cd /usr/local/hadoop/etc/hadoop

core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop/tmp</value>
    <description>Change the temporary directory so that Hadoop data is not lost when the system cleans the default temp directory</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Number of block replicas; keeping multiple replicas on a single machine is pointless</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    <description>Local directory for storing the fsimage files</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    <description>Local directory for storing HDFS blocks</description>
  </property>
</configuration>

Set up passwordless SSH login.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Connect to the local machine without a password.

ssh localhost

Running Hadoop in Pseudo-Distributed Mode

cd /usr/local/hadoop/
bin/hdfs namenode -format
sbin/start-dfs.sh

Visit http://localhost:9870/. If the page loads normally, the pseudo-distributed setup has succeeded.
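
Optionally, the daemons can also be checked from the command line with jps (shipped with the JDK); after start-dfs.sh the output should look roughly like this (process IDs will differ):

jps
# 12345 NameNode
# 12456 DataNode
# 12678 SecondaryNameNode
# 12890 Jps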

Shutting Down the Pseudo-Distributed Cluster

/usr/local/hadoop/sbin/stop-all.sh
rm -rf /usr/local/hadoop/tmp   # delete the files to clear the cached data

Configuring Fully Distributed Mode

1. Clone several virtual machines (at least two nodes, one Master and one Slave).

2. Check the IP address of every machine (ifconfig).

2.1 Assume the Master's IP is 192.168.141.133

2.2 Assume Slave1's IP is 192.168.141.134

2.3 Assume Slave2's IP is 192.168.141.135

2.4 Assume Slave3's IP is 192.168.141.136

On the Master

Edit the /etc/hosts file and append the following entries.

192.168.141.133 master
192.168.141.134 slave1
192.168.141.135 slave2
192.168.141.136 slave3

Then switch to the hadoop user and format the NameNode:

su hadoop
cd /usr/local/hadoop/
bin/hdfs namenode -format

If passwordless SSH was already configured before cloning, the following three commands are not needed.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Distribute the public key; you may be prompted for the password.

ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@slave1
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@slave2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@slave3

On the Slaves

su hadoop
cd /usr/local/hadoop
rm -rf tmp

Create the SSH keys.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Modify the Configuration Files on the Master (under /usr/local/hadoop/etc/hadoop)

core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop/tmp</value>
    <description>Change the temporary directory so that Hadoop data is not lost when the system cleans the default temp directory</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Number of block replicas</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    <final>true</final>
    <description>Local directory for storing the fsimage files</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    <description>Local directory for storing HDFS blocks</description>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>slave1:9001</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>true</value>
    <description>Whether datanodes should use datanode hostnames when connecting to other datanodes for data transfer.</description>
  </property>
</configuration>

mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:19888</value>
  </property>
</configuration>

yarn-site.xml

<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:18040</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:18030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:18088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:18025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>master:18141</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

Write the workers file.

cd /usr/local/hadoop/etc/hadoop   # the workers file lives in the Hadoop configuration directory
echo slave1 > workers
echo slave2 >> workers
echo slave3 >> workers

Distribute the configuration to the slaves.

cd /usr/local/hadoop/etc
scp -r hadoop/ hadoop@slave1:/usr/local/hadoop/etc
scp -r hadoop/ hadoop@slave2:/usr/local/hadoop/etc
scp -r hadoop/ hadoop@slave3:/usr/local/hadoop/etc

Start Hadoop on the master.

start-all.sh

Open the NameNode web UI (http://master:9870/) to view the NameNode information. If the expected number of nodes appears under DataNodes, the configuration is successful.
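
As a quick command-line check (a sketch assuming the hostnames used in this guide), jps should show the expected daemons on each node after start-all.sh:

jps                     # on the master: expect NameNode and ResourceManager
ssh hadoop@slave1 jps   # expect DataNode, NodeManager and SecondaryNameNode
ssh hadoop@slave2 jps   # expect DataNode and NodeManager
ssh hadoop@slave3 jps   # expect DataNode and NodeManager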
