Hadoop + Spark learning notes

The following is purely a set of personal study notes with no claim to authority; it is a record and summary made during my own learning. It may contain personal views and interpretations and is for reference only; readers should think independently and verify the information before applying it. Before citing it in a formal or professional setting, please consult authoritative, officially published material.

Server environment

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04 LTS"

Linux zj 6.8.0-35-generic #35-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 15:51:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Download Hadoop 3.3.6

https://hadoop.apache.org/release/3.3.6.html

Download tar.gz

Copy hadoop-3.3.6.tar.gz to the server
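For example, from a local machine the archive can be copied over with scp (the user, host, and target directory below are placeholders):

# Copy the downloaded archive to the server
scp hadoop-3.3.6.tar.gz your_user@your_server_ip:/tmp/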

Install Hadoop

Prerequisites

Supported Java Versions
Apache Hadoop 3.3 and upper supports Java 8 and Java 11 (runtime only)

$ sudo apt-get install openjdk-11-jdk
$ sudo apt-get install ssh
$ sudo apt-get install pdsh

# Disable the firewall (to simplify testing)
systemctl stop ufw.service
systemctl disable ufw.service

# Create a hadoop user
useradd -r -m -s /bin/bash hadoop
passwd hadoop

export VISUAL=vim
sudo -E visudo

# Add this line in visudo so the hadoop user can use sudo:
hadoop ALL=(ALL) ALL

Extract

sudo tar -xzvf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop
sudo chown -R hadoop:hadoop /usr/local/hadoop

Configure JAVA_HOME

# Locate the java binary
root@zj:~# ls -l $(which java)
lrwxrwxrwx 1 root root 22 Apr 17 14:23 /usr/bin/java -> /etc/alternatives/java
root@zj:~# ls -lrt /etc/alternatives/java
lrwxrwxrwx 1 root root 43 Apr 17 14:23 /etc/alternatives/java -> /usr/lib/jvm/java-11-openjdk-amd64/bin/java

vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh

# Set export JAVA_HOME in hadoop-env.sh as follows:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64


# Verify it runs
/usr/local/hadoop/bin/hadoop
# If it prints the usage documentation, the configuration is correct

# Set the HADOOP_HOME environment variable so commands are easier to run
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
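To make these variables persist across logins, one option is to append them to the hadoop user's ~/.bashrc (a sketch; sbin is added as well so that start-dfs.sh / start-yarn.sh can be run directly):

cat >> ~/.bashrc <<'EOF'
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
EOF
source ~/.bashrc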

Configure Hadoop (pseudo-distributed)

Edit etc/hadoop/core-site.xml as follows:


<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/hadoopdata</value>
    </property>
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>hadoop</value>
    </property>
</configuration>
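Since hadoop.tmp.dir points to /home/hadoop/hadoopdata, it does not hurt to create that directory up front and make sure the hadoop user owns it (a small optional step):

mkdir -p /home/hadoop/hadoopdata
sudo chown -R hadoop:hadoop /home/hadoop/hadoopdata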

Edit etc/hadoop/hdfs-site.xml as follows:


<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Set up passwordless SSH login to localhost

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

# Test logging in
ssh localhost

Format the filesystem:

$ bin/hdfs namenode -format   # do not run this more than once
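If the NameNode ever does need to be re-formatted (for example after a cluster ID mismatch keeps the DataNode from starting), one common approach is to stop HDFS and clear the data directory first; note that this destroys all HDFS data (the path assumes the hadoop.tmp.dir configured above):

sbin/stop-dfs.sh
rm -rf /home/hadoop/hadoopdata/*
bin/hdfs namenode -format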

Start the NameNode daemon and the DataNode daemon:

$ sbin/start-dfs.sh

If there are no errors, the configuration succeeded.
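A quick way to double-check that the daemons are running is jps; in this pseudo-distributed setup it should list a NameNode, a DataNode, and a SecondaryNameNode (the PIDs below are just examples):

jps
# 12345 NameNode
# 12456 DataNode
# 12678 SecondaryNameNode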

Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

Browse the web interface for the NameNode; by default it is available at http://ip:9870/

HDFS file browsing, upload, and deletion

Via the web interface

Via the command line

hdfs dfs -mkdir xxx
hdfs dfs -put xxx xxx
hdfs dfs -ls xx
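A concrete round trip might look like the following (the paths are placeholders chosen for illustration):

hdfs dfs -mkdir -p /user/hadoop/input          # create a directory in HDFS
hdfs dfs -put /etc/hosts /user/hadoop/input/   # upload a local file
hdfs dfs -ls /user/hadoop/input                # list the directory
hdfs dfs -rm -r /user/hadoop/input             # delete it again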

YARN job configuration

etc/hadoop/mapred-site.xml:


<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>

etc/hadoop/yarn-site.xml:


<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

Start the ResourceManager daemon and the NodeManager daemon:

start-yarn.sh

Browse the web interface for the ResourceManager; by default it is available at http://ip:8088/

Run a test job

hadoop jar hadoop-mapreduce-examples-3.3.6.jar pi 10 50

If the output contains Estimated value of Pi is 3.16000000000000000000, the job succeeded.

A record of the job also appears in the web UI.
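Finished applications can also be listed from the command line (assuming YARN is running):

yarn application -list -appStates FINISHED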

Install Spark

sudo mkdir /usr/local/spark
sudo mv spark-3.5.1-bin-hadoop3/* /usr/local/spark
sudo chown -R hadoop:hadoop /usr/local/spark
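Optionally, a SPARK_HOME variable and PATH entry make the Spark commands easier to run, mirroring the HADOOP_HOME setup above (a convenience sketch for ~/.bashrc):

export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH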

Modify the configuration

spark-env.sh:

cp  /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
vim /usr/local/spark/conf/spark-env.sh

# Add the following
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop/

Verify

cd /usr/local/spark
bin/run-example SparkPi
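The example produces a lot of log output; the interesting line is the Pi estimate, which can be filtered out like this (the exact value varies from run to run):

bin/run-example SparkPi 2>&1 | grep "Pi is roughly"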

Verify Spark on YARN

bin/spark-submit --master yarn --deploy-mode client --driver-memory 512m --executor-memory 512m --num-executors 2 --executor-cores 1 /usr/local/spark/examples/src/main/python/pi.py 10
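For an interactive check, a Spark shell can also be started against YARN; the application should then show up as RUNNING in the ResourceManager web UI (the memory values are deliberately small for a test box):

/usr/local/spark/bin/spark-shell --master yarn --deploy-mode client --driver-memory 512m --executor-memory 512m --num-executors 1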