Contents
- 1 Installing Hive 2.3
- 2 Integrating Hive with the Tez Engine
- 2.1 Installing Tez
- 2.2 Integrating Tez
- 2.3 Testing
- 2.4 Notes
- 2.4.1 Insert fails after integrating Tez
- 2.4.2 Solution
Welcome to visit my personal tech blog: http://rukihuang.xyz/
The material follows the Shangguigu video course 尚硅谷大数据项目数据仓库,电商数仓V1.2新版. Respect!
1 Installing Hive 2.3
- Upload apache-hive-2.3.6-bin.tar.gz to the /opt/software directory and extract it into /opt/module:
tar -zxvf apache-hive-2.3.6-bin.tar.gz -C /opt/module/
- Rename apache-hive-2.3.6-bin to hive:
mv apache-hive-2.3.6-bin hive
- Copy the MySQL driver mysql-connector-java-5.1.27-bin.jar into /opt/module/hive/lib/:
cp /opt/software/mysql-libs/mysql-connector-java-5.1.27/mysql-connector-java-5.1.27-bin.jar /opt/module/hive/lib/
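As a quick optional check of my own (not in the original steps), confirm the driver jar is now visible to Hive:
ls /opt/module/hive/lib | grep mysql-connector   # should print mysql-connector-java-5.1.27-bin.jar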
- Create a hive-site.xml file under /opt/module/hive/conf:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop102:3306/metastore?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
        <description>password to use against metastore database</description>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
        <description>location of default database for the warehouse</description>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <property>
        <name>datanucleus.schema.autoCreateAll</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.execution.engine</name>
        <value>tez</value>
    </property>
</configuration>
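Note (my addition): because hive.metastore.schema.verification is false and datanucleus.schema.autoCreateAll is true, Hive creates the metastore tables automatically on first use. As an alternative sketch, Hive 2.x also ships a schematool that initializes the MySQL metastore schema explicitly, assuming the hive-site.xml above is in place:
bin/schematool -dbType mysql -initSchema   # run from /opt/module/hive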
- Start Hive (from /opt/module/hive):
bin/hive
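For a quick non-interactive smoke test (my addition), the -e flag runs a single statement and exits:
bin/hive -e "show databases;"   # should print at least the default database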
2 Integrating Hive with the Tez Engine
- Tez is an execution engine for Hive with better performance than MR.
[Figure: a chain of four dependent MR jobs vs. the equivalent single Tez DAG]
- If Hive compiles a query straight into MR programs, suppose there are four MR jobs with dependencies between them; in the figure above, green marks the ReduceTasks, and the cloud shapes mark write barriers where intermediate results have to be persisted to HDFS.
- Tez can merge multiple dependent jobs into a single job, so HDFS is written only once and there are fewer intermediate stages, which greatly improves job performance.
2.1 Installing Tez
- Copy apache-tez-0.9.1-bin.tar.gz to the /opt/software directory on hadoop102.
- Upload apache-tez-0.9.1-bin.tar.gz to the /tez directory on HDFS (so all cluster nodes can share it):
hadoop fs -mkdir /tez
hadoop fs -put /opt/software/apache-tez-0.9.1-bin.tar.gz /tez
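An optional check of mine that the archive landed where tez.lib.uris will later point:
hadoop fs -ls /tez   # should list apache-tez-0.9.1-bin.tar.gz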
- Extract apache-tez-0.9.1-bin.tar.gz locally:
tar -zxvf apache-tez-0.9.1-bin.tar.gz -C /opt/module
- Rename it (in /opt/module):
mv apache-tez-0.9.1-bin/ tez-0.9.1
2.2 Integrating Tez
- Go to Hive's configuration directory: /opt/module/hive/conf
- Create a tez-site.xml file under /opt/module/hive/conf:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>tez.lib.uris</name>
        <value>${fs.defaultFS}/tez/apache-tez-0.9.1-bin.tar.gz</value>
    </property>
    <property>
        <name>tez.use.cluster.hadoop-libs</name>
        <value>true</value>
    </property>
    <property>
        <name>tez.history.logging.service.class</name>
        <value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
    </property>
</configuration>
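Since tez.lib.uris must resolve to the archive uploaded in section 2.1, it is worth confirming the HDFS path exists (an optional check of mine):
hadoop fs -ls /tez/apache-tez-0.9.1-bin.tar.gz   # the exact path tez.lib.uris points to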
- Add the Tez environment variables and dependency-jar settings to hive-env.sh:
mv hive-env.sh.template hive-env.sh
# Set HADOOP_HOME to point to a specific hadoop install directory
export HADOOP_HOME=/opt/module/hadoop-2.7.2

# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/opt/module/hive/conf

# Folder containing extra libraries required for hive compilation/execution can be controlled by:
export TEZ_HOME=/opt/module/tez-0.9.1    # your Tez extraction directory
export TEZ_JARS=""

# collect every jar in the top level of $TEZ_HOME ...
for jar in `ls $TEZ_HOME | grep jar`;
do
    export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/$jar
done
# ... and every jar under $TEZ_HOME/lib
for jar in `ls $TEZ_HOME/lib`;
do
    export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/lib/$jar
done

# TEZ_JARS starts with ':', so it can be appended directly after the LZO jar
export HIVE_AUX_JARS_PATH=/opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar$TEZ_JARS
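Because hive-env.sh is plain shell, you can source it and inspect the jar list before starting Hive; a sanity-check sketch of mine:
source /opt/module/hive/conf/hive-env.sh
echo $HIVE_AUX_JARS_PATH | tr ':' '\n' | head   # the first few jars Hive will load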
- Add the following to hive-site.xml to switch Hive's execution engine (already added when creating hive-site.xml in step 1 above):
<property>
    <name>hive.execution.engine</name>
    <value>tez</value>
</property>
2.3 Testing
- Start Hive from the /opt/module/hive directory:
bin/hive
- Create a table:
create table student(
id int,
name string);
- Insert data (I got an error at this step; see the notes in section 2.4 for the solution):
insert into student values(1,"ruki");
- Query the table; if no errors are reported, the integration succeeded (a sample check is sketched below).
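A plain select does not launch a job, so to actually exercise the Tez engine you can run an aggregation; a minimal sketch of mine, run non-interactively from /opt/module/hive:
bin/hive -e "select count(*) from student;"   # an aggregation forces a Tez DAG to run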
2.4 Notes
2.4.1 Insert fails after integrating Tez
- When running Tez, the NodeManager detected that the job used too much memory and killed the process:
Caused by: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown.
Application application_1546781144082_0005 failed 2 times due to AM Container for
appattempt_1546781144082_0005_000002 exited with exitCode: -103
For more detailed output, check application tracking page:
http://hadoop103:8088/cluster/app/application_1546781144082_0005
Then, click on links to logs of each attempt.
Diagnostics: Container [pid=11116,containerID=container_1546781144082_0005_02_000001]
is running beyond virtual memory limits.
Current usage: 216.3 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used.
Killing container.
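To dig further, the full container logs can be pulled with the YARN CLI (assuming log aggregation is enabled; the application id is the one from the error above):
yarn logs -applicationId application_1546781144082_0005 | less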
- This means a Container running on a worker node tried to use more memory than allowed and was killed by the NodeManager.
[Excerpt] The NodeManager is killing your container. It sounds like
you are trying to use hadoop streaming which is running as a child
process of the map-reduce task. The NodeManager monitors the entire
process tree of the task and if it eats up more memory than the
maximum set in mapreduce.map.memory.mb or
mapreduce.reduce.memory.mb respectively, we would expect the
Nodemanager to kill the task, otherwise your task is stealing memory
belonging to other containers, which you don't want.
2.4.2 Solution
- Turn off the virtual-memory check by adding the following to yarn-site.xml:
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
- After the change, be sure to distribute the file to every node and restart the Hadoop cluster:
xsync yarn-site.xml
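For the restart, a sketch assuming the standard Hadoop 2.7.2 layout used in this course (run on the node hosting the ResourceManager; stop-yarn.sh and start-yarn.sh live in Hadoop's sbin directory):
/opt/module/hadoop-2.7.2/sbin/stop-yarn.sh    # stop YARN
/opt/module/hadoop-2.7.2/sbin/start-yarn.sh   # start it again so NodeManagers pick up the new setting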