Hive

-> Hadoop SQL Engine

Overview

Background

Definition

One key feature: unified metadata management, which makes it easy to switch query engines, e.g. Hive, Spark SQL, Impala, etc.

Architecture

Workflow

hive_arc_1

Modules

hive_arc_2

Hive vs. RDBMS

Hive looks like a SQL database on the surface, but the use cases are completely different: Hive is only suited for batch statistical analysis over large data sets.

hive_compare

Exporting Hive data to an RDBMS: Sqoop -> https://sqoop.apache.org/
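As a sketch, a Sqoop export of the emp table into MySQL might look like the following (the connection URL, credentials, and paths are placeholders that match the setup later in this document; the target MySQL table must already exist with a matching schema):

```
# Export the files of a Hive-managed table on HDFS into a MySQL table
sqoop export \
  --connect jdbc:mysql://hadoop000:3306/hadoop_hive \
  --username root --password 123456 \
  --table emp \
  --export-dir /user/hive/warehouse/emp \
  --input-fields-terminated-by '\t'
```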

Storage Formats

Data Model

Hive includes the following data models:

Database: a namespace, mapped to a directory on HDFS
Table: mapped to a directory under its database's directory
Partition: a sub-directory of the table directory, one per partition-column value
Bucket: a file within a table or partition directory, assigned by hashing a column

Install MySQL

Hive's metadata is usually stored in MySQL, so first install MySQL on the server.

Reference -> https://github.com/chennanni/note-tech/blob/master/db/mysql/index.md

Deploy Hive

1. Download and extract: http://archive.apache.org/dist/hive/

(Note: for CDH builds, the Hive and Hadoop versions should match; for Apache builds, the versions may differ, e.g. hadoop-2.7.2 works with hive-2.3.6.)

2. Add HIVE_HOME to the environment variables

export HIVE_HOME=$HOME/hive/apache-hive-2.3.6-bin
export PATH=$PATH:$HIVE_HOME/bin

3. Edit the configuration files

$HIVE_HOME/conf/hive-env.sh (create it if it does not exist)

export HADOOP_HOME=/root/hadoop/hadoop-2.7.1

$HIVE_HOME/conf/hive-site.xml (create it if it does not exist)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
	<property>
	  <name>javax.jdo.option.ConnectionURL</name>
	  <value>jdbc:mysql://hadoop000:3306/hadoop_hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionDriverName</name>
	  <value>com.mysql.jdbc.Driver</value>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionUserName</name>
	  <value>root</value>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionPassword</name>
	  <value>123456</value>
	</property>
</configuration>

4. Copy the MySQL JDBC driver jar to $HIVE_HOME/lib

Download: https://dev.mysql.com/downloads/connector/j/8.0.html

mysql-connector-java-5.1.48-bin.jar

(Note: this 5.1.x jar matches the com.mysql.jdbc.Driver class configured above; Connector/J 8.x uses com.mysql.cj.jdbc.Driver instead.)

Using Hive

$ hive

> create database test_db;
> show databases;

The syntax is very similar to SQL.

DDL

hive> DROP TABLE invites;
hive> CREATE TABLE invites (foo INT, bar STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> DESCRIBE FORMATTED invites;

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION hdfs_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

Note: /user/hive/warehouse is Hive's default storage path on HDFS.
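Partitions make this directory mapping concrete. As an illustrative sketch (the table and column names below are made up), each partition of a table becomes a sub-directory under the table's warehouse directory:

```
-- Hypothetical partitioned table; the partition column is not stored in the data files
CREATE TABLE emp_part (
  empno int,
  ename string,
  sal double
)
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Data for deptno=10 lands under:
--   /user/hive/warehouse/emp_part/deptno=10/
```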

DML

# Load data
hive> LOAD DATA LOCAL INPATH './examples/files/invites_test.txt' OVERWRITE INTO TABLE invites;
hive> LOAD DATA INPATH '/user/file/emp_test.txt' OVERWRITE INTO TABLE emp;

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
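With a partitioned table, the PARTITION clause selects the target partition. A small sketch, assuming a hypothetical table logs partitioned by a dt string column (table name and file path are placeholders):

```
CREATE TABLE logs (line string)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Loads the local file into the dt='2020-01-01' sub-directory of the table
LOAD DATA LOCAL INPATH '/home/hadoop/data/logs_20200101.txt'
OVERWRITE INTO TABLE logs PARTITION (dt='2020-01-01');
```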

Query

# Query data
select deptno, avg(sal) from emp group by deptno;

# Export query results to a file
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from emp;

# Show the execution plan
EXPLAIN EXTENDED
select e.empno,e.ename,e.sal,e.deptno,d.dname
from emp e join dept d
on e.deptno=d.deptno;

Reference -> https://cwiki.apache.org/confluence/display/HIVE#Home-UserDocumentation

Managed Tables vs. External Tables

MANAGED_TABLE (internal table): on DROP TABLE, both the data on HDFS and the metadata are deleted.

(An ordinary CREATE TABLE statement creates a managed table.)

CREATE TABLE emp(
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal double,
  comm double,
  deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/home/hadoop/data/emp.txt' OVERWRITE INTO TABLE emp;

EXTERNAL_TABLE (external table): on DROP TABLE, the metadata is deleted, but the data on HDFS is kept.

(Add the EXTERNAL keyword when creating the table, and specify where the data is stored.)

CREATE EXTERNAL TABLE emp_external(
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal double,
  comm double,
  deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/external/emp/';

LOAD DATA LOCAL INPATH '/home/hadoop/data/emp.txt' OVERWRITE INTO TABLE emp_external;
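The difference shows up on DROP TABLE. Using the two tables above:

```
DROP TABLE emp;           -- managed: metadata AND /user/hive/warehouse/emp are deleted
DROP TABLE emp_external;  -- external: metadata is deleted, /external/emp/ data is kept
```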

Common Workflows

Scheduled Job Execution

crontab

-> https://www.runoob.com/linux/linux-comm-crontab.html
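For example, a crontab entry that runs a Hive script every night at 2:00 (the script path and log path below are placeholders):

```
# m h dom mon dow  command
0 2 * * * hive -f /home/hadoop/jobs/daily_report.sql >> /tmp/daily_report.log 2>&1
```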

Azkaban

-> https://azkaban.readthedocs.io/en/latest/

References
