This is the 14th post in the big data series. It covers the various ways to import and export Hive data, plus hands-on demonstrations of common HQL queries, using an employee dataset as the running example.

Full illustrated version (with screenshots): original post on CSDN | Juejin

Test Dataset

This article uses the classic employee table as its running example, with the following fields:

empno, ename, job, mgr, hiredate, sal, comm, deptno
7369,SMITH,CLERK,7902,1980-12-17,800,,20
7499,ALLEN,SALESMAN,7698,1981-02-20,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02,2975,,20
...

Table DDL:

CREATE TABLE emp (
  empno   INT,
  ename   STRING,
  job     STRING,
  mgr     INT,
  hiredate STRING,
  sal     FLOAT,
  comm    FLOAT,
  deptno  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
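How the delimited text maps to typed columns can be sketched in Python: fields are split on the delimiter, and an empty numeric field (like SMITH's missing comm) is read as NULL. This is an illustrative parser, not Hive's actual SerDe code:

```python
# Minimal sketch of how a comma-delimited emp row maps to typed columns.
# An empty numeric field becomes NULL (None in Python) -- illustrative
# only, not Hive's actual deserializer.

def parse_emp_row(line):
    empno, ename, job, mgr, hiredate, sal, comm, deptno = line.strip().split(",")
    to_int = lambda s: int(s) if s else None      # empty string -> NULL
    to_float = lambda s: float(s) if s else None
    return {
        "empno": to_int(empno), "ename": ename, "job": job,
        "mgr": to_int(mgr), "hiredate": hiredate,
        "sal": to_float(sal), "comm": to_float(comm), "deptno": to_int(deptno),
    }

row = parse_emp_row("7369,SMITH,CLERK,7902,1980-12-17,800,,20")
print(row["comm"])   # SMITH's empty comm field is read as NULL (None)
```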

Data Import Methods

Method 1: LOAD DATA LOCAL (local file)

Import from the client's local filesystem (the file is copied, not moved); the OVERWRITE keyword clears any existing data first:

-- Append
LOAD DATA LOCAL INPATH '/home/hadoop/emp.csv' INTO TABLE emp;

-- Overwrite
LOAD DATA LOCAL INPATH '/home/hadoop/emp.csv' OVERWRITE INTO TABLE emp;

Method 2: LOAD DATA (HDFS file)

Import from HDFS; the source file is moved (not copied) into the Hive warehouse directory:

-- First upload the file to HDFS
-- hdfs dfs -put /local/emp.csv /user/hadoop/

LOAD DATA INPATH '/user/hadoop/emp.csv' INTO TABLE emp;

Method 3: INSERT … SELECT

Query another table and write the results; column mapping is supported (the target table must already exist):

-- Append
INSERT INTO TABLE emp_backup
SELECT * FROM emp WHERE deptno = 20;

-- Overwrite
INSERT OVERWRITE TABLE emp_summary
SELECT deptno, COUNT(*) AS cnt, AVG(sal) AS avg_sal
FROM emp
GROUP BY deptno;

Method 4: External table over existing data

No import step needed; the table points directly at existing data on HDFS:

CREATE EXTERNAL TABLE emp_ext (
  empno INT, ename STRING, sal FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/existing/data/';

Method 5: Sqoop import from MySQL

sqoop import \
  --connect jdbc:mysql://localhost:3306/hr \
  --username root --password root \
  --table emp \
  --hive-import \
  --hive-table emp \
  -m 1

Data Export Methods

Export to a local directory

INSERT OVERWRITE LOCAL DIRECTORY '/opt/wzk/hive_export/emp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM emp;

Export to HDFS

Just drop the LOCAL keyword:

INSERT OVERWRITE DIRECTORY '/user/hadoop/export/emp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM emp;

EXPORT command (includes metadata)

-- Export table definition and data to HDFS (for migration or backup)
EXPORT TABLE emp TO '/user/hadoop/backup/emp';

-- Restore
IMPORT TABLE emp_restore FROM '/user/hadoop/backup/emp';

Sqoop export to MySQL

sqoop export \
  --connect jdbc:mysql://localhost:3306/hr \
  --username root --password root \
  --table emp_result \
  --export-dir /user/hive/warehouse/emp \
  -m 1

HQL Queries in Practice

Aggregate functions

-- Salary statistics
SELECT
  SUM(sal)  AS total_sal,
  MAX(sal)  AS max_sal,
  MIN(sal)  AS min_sal,
  AVG(sal)  AS avg_sal,
  COUNT(*)  AS emp_count
FROM emp;
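For the four sample rows shown above, these aggregates can be checked by hand; a quick sketch of the same arithmetic:

```python
# Verify the aggregate query by hand against the four sample rows
# (SMITH 800, ALLEN 1600, WARD 1250, JONES 2975).
sals = [800.0, 1600.0, 1250.0, 2975.0]

total_sal = sum(sals)
max_sal = max(sals)
min_sal = min(sals)
avg_sal = sum(sals) / len(sals)
emp_count = len(sals)

print(total_sal, max_sal, min_sal, avg_sal, emp_count)
# → 6625.0 2975.0 800.0 1656.25 4
```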

Conditional filtering

-- Comparison operators
SELECT * FROM emp WHERE sal > 2000;

-- IN for multi-value matching
SELECT * FROM emp WHERE deptno IN (10, 20);

-- NULL handling
SELECT * FROM emp WHERE comm IS NULL;
SELECT * FROM emp WHERE comm IS NOT NULL;

-- NULL-safe equality (<=> is equivalent to IS NOT DISTINCT FROM)
SELECT * FROM emp WHERE comm <=> NULL;
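The NULL semantics behind these queries can be illustrated with SQLite (which has no <=>; its NULL-safe operator is spelled IS): a plain = NULL comparison never matches, while IS NULL does. A minimal sketch:

```python
import sqlite3

# Illustrate SQL NULL semantics with SQLite. Note SQLite has no <=>;
# its NULL-safe operator is IS. The behavior shown here matches Hive's
# IS NULL / <=> NULL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, comm REAL)")
con.executemany("INSERT INTO emp VALUES (?, ?)",
                [("SMITH", None), ("ALLEN", 300.0), ("JONES", None)])

# Plain equality never matches NULL: comm = NULL evaluates to NULL, not TRUE.
eq_null = con.execute("SELECT COUNT(*) FROM emp WHERE comm = NULL").fetchone()[0]
# IS NULL (like Hive's comm IS NULL or comm <=> NULL) does match.
is_null = con.execute("SELECT COUNT(*) FROM emp WHERE comm IS NULL").fetchone()[0]

print(eq_null, is_null)  # → 0 2
```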

Grouped statistics

SELECT
  deptno,
  job,
  COUNT(*)    AS cnt,
  AVG(sal)    AS avg_sal
FROM emp
GROUP BY deptno, job
HAVING AVG(sal) > 1500
ORDER BY deptno, avg_sal DESC;

Sorting and pagination

-- Global sort (single reducer)
SELECT * FROM emp ORDER BY sal DESC LIMIT 5;

-- Per-partition sort (multiple reducers, better performance)
SELECT * FROM emp DISTRIBUTE BY deptno SORT BY sal DESC;

-- CLUSTER BY (shorthand for DISTRIBUTE BY x SORT BY x, ascending only)
SELECT * FROM emp CLUSTER BY deptno;
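The distribute-then-sort behavior can be sketched outside Hive: each row is routed to a reducer by its distribute key, and each reducer sorts only its own partition, so the output is ordered within each file but not globally. A rough illustration (the modulo routing is an assumption; Hive's actual partitioner is an implementation detail):

```python
# Rough sketch of "DISTRIBUTE BY deptno SORT BY sal DESC": rows are
# routed to a reducer by the distribute key, then each reducer sorts
# only its own partition. The modulo routing below is illustrative,
# not Hive's real partitioner.
rows = [  # (ename, sal, deptno) from the sample data
    ("SMITH", 800, 20), ("ALLEN", 1600, 30),
    ("WARD", 1250, 30), ("JONES", 2975, 20),
]

num_reducers = 3
partitions = [[] for _ in range(num_reducers)]
for row in rows:
    partitions[row[2] % num_reducers].append(row)   # DISTRIBUTE BY deptno

for part in partitions:
    part.sort(key=lambda r: r[1], reverse=True)     # SORT BY sal DESC, per reducer

for i, part in enumerate(partitions):
    print(f"reducer {i}: {part}")
```

Each reducer's output is internally sorted, but concatenating the files gives no global order, which is exactly the SORT BY trade-off.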

Set operations

-- UNION ALL (keeps duplicate rows)
SELECT ename, sal FROM emp WHERE deptno = 10
UNION ALL
SELECT ename, sal FROM emp WHERE deptno = 20;

-- UNION (deduplicates)
SELECT ename FROM emp WHERE sal > 3000
UNION
SELECT ename FROM emp WHERE job = 'MANAGER';
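The deduplication difference between the two operators can be illustrated with SQLite, which treats UNION and UNION ALL the same way Hive does:

```python
import sqlite3

# Illustrate UNION vs UNION ALL dedup semantics with SQLite; Hive
# behaves the same way for these two set operators.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT)")
con.executemany("INSERT INTO t VALUES (?)", [("KING",), ("JONES",), ("KING",)])

# UNION ALL keeps every row from both sides.
union_all = con.execute(
    "SELECT name FROM t UNION ALL SELECT name FROM t").fetchall()
# UNION removes duplicates across the combined result.
union = con.execute(
    "SELECT name FROM t UNION SELECT name FROM t").fetchall()

print(len(union_all), len(union))  # → 6 2
```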

ORDER BY vs SORT BY vs DISTRIBUTE BY

| Keyword | Reducers | Globally ordered | Use case |
|---|---|---|---|
| ORDER BY | 1 | Yes | small result sets |
| SORT BY | multiple | No (ordered within each partition) | local sorting of large datasets |
| DISTRIBUTE BY | multiple | No (distribution only, no sorting) | controlling how rows are routed to reducers |
| CLUSTER BY | multiple | No (= DISTRIBUTE BY + SORT BY) | distribute and sort by the same column, ascending only |

In production, running ORDER BY on a large dataset forces all data through a single reducer and easily becomes a performance bottleneck; prefer SORT BY combined with DISTRIBUTE BY instead.