This is the 14th post in the big data series. It covers the main ways to import and export data in Hive, along with hands-on demonstrations of common HQL queries, using an employee dataset as the running example.
Test Dataset
This article uses the classic employee table; its fields are:
empno, ename, job, mgr, hiredate, sal, comm, deptno
7369,SMITH,CLERK,7902,1980-12-17,800,,20
7499,ALLEN,SALESMAN,7698,1981-02-20,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02,2975,,20
...
Table creation statement:
CREATE TABLE emp (
empno INT,
ename STRING,
job STRING,
mgr INT,
hiredate STRING,
sal FLOAT,
comm FLOAT,
deptno INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
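Because the row format is comma-delimited text, an empty field (like SMITH's missing comm above) is read as NULL. A minimal Python sketch of how one such line maps to typed columns; the `parse_row` helper and its dict layout are illustrative, not part of Hive:

```python
def parse_row(line):
    """Split one comma-delimited line into typed columns.

    An empty field becomes None, mirroring how Hive reads an
    empty string in a numeric column as NULL.
    """
    empno, ename, job, mgr, hiredate, sal, comm, deptno = line.split(",")
    to_int = lambda s: int(s) if s else None
    to_float = lambda s: float(s) if s else None
    return {
        "empno": to_int(empno), "ename": ename, "job": job,
        "mgr": to_int(mgr), "hiredate": hiredate,
        "sal": to_float(sal), "comm": to_float(comm),
        "deptno": to_int(deptno),
    }

row = parse_row("7369,SMITH,CLERK,7902,1980-12-17,800,,20")
print(row["comm"])  # None: the empty comm field reads as NULL
```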
Data Import Methods
Method 1: LOAD DATA LOCAL (local file)
Imports from the client's local file system; the OVERWRITE keyword replaces any existing data:
-- Append
LOAD DATA LOCAL INPATH '/home/hadoop/emp.csv' INTO TABLE emp;
-- Overwrite
LOAD DATA LOCAL INPATH '/home/hadoop/emp.csv' OVERWRITE INTO TABLE emp;
Method 2: LOAD DATA (HDFS file)
Imports from HDFS; note that the source file is moved (not copied) into the Hive warehouse directory:
-- Upload the file to HDFS first
-- hdfs dfs -put /local/emp.csv /user/hadoop/
LOAD DATA INPATH '/user/hadoop/emp.csv' INTO TABLE emp;
Method 3: INSERT … SELECT
Queries other tables and writes the results; supports column mapping:
-- Append
INSERT INTO TABLE emp_backup
SELECT * FROM emp WHERE deptno = 20;
-- Overwrite
INSERT OVERWRITE TABLE emp_summary
SELECT deptno, COUNT(*) AS cnt, AVG(sal) AS avg_sal
FROM emp
GROUP BY deptno;
Method 4: Direct Reference via External Table
No import step is needed; the table points directly at data already on HDFS:
CREATE EXTERNAL TABLE emp_ext (
empno INT, ename STRING, sal FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/existing/data/';
Method 5: Sqoop Import from MySQL
sqoop import \
--connect jdbc:mysql://localhost:3306/hr \
--username root --password root \
--table emp \
--hive-import \
--hive-table emp \
-m 1
Data Export Methods
Export to a Local Directory
INSERT OVERWRITE LOCAL DIRECTORY '/opt/wzk/hive_export/emp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM emp;
Export to HDFS
Simply drop the LOCAL keyword:
INSERT OVERWRITE DIRECTORY '/user/hadoop/export/emp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM emp;
EXPORT Command (includes metadata)
-- Export table definition and data to HDFS (for migration or backup)
EXPORT TABLE emp TO '/user/hadoop/backup/emp';
-- Restore
IMPORT TABLE emp_restore FROM '/user/hadoop/backup/emp';
Sqoop Export to MySQL
sqoop export \
--connect jdbc:mysql://localhost:3306/hr \
--username root --password root \
--table emp_result \
--export-dir /user/hive/warehouse/emp \
-m 1
HQL Queries in Practice
Aggregate Functions
-- Salary statistics
SELECT
SUM(sal) AS total_sal,
MAX(sal) AS max_sal,
MIN(sal) AS min_sal,
AVG(sal) AS avg_sal,
COUNT(*) AS emp_count
FROM emp;
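To sanity-check these aggregates, a small Python sketch over the four sample rows shown earlier (the salary values come straight from the sample data; this is an illustration, not a Hive API):

```python
# sal values from the four sample rows (SMITH, ALLEN, WARD, JONES)
sals = [800, 1600, 1250, 2975]

total_sal = sum(sals)              # SUM(sal)
max_sal = max(sals)                # MAX(sal)
min_sal = min(sals)                # MIN(sal)
avg_sal = total_sal / len(sals)    # AVG(sal)
emp_count = len(sals)              # COUNT(*)

print(total_sal, max_sal, min_sal, avg_sal, emp_count)
# 6625 2975 800 1656.25 4
```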
Filtering
-- Comparison operators
SELECT * FROM emp WHERE sal > 2000;
-- IN for matching multiple values
SELECT * FROM emp WHERE deptno IN (10, 20);
-- NULL handling
SELECT * FROM emp WHERE comm IS NULL;
SELECT * FROM emp WHERE comm IS NOT NULL;
-- NULL-safe equality (<=> is equivalent to IS NOT DISTINCT FROM)
SELECT * FROM emp WHERE comm <=> NULL;
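The difference between = and <=> only matters when NULL is involved: NULL = NULL evaluates to NULL (so the row is filtered out by WHERE), while NULL <=> NULL is TRUE. A Python sketch of the semantics, using None to stand in for NULL; the function names are illustrative:

```python
def plain_eq(a, b):
    """Hive's = operator: any comparison involving NULL yields NULL."""
    if a is None or b is None:
        return None
    return a == b

def null_safe_eq(a, b):
    """Hive's <=> operator: TRUE when both sides are NULL,
    ordinary equality otherwise; never returns NULL."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

print(plain_eq(None, None))      # None -> row filtered out by WHERE
print(null_safe_eq(None, None))  # True -> row kept
```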
Grouped Aggregation
SELECT
deptno,
job,
COUNT(*) AS cnt,
AVG(sal) AS avg_sal
FROM emp
GROUP BY deptno, job
HAVING AVG(sal) > 1500
ORDER BY deptno, avg_sal DESC;
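The grouped query above can be mimicked in Python over the four sample rows, which makes it concrete that HAVING filters whole groups rather than individual rows (the rows and the 1500 threshold come from the example; the dict-based grouping is just an illustration):

```python
from collections import defaultdict

# (deptno, job, sal) from the four sample rows
rows = [
    (20, "CLERK", 800),
    (30, "SALESMAN", 1600),
    (30, "SALESMAN", 1250),
    (20, "MANAGER", 2975),
]

# GROUP BY deptno, job
groups = defaultdict(list)
for deptno, job, sal in rows:
    groups[(deptno, job)].append(sal)

# HAVING AVG(sal) > 1500 filters groups, not individual rows
result = {
    key: (len(sals), sum(sals) / len(sals))
    for key, sals in groups.items()
    if sum(sals) / len(sals) > 1500
}
print(result)  # only the (20, MANAGER) group survives
```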
Sorting and Pagination
-- Global ordering (single reducer)
SELECT * FROM emp ORDER BY sal DESC LIMIT 5;
-- Ordering within partitions (multiple reducers, better performance)
SELECT * FROM emp DISTRIBUTE BY deptno SORT BY sal DESC;
-- CLUSTER BY (equivalent to DISTRIBUTE BY x SORT BY x, ascending only)
SELECT * FROM emp CLUSTER BY deptno;
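Conceptually, DISTRIBUTE BY hashes rows to reducers and SORT BY orders rows within each reducer, so there is no global order across partitions. A rough Python sketch with three simulated reducers; the modulo routing stands in for Hive's hash partitioner and is only an analogy:

```python
# (deptno, sal) pairs from the four sample rows
rows = [(20, 800), (30, 1600), (30, 1250), (20, 2975)]

num_reducers = 3
partitions = [[] for _ in range(num_reducers)]

# DISTRIBUTE BY deptno: route each row to a reducer by key hash
for deptno, sal in rows:
    partitions[deptno % num_reducers].append((deptno, sal))

# SORT BY sal DESC: order rows within each partition only
for part in partitions:
    part.sort(key=lambda r: r[1], reverse=True)

print(partitions)  # ordered inside each partition, not globally
```

Note that rows with the same deptno always land in the same partition, which is exactly the guarantee CLUSTER BY relies on.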
Set Operations
-- UNION ALL (keeps duplicate rows)
SELECT ename, sal FROM emp WHERE deptno = 10
UNION ALL
SELECT ename, sal FROM emp WHERE deptno = 20;
-- UNION (removes duplicates)
SELECT ename FROM emp WHERE sal > 3000
UNION
SELECT ename FROM emp WHERE job = 'MANAGER';
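The distinction shows up clearly with plain Python lists: UNION ALL is concatenation, UNION is concatenation followed by deduplication. The two input lists below are hypothetical results of the two SELECTs, chosen from the sample names for illustration:

```python
a = ["SMITH", "JONES"]   # e.g. rows matching the first SELECT
b = ["JONES", "ALLEN"]   # e.g. rows matching the second SELECT

union_all = a + b         # UNION ALL: keeps the duplicate JONES
union = sorted(set(a + b))  # UNION: deduplicated (order not guaranteed in Hive)

print(union_all)  # ['SMITH', 'JONES', 'JONES', 'ALLEN']
print(union)      # ['ALLEN', 'JONES', 'SMITH']
```

Because UNION has to deduplicate, it costs an extra aggregation step; prefer UNION ALL when duplicates are impossible or acceptable.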
ORDER BY vs SORT BY vs DISTRIBUTE BY
| Keyword | Reducers | Globally Ordered | Use Case |
|---|---|---|---|
| ORDER BY | 1 | Yes | Small result sets |
| SORT BY | Multiple | No (ordered within each partition) | Partial sorting of large datasets |
| DISTRIBUTE BY | Multiple | No | Controlling how rows are distributed |
| CLUSTER BY | Multiple | No (DISTRIBUTE + SORT on the same column) | Distribute and sort by one column |
In production, ORDER BY on a large dataset forces everything through a single reducer and easily becomes a performance bottleneck; prefer SORT BY combined with DISTRIBUTE BY instead.