This is article 14 in the Big Data series. Focuses on introducing multiple Hive data import/export methods and practical demonstrations of common HQL query operations, using employee dataset as running example.
Complete illustrated version: CSDN Original | Juejin
Test Dataset
This article uses classic employee table as example, fields as follows:
empno, ename, job, mgr, hiredate, sal, comm, deptno
7369,SMITH,CLERK,7902,1980-12-17,800,,20
7499,ALLEN,SALESMAN,7698,1981-02-20,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02,2975,,20
...
Table creation:
CREATE TABLE emp (
empno INT,
ename STRING,
job STRING,
mgr INT,
hiredate STRING,
sal FLOAT,
comm FLOAT,
deptno INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Data Import Methods
Method 1: LOAD DATA LOCAL (Local File)
Import from client local filesystem, OVERWRITE parameter clears existing data:
-- Append import
LOAD DATA LOCAL INPATH '/home/hadoop/emp.csv' INTO TABLE emp;
-- Overwrite import
LOAD DATA LOCAL INPATH '/home/hadoop/emp.csv' OVERWRITE INTO TABLE emp;
Method 2: LOAD DATA (HDFS File)
Import from HDFS, original file will be moved (not copied) to Hive warehouse directory:
-- First upload file to HDFS
-- hdfs dfs -put /local/emp.csv /user/hadoop/
LOAD DATA INPATH '/user/hadoop/emp.csv' INTO TABLE emp;
Method 3: INSERT … SELECT
Query from other tables and write, supports field mapping:
-- Append write
INSERT INTO TABLE emp_backup
SELECT * FROM emp WHERE deptno = 20;
-- Overwrite write
INSERT OVERWRITE TABLE emp_summary
SELECT deptno, COUNT(*) AS cnt, AVG(sal) AS avg_sal
FROM emp
GROUP BY deptno;
Method 4: External Table Direct Reference
No import needed, table directly references existing data on HDFS:
CREATE EXTERNAL TABLE emp_ext (
empno INT, ename STRING, sal FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/existing/data/';
Method 5: Sqoop Import from MySQL
sqoop import \
--connect jdbc:mysql://localhost:3306/hr \
--username root --password root \
--table emp \
--hive-import \
--hive-table emp \
-m 1
Data Export Methods
Export to Local Directory
INSERT OVERWRITE LOCAL DIRECTORY '/opt/wzk/hive_export/emp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM emp;
Export to HDFS
Simply remove the LOCAL keyword:
INSERT OVERWRITE DIRECTORY '/user/hadoop/export/emp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM emp;
EXPORT Command (Includes Metadata)
-- Export table structure and data to HDFS (for migration or backup)
EXPORT TABLE emp TO '/user/hadoop/backup/emp';
-- Restore
IMPORT TABLE emp_restore FROM '/user/hadoop/backup/emp';
Sqoop Export to MySQL
sqoop export \
--connect jdbc:mysql://localhost:3306/hr \
--username root --password root \
--table emp_result \
--export-dir /user/hive/warehouse/emp \
-m 1
HQL Query Practice
Aggregate Functions
-- Salary statistics
SELECT
SUM(sal) AS total_sal,
MAX(sal) AS max_sal,
MIN(sal) AS min_sal,
AVG(sal) AS avg_sal,
COUNT(*) AS emp_count
FROM emp;
Conditional Filtering
-- Comparison operators
SELECT * FROM emp WHERE sal > 2000;
-- IN multiple value matching
SELECT * FROM emp WHERE deptno IN (10, 20);
-- NULL handling
SELECT * FROM emp WHERE comm IS NULL;
SELECT * FROM emp WHERE comm IS NOT NULL;
-- NULL-safe equality (<=> equals IS NOT DISTINCT FROM)
SELECT * FROM emp WHERE comm <=> NULL;
Grouping Statistics
SELECT
deptno,
job,
COUNT(*) AS cnt,
AVG(sal) AS avg_sal
FROM emp
GROUP BY deptno, job
HAVING AVG(sal) > 1500
ORDER BY deptno, avg_sal DESC;
Sorting and Pagination
-- Global sort (single Reducer)
SELECT * FROM emp ORDER BY sal DESC LIMIT 5;
-- Sort within partition (multiple Reducers, better performance)
SELECT * FROM emp DISTRIBUTE BY deptno SORT BY sal DESC;
-- CLUSTER BY (equivalent to DISTRIBUTE BY x SORT BY x, ascending only)
SELECT * FROM emp CLUSTER BY deptno;
Set Operations
-- UNION ALL (keep duplicate rows)
SELECT ename, sal FROM emp WHERE deptno = 10
UNION ALL
SELECT ename, sal FROM emp WHERE deptno = 20;
-- UNION (deduplicate)
SELECT ename FROM emp WHERE sal > 3000
UNION
SELECT ename FROM emp WHERE job = 'MANAGER';
ORDER BY vs SORT BY vs DISTRIBUTE BY
| Keyword | Reducers | Global Ordered | Use Case |
|---|---|---|---|
ORDER BY | 1 | Yes | Small result sets |
SORT BY | Multiple | No (ordered within partition) | Large data volume local sorting |
DISTRIBUTE BY | Multiple | No | Control data distribution rules |
CLUSTER BY | Multiple | No (equals DISTRIBUTE+SORT) | Distribute and sort by same column |
In production environment, using ORDER BY on large datasets forces single Reducer processing, easily causing performance bottlenecks. Prefer using SORT BY combined with DISTRIBUTE BY.