This is article 14 in the Big Data series. Focuses on introducing multiple Hive data import/export methods and practical demonstrations of common HQL query operations, using employee dataset as running example.

Complete illustrated version: CSDN Original | Juejin

Test Dataset

This article uses classic employee table as example, fields as follows:

empno, ename, job, mgr, hiredate, sal, comm, deptno
7369,SMITH,CLERK,7902,1980-12-17,800,,20
7499,ALLEN,SALESMAN,7698,1981-02-20,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02,2975,,20
...

Table creation:

CREATE TABLE emp (
  empno   INT,
  ename   STRING,
  job     STRING,
  mgr     INT,
  hiredate STRING,
  sal     FLOAT,
  comm    FLOAT,
  deptno  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Data Import Methods

Method 1: LOAD DATA LOCAL (Local File)

Import from client local filesystem, OVERWRITE parameter clears existing data:

-- Append import
LOAD DATA LOCAL INPATH '/home/hadoop/emp.csv' INTO TABLE emp;

-- Overwrite import
LOAD DATA LOCAL INPATH '/home/hadoop/emp.csv' OVERWRITE INTO TABLE emp;

Method 2: LOAD DATA (HDFS File)

Import from HDFS, original file will be moved (not copied) to Hive warehouse directory:

-- First upload file to HDFS
-- hdfs dfs -put /local/emp.csv /user/hadoop/

LOAD DATA INPATH '/user/hadoop/emp.csv' INTO TABLE emp;

Method 3: INSERT … SELECT

Query from other tables and write, supports field mapping:

-- Append write
INSERT INTO TABLE emp_backup
SELECT * FROM emp WHERE deptno = 20;

-- Overwrite write
INSERT OVERWRITE TABLE emp_summary
SELECT deptno, COUNT(*) AS cnt, AVG(sal) AS avg_sal
FROM emp
GROUP BY deptno;

Method 4: External Table Direct Reference

No import needed, table directly references existing data on HDFS:

CREATE EXTERNAL TABLE emp_ext (
  empno INT, ename STRING, sal FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/existing/data/';

Method 5: Sqoop Import from MySQL

sqoop import \
  --connect jdbc:mysql://localhost:3306/hr \
  --username root --password root \
  --table emp \
  --hive-import \
  --hive-table emp \
  -m 1

Data Export Methods

Export to Local Directory

INSERT OVERWRITE LOCAL DIRECTORY '/opt/wzk/hive_export/emp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM emp;

Export to HDFS

Simply remove the LOCAL keyword:

INSERT OVERWRITE DIRECTORY '/user/hadoop/export/emp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM emp;

EXPORT Command (Includes Metadata)

-- Export table structure and data to HDFS (for migration or backup)
EXPORT TABLE emp TO '/user/hadoop/backup/emp';

-- Restore
IMPORT TABLE emp_restore FROM '/user/hadoop/backup/emp';

Sqoop Export to MySQL

sqoop export \
  --connect jdbc:mysql://localhost:3306/hr \
  --username root --password root \
  --table emp_result \
  --export-dir /user/hive/warehouse/emp \
  -m 1

HQL Query Practice

Aggregate Functions

-- Salary statistics
SELECT
  SUM(sal)  AS total_sal,
  MAX(sal)  AS max_sal,
  MIN(sal)  AS min_sal,
  AVG(sal)  AS avg_sal,
  COUNT(*)  AS emp_count
FROM emp;

Conditional Filtering

-- Comparison operators
SELECT * FROM emp WHERE sal > 2000;

-- IN multiple value matching
SELECT * FROM emp WHERE deptno IN (10, 20);

-- NULL handling
SELECT * FROM emp WHERE comm IS NULL;
SELECT * FROM emp WHERE comm IS NOT NULL;

-- NULL-safe equality (<=> equals IS NOT DISTINCT FROM)
SELECT * FROM emp WHERE comm <=> NULL;

Grouping Statistics

SELECT
  deptno,
  job,
  COUNT(*)    AS cnt,
  AVG(sal)    AS avg_sal
FROM emp
GROUP BY deptno, job
HAVING AVG(sal) > 1500
ORDER BY deptno, avg_sal DESC;

Sorting and Pagination

-- Global sort (single Reducer)
SELECT * FROM emp ORDER BY sal DESC LIMIT 5;

-- Sort within partition (multiple Reducers, better performance)
SELECT * FROM emp DISTRIBUTE BY deptno SORT BY sal DESC;

-- CLUSTER BY (equivalent to DISTRIBUTE BY x SORT BY x, ascending only)
SELECT * FROM emp CLUSTER BY deptno;

Set Operations

-- UNION ALL (keep duplicate rows)
SELECT ename, sal FROM emp WHERE deptno = 10
UNION ALL
SELECT ename, sal FROM emp WHERE deptno = 20;

-- UNION (deduplicate)
SELECT ename FROM emp WHERE sal > 3000
UNION
SELECT ename FROM emp WHERE job = 'MANAGER';

ORDER BY vs SORT BY vs DISTRIBUTE BY

KeywordReducersGlobal OrderedUse Case
ORDER BY1YesSmall result sets
SORT BYMultipleNo (ordered within partition)Large data volume local sorting
DISTRIBUTE BYMultipleNoControl data distribution rules
CLUSTER BYMultipleNo (equals DISTRIBUTE+SORT)Distribute and sort by same column

In production environment, using ORDER BY on large datasets forces single Reducer processing, easily causing performance bottlenecks. Prefer using SORT BY combined with DISTRIBUTE BY.