This is article 13 in the Big Data series. Systematically introduces Hive DDL and DML operations including database management, internal/external table creation, data loading and querying.
Complete illustrated version: CSDN Original | Juejin
Start Hive
hive # Enter interactive CLI
# or
hiveserver2 # Start HiveServer2 service (for JDBC connections)
DDL — Database Operations
Create Database
-- Default path: /user/hive/warehouse/mydb.db
CREATE DATABASE IF NOT EXISTS mydb;
-- Specify HDFS storage path
CREATE DATABASE IF NOT EXISTS mydb
LOCATION '/user/custom/mydb';
Switch and View Database
USE mydb;
SHOW DATABASES;
DESCRIBE DATABASE mydb;
Delete Database
-- Only delete empty database
DROP DATABASE mydb;
-- Cascade delete (including all tables)
DROP DATABASE mydb CASCADE;
DDL — Table Operations
Internal Table (Managed Table)
Data managed by Hive, DROP TABLE deletes both data and metadata:
CREATE TABLE IF NOT EXISTS emp (
id INT,
name STRING,
salary FLOAT,
dept STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
External Table (External Table)
Data stored outside Hive, DROP TABLE only deletes metadata, HDFS data preserved:
CREATE EXTERNAL TABLE IF NOT EXISTS emp_ext (
id INT,
name STRING,
salary FLOAT,
dept STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/empdata';
Usage Suggestion: Prefer external tables in production to prevent accidental data deletion.
Internal Table vs External Table
| Feature | Internal Table | External Table |
|---|---|---|
| Data Management | Hive manages | User manages |
| DROP Behavior | Deletes data + metadata | Only deletes metadata |
| Data Location | warehouse directory | Custom LOCATION |
| Applicable Scenario | Temporary tables, intermediate results | Production data, shared data |
View and Modify Table
SHOW TABLES;
DESCRIBE emp; -- View column info
DESCRIBE FORMATTED emp; -- View detailed info (storage format, HDFS path)
-- Rename
ALTER TABLE emp RENAME TO employee;
-- Add columns
ALTER TABLE emp ADD COLUMNS (email STRING);
Delete Table
DROP TABLE IF EXISTS emp;
TRUNCATE TABLE emp; -- Only clear data, keep table structure (internal tables only)
DML — Data Operations
Load Local File
-- Load from local filesystem (MOVE, original file not deleted)
LOAD DATA LOCAL INPATH '/home/hadoop/emp.csv' INTO TABLE emp;
-- Overwrite load
LOAD DATA LOCAL INPATH '/home/hadoop/emp.csv' OVERWRITE INTO TABLE emp;
Load from HDFS
-- Load from HDFS (MOVE, original HDFS file is moved)
LOAD DATA INPATH '/user/hadoop/emp.csv' INTO TABLE emp;
INSERT
-- Single record
INSERT INTO TABLE emp VALUES (1, 'Tom', 5000.0, 'Engineering');
-- Insert from query result
INSERT INTO TABLE emp_high_salary
SELECT * FROM emp WHERE salary > 8000;
-- Overwrite
INSERT OVERWRITE TABLE emp_copy
SELECT * FROM emp;
Multi-table Insert (One Scan, Multiple Writes)
FROM emp
INSERT INTO TABLE high_salary SELECT * WHERE salary > 8000
INSERT INTO TABLE low_salary SELECT * WHERE salary <= 8000;
This approach only needs to scan source table once, significantly reducing MapReduce job count.
CTAS (Create Table As Select)
CREATE TABLE emp_copy AS SELECT * FROM emp;
DQL — Query Operations
-- Basic query
SELECT id, name, salary FROM emp WHERE dept = 'Engineering';
-- Sort and pagination
SELECT * FROM emp ORDER BY salary DESC LIMIT 10;
-- Aggregation statistics
SELECT dept,
COUNT(*) AS cnt,
AVG(salary) AS avg_sal,
MAX(salary) AS max_sal
FROM emp
GROUP BY dept
HAVING COUNT(*) > 3;
-- JOIN query
SELECT a.name, b.dept_name
FROM emp a
JOIN dept b ON a.dept = b.dept_id;
Hive Utility Commands
-- Execute Shell commands in Hive CLI
!ls /home/hadoop;
!clear;
-- Execute HDFS commands in Hive CLI
dfs -ls /user/hive/warehouse;
dfs -mkdir /user/test;
-- View built-in function list
SHOW FUNCTIONS;
-- View function usage
DESCRIBE FUNCTION substr;
Common hive-site.xml Configuration
<!-- Use local mode for small datasets, skip YARN submission, speed up debugging -->
<property>
<name>hive.exec.mode.local.auto</name>
<value>true</value>
</property>
<!-- CLI prompt shows current database -->
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
<!-- Query results show column names -->
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
After enabling local mode, queries on small datasets run directly in client JVM without submitting YARN jobs, significantly improving query speed—suitable for development and debugging.