This is article 12 in the Big Data series. It introduces the basic concepts of the Hive data warehouse, its architecture, and the complete steps to install Hive 2.3.9 on a three-node Hadoop cluster.
What is Hive
Hive is a data warehouse tool built on Hadoop. Its core capability is mapping structured data files to database tables and providing an SQL-like query language (HiveQL); HiveQL statements are automatically converted into MapReduce jobs for execution.
Hive is not a database: it has no storage engine of its own. Data is stored in HDFS, and computation relies on MapReduce, Tez, or Spark.
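As a minimal sketch of the SQL-to-MapReduce idea (the `employee` table here is hypothetical, not part of this tutorial), a grouped aggregation submitted through the CLI is compiled into one or more MapReduce jobs; `EXPLAIN` prints the generated plan without running it:

```shell
# A HiveQL aggregation; Hive compiles it into MapReduce job(s)
# (the employee table is hypothetical)
hive -e "SELECT dept, COUNT(*) AS cnt FROM employee GROUP BY dept;"

# Show the compiled execution plan without executing the query
hive -e "EXPLAIN SELECT dept, COUNT(*) AS cnt FROM employee GROUP BY dept;"
```

Running `EXPLAIN` first is a cheap way to see which stages Hive will generate before paying the minute-level latency of an actual job.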
Core Features
- Write Once Read Many: Suitable for batch offline analysis
- SQL Dialect: HiveQL syntax close to standard SQL, lower learning curve
- Extensible: Supports custom functions (UDF/UDAF/UDTF)
- Fault Tolerant: Relies on HDFS and YARN’s fault tolerance mechanisms
Hive Architecture
Client (CLI / JDBC / Web UI)
↓
HiveServer2 (Thrift Service)
↓
Driver (Parse → Compile → Optimize → Execute)
↓
Metastore (Metadata: Table Structure, Partitions, HDFS Paths)
↓
Execution Engine (MapReduce / Tez / Spark)
↓
HDFS (Actual Data Storage)
| Component | Responsibility |
|---|---|
| Driver | SQL parsing, logical/physical plan generation |
| Metastore | Stores metadata (embedded Derby by default; MySQL/MariaDB in production) |
| HiveServer2 | Exposes JDBC/Thrift interface externally |
| Execution Engine | Submits physical plan to YARN for execution |
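To make the HiveServer2 row concrete: clients usually connect over its JDBC/Thrift interface with Beeline. A sketch, assuming HiveServer2 runs on this cluster's node `h121.wzk.icu` with the default port 10000:

```shell
# Connect to HiveServer2 over JDBC/Thrift with Beeline
# (hostname h121.wzk.icu and default port 10000 are assumptions for this cluster)
beeline -u jdbc:hive2://h121.wzk.icu:10000 -n hive
```

The classic `hive` CLI used later in this article bypasses HiveServer2 and talks to the Driver directly.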
Pros and Cons Analysis
Pros
- Low learning cost, SQL engineers can get started quickly
- Can handle PB-level data (MapReduce scales horizontally)
- Supports UDF, flexible business logic extension
- Unified metadata management, shares Metastore with Spark/Impala
Cons
- HQL expression capability limited, complex iterative computation difficult
- MapReduce execution efficiency low, high latency (minute-level)
- Auto-generated MR code lacks targeted optimization
- Difficult to tune, requires deep understanding of underlying mechanisms
Install Hive 2.3.9
1. Download and Extract
tar -zxvf apache-hive-2.3.9-bin.tar.gz -C /opt/servers/
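If the tarball is not yet on the machine, it can be fetched first. The Apache archive URL below is one common source (verify it against the official download page before use):

```shell
# Download Hive 2.3.9 (URL assumes the Apache archive layout)
wget https://archive.apache.org/dist/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz

# Extract into the servers directory
tar -zxvf apache-hive-2.3.9-bin.tar.gz -C /opt/servers/
```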
2. Configure Environment Variables
Edit /etc/profile, add:
export HIVE_HOME=/opt/servers/apache-hive-2.3.9-bin
export PATH=$PATH:$HIVE_HOME/bin
Apply:
source /etc/profile
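A quick check that the variables took effect (note that the `hive` launcher also expects Hadoop to be installed and on the PATH):

```shell
echo $HIVE_HOME    # should print /opt/servers/apache-hive-2.3.9-bin
hive --version     # should report Hive 2.3.9 (requires Hadoop on the PATH)
```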
3. Install and Configure MariaDB (Store Metadata)
# Install MariaDB
sudo apt install mariadb-server -y
# Configure remote access (edit bind address)
# Edit /etc/mysql/mariadb.conf.d/50-server.cnf
# Change bind-address = 127.0.0.1 to 0.0.0.0
# Restart MariaDB so the change takes effect
sudo systemctl restart mariadb
# Create Hive metadata database and user
mysql -u root -p
CREATE DATABASE hive_meta DEFAULT CHARACTER SET utf8;
CREATE USER 'hive'@'%' IDENTIFIED BY 'hive123';
GRANT ALL ON hive_meta.* TO 'hive'@'%';
FLUSH PRIVILEGES;
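A quick way to confirm the account works is to log in as the new user. The hostname below matches the ConnectionURL used later in hive-site.xml; adjust it for your cluster:

```shell
# Verify the hive account can reach the metadata database
# (h121.wzk.icu is this tutorial's metastore host)
mysql -u hive -phive123 -h h121.wzk.icu -e "SHOW DATABASES;"
# hive_meta should appear in the output
```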
4. Configure hive-site.xml
Create hive-site.xml in $HIVE_HOME/conf/:
<configuration>
  <!-- Metadata database connection -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://h121.wzk.icu:3306/hive_meta?useSSL=false&amp;characterEncoding=UTF-8</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive123</value>
  </property>
  <!-- Hive data warehouse path -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <!-- CLI display optimization -->
  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
  </property>
</configuration>
Note: the & in the JDBC URL must be written as &amp; inside the XML file, otherwise the file fails to parse.
5. Install MySQL JDBC Driver
Copy mysql-connector-java-5.1.x.jar to $HIVE_HOME/lib/.
6. Initialize Metadata Schema
schematool -dbType mysql -initSchema
On success, schematool creates about 70 Hive metadata tables in the hive_meta database in MariaDB.
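The result can be inspected directly in MariaDB; the credentials follow step 3. Well-known metastore tables such as DBS (databases) and TBLS (tables) should be listed:

```shell
# Inspect the freshly created metastore tables
mysql -u hive -phive123 -e "USE hive_meta; SHOW TABLES;"
# Expect tables such as DBS and TBLS in the output
```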
7. Verify Installation
hive
# Enter Hive CLI
hive> show databases;
# Output: default
Data Warehouse Default Path
Hive defaults to storing data in HDFS under /user/hive/warehouse/:
- Database: /user/hive/warehouse/<db_name>.db/
- Table: /user/hive/warehouse/<db_name>.db/<table_name>/
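A quick smoke test of this layout (the database name test_db is illustrative): create a database, then look for the directory Hive made for it in HDFS.

```shell
# Create a database through Hive
hive -e "CREATE DATABASE IF NOT EXISTS test_db;"

# List the warehouse root in HDFS
hdfs dfs -ls /user/hive/warehouse/
# A test_db.db directory should appear
```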
The next article will cover Hive DDL/DML operations, including database/table creation and data import.