This is article 15 in the Big Data series: a deep dive into the Hive Metastore’s role, the differences among its three deployment modes, and practical steps to configure remote Metastore high availability in production clusters.

Complete illustrated version: CSDN Original | Juejin

What is Metastore

Hive Metastore is the core component for managing metadata, responsible for storing and maintaining:

  • Database, table, column definitions (Schema)
  • Partition information
  • Data storage path mapping in HDFS
  • Table statistics (used for query optimization)
  • SerDe (serialization/deserialization) configuration

Metastore persists metadata to a relational database (MySQL/MariaDB in production) and exposes it to clients through a Thrift interface.
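The relational layout behind this can be sketched with a toy example. The table names below (DBS, TBLS, SDS) mirror real metastore tables, but the columns are heavily simplified for illustration, and SQLite stands in for MySQL:

```python
import sqlite3

# Simplified sketch of the metastore's relational layout (SQLite stands in
# for MySQL; columns are abridged relative to the real schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DBS  (DB_ID INTEGER PRIMARY KEY, NAME TEXT, DB_LOCATION_URI TEXT);
CREATE TABLE SDS  (SD_ID INTEGER PRIMARY KEY, LOCATION TEXT);
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER, SD_ID INTEGER,
                   TBL_NAME TEXT);
""")
conn.execute("INSERT INTO DBS VALUES (1, 'default', 'hdfs:///user/hive/warehouse')")
conn.execute("INSERT INTO SDS VALUES (10, 'hdfs:///user/hive/warehouse/emp')")
conn.execute("INSERT INTO TBLS VALUES (100, 1, 10, 'emp')")

# Resolving a table name to its HDFS path is a join across these tables.
row = conn.execute("""
    SELECT d.NAME, t.TBL_NAME, s.LOCATION
    FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID
                JOIN SDS s ON t.SD_ID = s.SD_ID
    WHERE t.TBL_NAME = 'emp'
""").fetchone()
print(row)  # ('default', 'emp', 'hdfs:///user/hive/warehouse/emp')
```

This is why losing the metadata database effectively loses the warehouse: the data files still sit in HDFS, but the mapping from names to paths is gone.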

Three Deployment Modes

Mode 1: Embedded Mode

HiveServer2 Process
    └── Metastore (Embedded)
          └── Derby (Embedded Database)
  • Derby database runs embedded within Hive process
  • Only supports single user, no concurrent access
  • Metadata stored on local disk (metastore_db/ directory)
  • Only suitable for local single-machine testing, not for multi-node clusters

Mode 2: Local Mode

HiveServer2 Process
    └── Metastore (Same process)
          └── MySQL/MariaDB (Separate Database)
  • Metastore service runs in same JVM process as HiveServer2
  • Metastore automatically starts when HiveServer2 starts
  • Supports multiple users, metadata stored in external MySQL
  • Suitable for development/testing environments

Mode 3: Remote Mode

Multiple Clients
    ↓ Thrift
  Standalone Metastore Service (port 9083)
    └── MySQL/MariaDB
  • Metastore runs as standalone process, provides services through Thrift
  • Clients (HiveServer2, Spark, Impala, etc.) connect via hive.metastore.uris
  • Supports multiple Metastore instances for high availability
  • Metadata storage separated from computation, stable and reliable
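
With two URIs configured, clients fail over when one Metastore instance is down. The sketch below is a simplified illustration of that behavior, not Hive's actual client code (the real logic lives in HiveMetaStoreClient, and the selection policy can vary by Hive version); the connect function is injected so the example is self-contained:

```python
from typing import Callable, Iterable, Tuple

def connect_with_failover(uris: Iterable[str],
                          connect: Callable[[str], object]) -> Tuple[str, object]:
    """Try each Metastore URI in order; return the first that succeeds.
    A simplified model of how clients consume hive.metastore.uris."""
    last_error = None
    for uri in uris:
        try:
            return uri, connect(uri)
        except OSError as e:
            last_error = e  # this instance is down; try the next one
    raise ConnectionError(f"no Metastore reachable: {last_error}")

# Simulated connect function: h121 is "down", h123 answers.
def fake_connect(uri: str):
    if "h121" in uri:
        raise OSError("connection refused")
    return f"session@{uri}"

uri, session = connect_with_failover(
    ["thrift://h121.wzk.icu:9083", "thrift://h123.wzk.icu:9083"],
    fake_connect,
)
print(uri)  # thrift://h123.wzk.icu:9083
```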

Comparison of Three Modes

| Feature | Embedded | Local | Remote |
| --- | --- | --- | --- |
| Concurrent Support | Single user | Multiple users | Multiple users |
| Metadata DB | Derby (embedded) | MySQL (separate) | MySQL (separate) |
| Service Process | Same process | Same process | Separate process |
| High Availability | No | No | Supports multiple instances |
| Use Case | Local testing | Dev testing | Production |

Cluster Environment Configuration (Remote Mode)

Example cluster:

  • h121: Running Metastore service + NameNode
  • h122: Hive client node
  • h123: Running Metastore service + DataNode

h121 and h123 (Metastore Service Nodes) Configuration

On the service nodes, hive-site.xml needs the MySQL connection info and the Metastore configuration:

<configuration>
  <!-- MySQL connection -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://h121.wzk.icu:3306/hive_meta?useSSL=false&amp;characterEncoding=UTF-8</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive123</value>
  </property>

  <!-- Warehouse path -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>

h122 (Client Node) Configuration

The client doesn’t need MySQL connection info; it only needs to point to the remote Metastore addresses:

<configuration>
  <!-- Remote Metastore address (configure two for high availability) -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://h121.wzk.icu:9083,thrift://h123.wzk.icu:9083</value>
  </property>

  <!-- Keep warehouse path consistent -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>
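
A quick way to sanity-check the client config is to parse hive-site.xml and confirm both thrift:// endpoints are present. The helper below is a hypothetical convenience (not part of Hive), using only the standard library; the inline XML stands in for the real file:

```python
import xml.etree.ElementTree as ET

# Stand-in for the client's hive-site.xml on h122.
HIVE_SITE = """
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://h121.wzk.icu:9083,thrift://h123.wzk.icu:9083</value>
  </property>
</configuration>
"""

def metastore_uris(xml_text: str):
    """Extract the comma-separated hive.metastore.uris list from hive-site.xml."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == "hive.metastore.uris":
            return [u.strip() for u in prop.findtext("value").split(",")]
    return []

uris = metastore_uris(HIVE_SITE)
print(uris)  # ['thrift://h121.wzk.icu:9083', 'thrift://h123.wzk.icu:9083']
```

If the list has fewer than two entries, the client has no failover target and a single Metastore outage will take it down.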

Start and Verify

Start Metastore Service (Execute on both h121 and h123)

# Start Metastore service in background
nohup hive --service metastore > /opt/logs/metastore.log 2>&1 &

# Verify port 9083 is listening
lsof -i:9083
# or
netstat -tlnp | grep 9083

Expected output (columns abridged; port listening normally):

COMMAND  PID   USER    TYPE  NODE  NAME
java     1234  hadoop  IPv6  TCP   *:9083 (LISTEN)
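
lsof only works on the node itself. To verify reachability from another machine (e.g. from the client node h122), a small TCP probe does the job; the function below is a hypothetical helper, and the hostnames in the commented example are this article's cluster:

```python
import socket

def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port.
    A remote-friendly stand-in for `lsof -i:9083`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (cluster hosts from this article):
# print(is_listening("h121.wzk.icu", 9083))
# print(is_listening("h123.wzk.icu", 9083))
```

A False here with the service running usually points at a firewall or DNS problem rather than the Metastore itself.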

Client Connection Test (Execute on h122)

hive
-- Test metadata access
SHOW DATABASES;
USE default;
SHOW TABLES;
SELECT * FROM emp LIMIT 5;

If results return normally, remote Metastore is configured successfully.

Common Issues Troubleshooting

Issue 1: Metastore connection timeout

Check whether the Metastore process is alive on h121/h123. Note that plain jps shows the Metastore JVM as RunJar (it is launched through hadoop jar), so match on the command line instead:

ps -ef | grep -v grep | grep -i metastore
# or list full main classes and arguments
jps -ml

Issue 2: Metadata version mismatch

Re-run initialization:

schematool -dbType mysql -initSchema
# If already exists, upgrade
schematool -dbType mysql -upgradeSchema

Issue 3: Metadata inconsistency across nodes

Ensure h121 and h123 connect to the same MySQL instance rather than separate databases. The two Metastore instances then share the same metadata, with consistency guaranteed by MySQL’s concurrency control.

Metastore Sharing with Other Compute Engines

The biggest value of remote Metastore is that it can be shared by multiple compute engines:

Hive  ─┐
Spark ─┤──→ Metastore (thrift://h121:9083)──→ MySQL
Impala─┘

Spark can access the same Hive metadata by configuring spark.hadoop.hive.metastore.uris, enabling cross-engine table sharing. This is the foundation for building a unified data lake.
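
As a minimal PySpark sketch (assuming PySpark is installed and the cluster hostnames from this article), pointing a SparkSession at the same remote Metastore exposes the Hive catalog to Spark; in practice the same effect is often achieved by placing hive-site.xml on Spark's classpath:

```python
from pyspark.sql import SparkSession

# Point Spark at the same remote Metastore the Hive clients use.
spark = (
    SparkSession.builder
    .appName("metastore-sharing-demo")
    .config("hive.metastore.uris",
            "thrift://h121.wzk.icu:9083,thrift://h123.wzk.icu:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Spark now sees the same databases and tables that Hive manages.
spark.sql("SHOW DATABASES").show()
spark.table("default.emp").limit(5).show()
```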