This is article 15 in the Big Data series: a deep dive into the Hive Metastore’s role, the differences among its three deployment modes, and practical steps to configure remote Metastore high availability in production clusters.

Complete illustrated version: CSDN Original | Juejin

What is Metastore

Hive Metastore is the core component for managing metadata, responsible for storing and maintaining:

  • Database, table, column definitions (Schema)
  • Partition information
  • Data storage path mapping in HDFS
  • Table statistics (used for query optimization)
  • SerDe (serialization/deserialization) configuration

Metastore persists metadata to a relational database (MySQL/MariaDB in production) and exposes it to clients through a Thrift interface.
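The relational layout behind this can be sketched with a toy example. The table names below (DBS, TBLS, SDS) mirror real metastore tables, but the columns are heavily simplified for illustration, and SQLite stands in for MySQL:

```python
import sqlite3

# Simplified sketch of the metastore's relational layout (SQLite stands in
# for MySQL; columns are abridged relative to the real schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DBS  (DB_ID INTEGER PRIMARY KEY, NAME TEXT, DB_LOCATION_URI TEXT);
CREATE TABLE SDS  (SD_ID INTEGER PRIMARY KEY, LOCATION TEXT);
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER, SD_ID INTEGER,
                   TBL_NAME TEXT);
""")
conn.execute("INSERT INTO DBS VALUES (1, 'default', 'hdfs:///user/hive/warehouse')")
conn.execute("INSERT INTO SDS VALUES (10, 'hdfs:///user/hive/warehouse/emp')")
conn.execute("INSERT INTO TBLS VALUES (100, 1, 10, 'emp')")

# Resolving a table name to its HDFS path is a join across these tables.
row = conn.execute("""
    SELECT d.NAME, t.TBL_NAME, s.LOCATION
    FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID
                JOIN SDS s ON t.SD_ID = s.SD_ID
    WHERE t.TBL_NAME = 'emp'
""").fetchone()
print(row)  # ('default', 'emp', 'hdfs:///user/hive/warehouse/emp')
```

This is why losing the metadata database effectively loses the warehouse: the data files still sit in HDFS, but the mapping from names to paths is gone.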

Three Deployment Modes

Mode 1: Embedded Mode

HiveServer2 Process
    └── Metastore (Embedded)
          └── Derby (Embedded Database)
  • Derby database runs embedded within Hive process
  • Only supports single user, no concurrent access
  • Metadata stored on local disk (metastore_db/ directory)
  • Only suitable for local single-machine testing, not for multi-node clusters

Mode 2: Local Mode

HiveServer2 Process
    └── Metastore (Same process)
          └── MySQL/MariaDB (Separate Database)
  • Metastore service runs in same JVM process as HiveServer2
  • Metastore automatically starts when HiveServer2 starts
  • Supports multiple users, metadata stored in external MySQL
  • Suitable for development/testing environments

Mode 3: Remote Mode

Multiple Clients
    ↓ Thrift
  Standalone Metastore Service (port 9083)
    └── MySQL/MariaDB
  • Metastore runs as standalone process, provides services through Thrift
  • Clients (HiveServer2, Spark, Impala, etc.) connect via hive.metastore.uris
  • Supports multiple Metastore instances for high availability
  • Metadata storage separated from computation, stable and reliable
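
With two URIs configured, clients fail over when one Metastore instance is down. The sketch below is a simplified illustration of that behavior, not Hive's actual client code (the real logic lives in HiveMetaStoreClient, and the selection policy can vary by Hive version); the connect function is injected so the example is self-contained:

```python
from typing import Callable, Iterable, Tuple

def connect_with_failover(uris: Iterable[str],
                          connect: Callable[[str], object]) -> Tuple[str, object]:
    """Try each Metastore URI in order; return the first that succeeds.
    A simplified model of how clients consume hive.metastore.uris."""
    last_error = None
    for uri in uris:
        try:
            return uri, connect(uri)
        except OSError as e:
            last_error = e  # this instance is down; try the next one
    raise ConnectionError(f"no Metastore reachable: {last_error}")

# Simulated connect function: h121 is "down", h123 answers.
def fake_connect(uri: str):
    if "h121" in uri:
        raise OSError("connection refused")
    return f"session@{uri}"

uri, session = connect_with_failover(
    ["thrift://h121.wzk.icu:9083", "thrift://h123.wzk.icu:9083"],
    fake_connect,
)
print(uri)  # thrift://h123.wzk.icu:9083
```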

Comparison of Three Modes

| Feature | Embedded | Local | Remote |
| --- | --- | --- | --- |
| Concurrent Support | Single user | Multiple users | Multiple users |
| Metadata DB | Derby (embedded) | MySQL (separate) | MySQL (separate) |
| Service Process | Same process | Same process | Separate process |
| High Availability | No | No | Supports multiple instances |
| Use Case | Local testing | Dev testing | Production |

Cluster Environment Configuration (Remote Mode)

Example cluster:

  • h121: Running Metastore service + NameNode
  • h122: Hive client node
  • h123: Running Metastore service + DataNode

h121 and h123 (Metastore Service Nodes) Configuration

On the service nodes, hive-site.xml needs the MySQL connection info and the Metastore configuration:

<configuration>
  <!-- MySQL connection -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://h121.wzk.icu:3306/hive_meta?useSSL=false&amp;characterEncoding=UTF-8</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive123</value>
  </property>

  <!-- Warehouse path -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>

h122 (Client Node) Configuration

The client doesn’t need MySQL connection info; it only needs to point to the remote Metastore addresses:

<configuration>
  <!-- Remote Metastore address (configure two for high availability) -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://h121.wzk.icu:9083,thrift://h123.wzk.icu:9083</value>
  </property>

  <!-- Keep warehouse path consistent -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>
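
A quick way to sanity-check the client config is to parse hive-site.xml and confirm both thrift:// endpoints are present. The helper below is a hypothetical convenience (not part of Hive), using only the standard library; the inline XML stands in for the real file:

```python
import xml.etree.ElementTree as ET

# Stand-in for the client's hive-site.xml on h122.
HIVE_SITE = """
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://h121.wzk.icu:9083,thrift://h123.wzk.icu:9083</value>
  </property>
</configuration>
"""

def metastore_uris(xml_text: str):
    """Extract the comma-separated hive.metastore.uris list from hive-site.xml."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == "hive.metastore.uris":
            return [u.strip() for u in prop.findtext("value").split(",")]
    return []

uris = metastore_uris(HIVE_SITE)
print(uris)  # ['thrift://h121.wzk.icu:9083', 'thrift://h123.wzk.icu:9083']
```

If the list has fewer than two entries, the client has no failover target and a single Metastore outage will take it down.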

Start and Verify

Start Metastore Service (Execute on both h121 and h123)

# Start Metastore service in background
nohup hive --service metastore > /opt/logs/metastore.log 2>&1 &

# Verify port 9083 is listening
lsof -i:9083
# or
netstat -tlnp | grep 9083

Expected output (columns abridged; port listening normally):

COMMAND  PID   USER    TYPE  NODE  NAME
java     1234  hadoop  IPv6  TCP   *:9083 (LISTEN)
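
lsof only works on the node itself. To verify reachability from another machine (e.g. from the client node h122), a small TCP probe does the job; the function below is a hypothetical helper, and the hostnames in the commented example are this article's cluster:

```python
import socket

def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port.
    A remote-friendly stand-in for `lsof -i:9083`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (cluster hosts from this article):
# print(is_listening("h121.wzk.icu", 9083))
# print(is_listening("h123.wzk.icu", 9083))
```

A False here with the service running usually points at a firewall or DNS problem rather than the Metastore itself.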

Client Connection Test (Execute on h122)

hive
-- Test metadata access
SHOW DATABASES;
USE default;
SHOW TABLES;
SELECT * FROM emp LIMIT 5;

If results return normally, remote Metastore is configured successfully.

Common Issues Troubleshooting

Issue 1: Metastore connection timeout

Check whether the Metastore process is alive on h121/h123. Note that plain jps shows the Metastore JVM as RunJar (it is launched through hadoop jar), so match on the command line instead:

ps -ef | grep -v grep | grep -i metastore
# or list full main classes and arguments
jps -ml

Issue 2: Metadata version mismatch

Re-run initialization:

schematool -dbType mysql -initSchema
# If already exists, upgrade
schematool -dbType mysql -upgradeSchema

Issue 3: Metadata inconsistency across nodes

Ensure h121 and h123 connect to the same MySQL instance rather than separate databases. The two Metastore instances then share the same metadata, with consistency guaranteed by MySQL’s concurrency control.

Metastore Sharing with Other Compute Engines

The biggest value of remote Metastore is that it can be shared by multiple compute engines:

Hive  ─┐
Spark ─┤──→ Metastore (thrift://h121:9083)──→ MySQL
Impala─┘

Spark can access the same Hive metadata by configuring spark.hadoop.hive.metastore.uris, enabling cross-engine table sharing. This is the foundation for building a unified data lake.
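
As a minimal PySpark sketch (assuming PySpark is installed and the cluster hostnames from this article), pointing a SparkSession at the same remote Metastore exposes the Hive catalog to Spark; in practice the same effect is often achieved by placing hive-site.xml on Spark's classpath:

```python
from pyspark.sql import SparkSession

# Point Spark at the same remote Metastore the Hive clients use.
spark = (
    SparkSession.builder
    .appName("metastore-sharing-demo")
    .config("hive.metastore.uris",
            "thrift://h121.wzk.icu:9083,thrift://h123.wzk.icu:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Spark now sees the same databases and tables that Hive manages.
spark.sql("SHOW DATABASES").show()
spark.table("default.emp").limit(5).show()
```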