This is article 15 in the Big Data series. Deep dive into Hive Metastore’s role, differences between three deployment modes, and practical steps to configure remote Metastore high availability in production clusters.
Complete illustrated version: CSDN Original | Juejin
What is Metastore
Hive Metastore is the core component for managing metadata, responsible for storing and maintaining:
- Database, table, column definitions (Schema)
- Partition information
- Data storage path mapping in HDFS
- Table statistics (used for query optimization)
- SerDe (serialization/deserialization) configuration
Metastore persists metadata to a relational database (MySQL/MariaDB in production) and exposes it to clients through a Thrift interface.
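Under the hood, this metadata sits in ordinary relational tables. Assuming a metastore database named `hive_meta` (the name used later in this article), the table-to-HDFS-path mapping can be inspected directly with SQL; the table and column names below come from the standard Hive metastore schema:

```sql
-- List every Hive table with its database and HDFS location
SELECT d.NAME AS db_name,
       t.TBL_NAME,
       s.LOCATION
FROM   TBLS t
JOIN   DBS  d ON t.DB_ID = d.DB_ID
JOIN   SDS  s ON t.SD_ID = s.SD_ID;
```

This is read-only inspection; never modify these tables by hand, always go through the Metastore API.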
Three Deployment Modes
Mode 1: Embedded Mode
```
HiveServer2 Process
└── Metastore (Embedded)
    └── Derby (Embedded Database)
```
- Derby database runs embedded within Hive process
- Only supports single user, no concurrent access
- Metadata stored on local disk (`metastore_db/` directory)
- Only suitable for local single-machine testing, not for multi-node clusters
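For reference, embedded mode corresponds to Hive's default Derby settings; the fragment below is a sketch of what Hive effectively uses when no external database is configured:

```xml
<!-- Default embedded-Derby connection (created under metastore_db/ in the working directory) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
```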
Mode 2: Local Mode
```
HiveServer2 Process
└── Metastore (Same process)
    └── MySQL/MariaDB (Separate Database)
```
- Metastore service runs in same JVM process as HiveServer2
- Metastore automatically starts when HiveServer2 starts
- Supports multiple users, metadata stored in external MySQL
- Suitable for development/testing environments
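What selects local mode is simply the absence of a remote address: with the MySQL connection properties set and `hive.metastore.uris` left empty, HiveServer2 runs the Metastore inside its own JVM. A minimal sketch:

```xml
<!-- Empty (or absent) hive.metastore.uris means: run the Metastore in-process -->
<property>
  <name>hive.metastore.uris</name>
  <value></value>
</property>
```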
Mode 3: Remote Mode — Production Recommended
```
Multiple Clients
    ↓ Thrift
Standalone Metastore Service (port 9083)
└── MySQL/MariaDB
```
- Metastore runs as standalone process, provides services through Thrift
- Clients (HiveServer2, Spark, Impala, etc.) connect via `hive.metastore.uris`
- Supports multiple Metastore instances for high availability
- Metadata storage separated from computation, stable and reliable
Comparison of Three Modes
| Feature | Embedded | Local | Remote |
|---|---|---|---|
| Concurrent Support | Single user | Multiple users | Multiple users |
| Metadata DB | Derby (embedded) | MySQL (separate) | MySQL (separate) |
| Service Process | Same process | Same process | Separate process |
| High Availability | No | No | Supports multiple instances |
| Use Case | Local testing | Dev testing | Production |
Cluster Environment Configuration (Remote Mode)
Example cluster:
- h121: Running Metastore service + NameNode
- h122: Hive client node
- h123: Running Metastore service + DataNode
h121 and h123 (Metastore Service Nodes) Configuration
hive-site.xml needs MySQL connection info and Metastore configuration:
```xml
<configuration>
  <!-- MySQL connection -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <!-- Note: & must be escaped as &amp; inside XML -->
    <value>jdbc:mysql://h121.wzk.icu:3306/hive_meta?useSSL=false&amp;characterEncoding=UTF-8</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive123</value>
  </property>
  <!-- Warehouse path -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>
```
h122 (Client Node) Configuration
The client does not need the MySQL connection info; it only points to the remote Metastore addresses:
```xml
<configuration>
  <!-- Remote Metastore addresses (configure two for high availability) -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://h121.wzk.icu:9083,thrift://h123.wzk.icu:9083</value>
  </property>
  <!-- Keep warehouse path consistent -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>
```
Start and Verify
Start Metastore Service (Execute on both h121 and h123)
```shell
# Start Metastore service in the background
nohup hive --service metastore > /opt/logs/metastore.log 2>&1 &

# Verify port 9083 is listening
lsof -i:9083
# or
netstat -tlnp | grep 9083
```
Expected output (port listening normally, values illustrative):

```
COMMAND  PID   USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
java     1234  hadoop 512u IPv6 54321  0t0      TCP  *:9083 (LISTEN)
```
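With two URIs configured, it helps to check each endpoint separately. The helper below is a sketch: it splits the `hive.metastore.uris` value from the client configuration into host/port pairs, which can then be fed to `nc` or `lsof` on each node:

```shell
#!/usr/bin/env bash
# Split a hive.metastore.uris value into "host port" pairs
uris="thrift://h121.wzk.icu:9083,thrift://h123.wzk.icu:9083"

echo "$uris" | tr ',' '\n' | while read -r uri; do
  hostport="${uri#thrift://}"   # strip the thrift:// scheme
  host="${hostport%:*}"         # everything before the last colon
  port="${hostport##*:}"        # everything after the last colon
  echo "$host $port"
  # e.g. probe with: nc -z "$host" "$port"
done
```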
Client Connection Test (Execute on h122)
```shell
hive
```

```sql
-- Test metadata access
SHOW DATABASES;
USE default;
SHOW TABLES;
SELECT * FROM emp LIMIT 5;
```
If results return normally, remote Metastore is configured successfully.
Troubleshooting Common Issues
Issue 1: Metastore connection timeout
Check if Metastore process is alive on h121/h123:
```shell
# A Metastore started via `hive --service metastore` often shows up in jps as RunJar,
# so match on the command-line arguments rather than the class name
jps -m | grep -i metastore
```
Issue 2: Metadata version mismatch
Re-run initialization:
```shell
schematool -dbType mysql -initSchema
# If the schema already exists, upgrade it instead
schematool -dbType mysql -upgradeSchema
```
Issue 3: Metadata inconsistency across nodes
Ensure h121 and h123 connect to the same MySQL instance rather than separate databases. The two Metastore instances then share the same metadata, with consistency guaranteed by MySQL's concurrency control.
Metastore Sharing with Other Compute Engines
The biggest value of remote Metastore is that it can be shared by multiple compute engines:
```
Hive  ─┐
Spark ─┤──→ Metastore (thrift://h121:9083) ──→ MySQL
Impala─┘
```
Spark can access the same Hive metadata by configuring spark.hadoop.hive.metastore.uris, enabling cross-engine table sharing. This is the foundation for building a unified data lake.
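For Spark specifically, the same pointer can live in `spark-defaults.conf`. This is a sketch: `spark.sql.catalogImplementation hive` enables the Hive catalog, and the hostnames are the example cluster above:

```
spark.sql.catalogImplementation   hive
spark.hadoop.hive.metastore.uris  thrift://h121.wzk.icu:9083,thrift://h123.wzk.icu:9083
```

With these set, Spark SQL should resolve the same databases and tables that Hive sees.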