TL;DR
- Scenario: ODS layer landing in offline data warehouse, Hive external table receives logs, daily partition management, JSON fields parsed and expanded as needed
- Conclusion: ODS maintains “source-aligned + partitioned + external table”, prefer json_tuple for batch field extraction, use explode + lateral view for expansion
- Output: ODS table creation template, partition add/delete SQL, daily loading script skeleton, JSON single value/array/expansion query patterns
Overall Structure
ODS Layer
ODS Definition
ODS (Operational Data Store) is an operation-oriented data store focused on supporting an enterprise's day-to-day operations. It typically integrates data from different source systems (ERP, CRM, sales systems, etc.) and cleans and transforms it so it can be queried quickly. ODS is not primarily meant for historical analysis; it serves operations and transaction processing.
ODS Characteristics
- Real-time or near real-time: data in the ODS is typically real-time or near real-time, so it quickly reflects the latest state of the business systems and supports timely decision-making.
- Data integration: the ODS integrates data from multiple source systems into a unified view, processing it with ETL (Extract, Transform, Load) tools.
- Non-historical: ODS data has a relatively short lifecycle; it usually stores only current or recent data and does not retain history for long.
- Operation-oriented: the ODS is designed to support operations and transaction processing, such as quickly looking up a customer's current order status, rather than complex analysis or modeling.
ODS Architecture
ODS usually contains the following main parts:
- Data input layer: extracts raw data from transactional systems (possibly from several different sources) and performs preliminary cleaning and transformation with ETL tools.
- Data storage layer: stores the integrated data, usually organized by subject. Storage can be a relational database (e.g. MySQL, PostgreSQL) or another database that supports fast queries.
- Data access layer: provides fast query interfaces for operational applications; may expose APIs, SQL queries, or direct integration into enterprise applications.
Create ODS Layer
use ods;

create external table ods.ods_start_log (
  `str` string
) comment 'User startup log information'
partitioned by (`dt` string)
location '/user/data/logs/start';
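Daily partitions on this table are managed with ALTER TABLE. A sketch of the add/drop statements (the date literal is only an example):

```sql
-- Add one day's partition (IF NOT EXISTS makes the script idempotent);
-- the directory dt=2023-10-01 is expected under the table LOCATION
alter table ods.ods_start_log add if not exists partition (dt = '2023-10-01');

-- Drop a partition; for an external table this removes only the
-- metadata, the HDFS files stay in place
alter table ods.ods_start_log drop if exists partition (dt = '2023-10-01');

-- Re-sync the metastore with partition directories already on HDFS
msck repair table ods.ods_start_log;
```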
Load ODS Layer Data
#!/bin/bash
APP=ods
source /etc/profile
# Accept a date argument; default to yesterday if none is given
if [ -n "$1" ]; then
  do_date=$1
else
  do_date=$(date -d "-1 day" +%F)
fi
# Register the partition for that date (assumes the log files
# have already been landed under the partition directory)
sql="
alter table ${APP}.ods_start_log add if not exists partition(dt='$do_date');
"
hive -e "$sql"
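The date-defaulting branch of the script can be exercised on its own, without Hive. A minimal sketch (the function name resolve_date is made up for illustration; the `date -d` syntax is GNU date, as in the script):

```shell
# Isolated sketch of the loader's date logic: use $1 if given,
# otherwise fall back to yesterday
resolve_date() {
  if [ -n "$1" ]; then
    echo "$1"
  else
    date -d "-1 day" +%F
  fi
}

resolve_date "2023-10-01"  # prints 2023-10-01
resolve_date               # prints yesterday, e.g. 2024-05-06
```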
JSON Data Processing
Each line of the data file must be one complete JSON string; a JSON string cannot span multiple lines. Hive offers three main ways to process JSON data:
- Use built-in functions get_json_object, json_tuple
- Use custom UDF
- Third-party SerDe
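For the SerDe route, Hive ships a JSON SerDe in its hcatalog package. A sketch of a table that maps JSON keys straight to typed columns (the location path and column names here are illustrative):

```sql
-- Each line of the files under LOCATION must be one JSON object;
-- keys are matched to column names, so no get_json_object is needed
create external table ods.ods_json_demo (
  id           bigint,
  ids          array<bigint>,
  total_number int
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/user/data/logs/json_demo';
```

Depending on the Hive version, the hive-hcatalog-core jar may need to be added to the session before the SerDe class resolves.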
Built-in Functions
get_json_object(string json_string, string path)
- Return value: string
- Description: parses the JSON string json_string and returns the content at path. Returns NULL if the input JSON string is invalid. Each call extracts only one value.
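The path argument follows a JsonPath-like syntax: nested objects are reached with dots and array elements with an index. The literals below are made-up example records:

```sql
-- One value per call
select get_json_object('{"user":{"id":1,"tags":["a","b"]}}', '$.user.id');      -- 1
select get_json_object('{"user":{"id":1,"tags":["a","b"]}}', '$.user.tags[0]'); -- a
select get_json_object('not json', '$.id');                                     -- NULL
```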
json_tuple(jsonStr, k1, k2, …)
- Return value: one string column per input key
- Description: takes a JSON string plus a set of keys k1, k2, … and returns the corresponding values as a tuple. Because it parses the JSON once and extracts multiple keys in a single call, it is more efficient than repeated get_json_object calls.
explode
Use explode to split a complex array or map structure in a single Hive row into multiple rows.
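A minimal illustration with literal values (no table needed):

```sql
-- An array becomes one row per element
select explode(array('a', 'b', 'c'));   -- 3 rows: a / b / c
-- A map becomes one row per entry, with key and value columns
select explode(map('k1', 1, 'k2', 2));  -- 2 rows: (k1,1) / (k2,2)
```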
JSON Data Processing Example
-- Get single-level values
SELECT
  username, age, sex,
  get_json_object(json, '$.id') id,
  get_json_object(json, '$.ids') ids,
  get_json_object(json, '$.total_number') num
FROM jsont1;
-- Use json_tuple to process multiple fields at once
SELECT
  json_tuple(json, 'id', 'ids', 'total_number')
FROM jsont1;
-- Use json_tuple + lateral view to keep the other columns
SELECT username, age, sex, id, ids, num
FROM jsont1
lateral view json_tuple(json, 'id', 'ids', 'total_number') t1
AS id, ids, num;
-- Expand the ids array into one row per element with explode + lateral view
with tmp AS (
  SELECT username, age, sex, id, ids, num
  FROM jsont1
  lateral view json_tuple(json, 'id', 'ids', 'total_number') t1 AS id, ids, num)
SELECT username, age, sex, id, ids1, num
FROM tmp
lateral view explode(split(regexp_replace(ids, '\\[|\\]', ''), ',')) t1 AS ids1;
Error Quick Reference
| Symptom | Root Cause | Locate | Fix |
|---|---|---|---|
| Partition added but queries return no data | Partition directory does not exist or is empty; adding a partition does not import data | show partitions ods_start_log; check the partition list, then check the HDFS directory | Land the data files in the partition directory first |
| No data under LOCATION '/user/data/logs/start', or permission error | HDFS path not created / insufficient permissions | hdfs dfs -ls /user/data/logs/start | Create the directory and grant access: hdfs dfs -mkdir -p …; hdfs dfs -chmod/-chown … |
| External table dropped but data still exists | External table metadata is separate from the data | desc formatted to check Table Type | This is expected external-table semantics: dropping the table does not delete the HDFS files |
| JSON parsing returns all NULLs | Invalid JSON string, or wrong path | Spot-check: select json from jsont1 limit 5; | Ensure every line is one complete JSON string with no extra characters |