TL;DR

  • Scenario: ODS layer landing in offline data warehouse, Hive external table receives logs, daily partition management, JSON fields parsed and expanded as needed
  • Conclusion: ODS maintains “source-aligned + partitioned + external table”, prefer json_tuple for batch field extraction, use explode + lateral view for expansion
  • Output: ODS table creation template, partition add/delete SQL, daily loading script skeleton, JSON single value/array/expansion query patterns

Overall Structure

ODS Layer

ODS Definition

ODS (Operational Data Store) is an operation-oriented data store focused on supporting an enterprise's day-to-day operations. It typically integrates data from different source systems (ERP, CRM, sales systems, etc.), then cleans and transforms it for storage so it can be queried quickly. ODS is not primarily meant for historical data analysis; it serves operations and transaction processing.

ODS Characteristics

  • Real-time or near real-time: data in ODS is typically loaded in real time or near real time, so it reflects the latest state of the business systems and supports timely decision-making.
  • Data integration: ODS integrates data from multiple source systems into a unified view, usually via ETL (Extract, Transform, Load) tools.
  • Non-historical: ODS data has a relatively short lifecycle; it usually holds only current or recent data and does not retain history for long.
  • Operation-oriented: the ODS design focuses on operations and transaction processing, such as quickly querying a customer's current order status, rather than complex analysis or modeling.

ODS Architecture

ODS usually contains the following main parts:

  • Data input layer: extracts raw data from transaction systems (possibly from several different sources) and performs preliminary cleaning and transformation with ETL tools.
  • Data storage layer: stores the integrated data, usually organized by subject. Storage can be a relational database (e.g. MySQL, PostgreSQL) or another store that supports fast queries.
  • Data access layer: provides fast query interfaces for operational applications, e.g. APIs, SQL queries, or direct integration into enterprise applications.

Create ODS Layer

use ods;
create external table ods.ods_start_log(
  `str` string
) comment 'User startup log information'
partitioned by (`dt` string)
location '/user/data/logs/start';
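The TL;DR promises partition add/delete SQL; a minimal sketch against the table above (the date literal is a placeholder):

```sql
-- Register a daily partition; IF NOT EXISTS makes the statement idempotent
alter table ods.ods_start_log add if not exists partition (dt='2024-06-01');

-- Remove a partition; because the table is external, only the metadata is
-- dropped and the files under the partition directory stay on HDFS
alter table ods.ods_start_log drop if exists partition (dt='2024-06-01');

-- Verify
show partitions ods.ods_start_log;
```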

Load ODS Layer Data

#!/bin/bash
APP=ods
source /etc/profile
# A date can be passed as the first argument; otherwise default to yesterday
if [ -n "$1" ]; then
  do_date=$1
else
  do_date=$(date -d "-1 day" +%F)
fi
# Define the SQL to execute
sql="
alter table ${APP}.ods_start_log add if not exists partition (dt='${do_date}');
"
hive -e "$sql"
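Note that adding a partition only registers metadata; the files must already sit in that partition's directory. One way to land files and register the partition in a single statement is LOAD DATA (the staging path below is a hypothetical example):

```sql
-- Moves the staged files into dt=2024-06-01 under the table's LOCATION
-- and adds the partition; no separate ALTER TABLE is needed
load data inpath '/user/data/staging/start/2024-06-01'
into table ods.ods_start_log partition (dt='2024-06-01');
```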

JSON Data Processing

Each line in the data file must be a complete JSON string; a single JSON value cannot span multiple lines. There are three main ways for Hive to process JSON data:

  1. Use built-in functions get_json_object, json_tuple
  2. Use custom UDF
  3. Third-party SerDe
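For option 3, a SerDe maps JSON keys to columns at table-definition time, so queries need no parsing functions at all. A sketch using the JsonSerDe shipped with hive-hcatalog (the table name and columns are illustrative, and the hive-hcatalog jar must be on Hive's classpath):

```sql
create external table ods.ods_event_json (
  id           string,
  ids          array<string>,
  total_number int
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/user/data/logs/event';
-- Each column is populated from the JSON key of the same name:
-- select id, total_number from ods.ods_event_json;
```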

Built-in Functions

get_json_object(string json_string, string path)

  • Return value: string
  • Description: parses the JSON string json_string and returns the content at path. Returns NULL if the input JSON string is invalid. Each call extracts only a single value.

json_tuple(jsonStr, k1, k2, …)

  • Return value: a tuple of strings, one per key
  • Description: takes a JSON string plus a set of keys k1, k2, … and returns their values as a tuple. Because the JSON is parsed only once per call, it is more efficient than repeated get_json_object calls when extracting multiple keys.

explode

Use explode to split a complex array or map structure in a single Hive row into multiple rows.
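A minimal standalone sketch (the array literal is illustrative):

```sql
-- One input row, three output rows: 'a', 'b', 'c'
SELECT explode(array('a', 'b', 'c')) AS item;
```

explode on its own cannot be mixed with other select columns; to keep other columns alongside the exploded values, pair it with lateral view as in the examples below.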

JSON Data Processing Example

-- Get single-layer value
SELECT
  username, age, sex,
  get_json_object(json, "$.id") id,
  get_json_object(json, "$.ids") ids,
  get_json_object(json, "$.total_number") num
FROM jsont1;
-- Use json_tuple to process multiple fields at once
SELECT
  json_tuple(json, 'id', 'ids', 'total_number')
FROM jsont1;
-- Use explode + lateral view
SELECT username, age, sex, id, ids, num
FROM jsont1
lateral view json_tuple(json, 'id', 'ids', 'total_number') t1
AS id, ids, num;

with tmp AS (
SELECT username, age, sex, id, ids, num
FROM jsont1
lateral view json_tuple(json, 'id', 'ids', 'total_number') t1 AS id, ids, num)
SELECT username, age, sex, id, ids1, num
FROM tmp
lateral view explode(split(regexp_replace(ids, "\\[|\\]", ""), ",")) t1 AS ids1;
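For reproducibility, a sketch of the jsont1 table the queries above assume; the schema and sample row are inferred from the column names used, not given by the source:

```sql
create table jsont1 (
  username string,
  age      int,
  sex      string,
  json     string
);

insert into jsont1 values
  ('alice', 25, 'F', '{"id":1001,"ids":[101,102,103],"total_number":3}');
-- get_json_object(json, '$.ids') then returns the string '[101,102,103]',
-- which the regexp_replace/split/explode pipeline above turns into rows.
```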

Error Quick Reference

| Symptom | Root cause | How to locate | Fix |
| --- | --- | --- | --- |
| Partition added but no data found | Partition directory doesn't exist or is empty; adding a partition does not import data | show partitions ods_start_log; then check the partition directory on HDFS | Land the data in the corresponding partition directory first |
| No data under LOCATION '/user/data/logs/start', or permission error | HDFS path not created / insufficient permissions | hdfs dfs -ls /user/data/logs/start | Create the directory and grant access: hdfs dfs -mkdir -p …; hdfs dfs -chmod/-chown … |
| External table dropped but data still exists | External-table metadata is separate from the data | desc formatted, check Table Type | This is expected external-table semantics: dropping the table does not delete the HDFS files |
| JSON parsing returns all NULLs | Invalid JSON string; wrong path | Spot check: select json from jsont1 limit 5; | Ensure each line is a complete JSON string with no extra characters |