TL;DR
- Scenario: processing JSON fields in a Hive offline data warehouse; we need to extract arrays and explode them into multiple rows, and also support loading pure JSON files directly into tables
- Conclusion: for nested/array fields, prefer a UDF + explode; for pure JSON files, prefer JsonSerDe; for simple flat fields, use json_tuple / get_json_object
- Output: a reusable Java UDF (JSON array → string array) plus three groups of directly runnable Hive SQL templates
JSON Data Processing
Use UDF to Process
Write a custom UDF to parse the array inside a JSON string:
- Input: the JSON string and the array's key
- Output: a string array
UDF (User Defined Function)
A UDF is a user-defined function used to extend the functionality of big data processing systems. With UDFs, users can implement business-specific logic for data transformation or calculation.
Extensibility
Although big data engines such as Hive and Spark SQL ship with rich built-in functions, real business scenarios often involve special requirements that built-in functions cannot meet. The UDF mechanism lets users plug in custom logic, greatly extending data processing capabilities.
Flexibility
Depending on the engine, UDFs can be written in several languages, e.g. Java, Python, Scala, or JavaScript; Hive's standard UDF interface, used below, is Java-based.
Main Types
- Regular UDF: one row in, one row out
- UDAF: many rows in, one row out (aggregate functions)
- UDTF: one row in, many rows out; typically used with LATERAL VIEW
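The one-row-in, many-rows-out behavior of a UDTF like explode() can be modeled as flatMap over a collection. A minimal sketch (the `ExplodeAnalogy` class and sample values are illustrative, not part of the original project):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ExplodeAnalogy {
    // "Explode" a column of arrays: every array element becomes its own row
    public static List<Integer> explode(List<List<Integer>> rows) {
        return rows.stream()
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Integer>> rows = Arrays.asList(
                Arrays.asList(101, 102, 103),       // row 1: ids array
                Arrays.asList(201, 202, 203, 204)); // row 2: ids array
        System.out.println(explode(rows).size());   // 7 output rows
    }
}
```

Two input rows with 3 and 4 elements yield 7 output rows, which is exactly what LATERAL VIEW explode produces over an array column.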
Write Code
Import Dependencies
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>2.3.7</version>
<scope>provided</scope>
</dependency>
Write Code
package icu.wzk;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.apache.hadoop.hive.ql.exec.UDF;

import java.util.ArrayList;

public class ParseJsonArray extends UDF {

    /**
     * Parse the array under arrKey in a JSON string.
     * Returns null for empty input, invalid JSON, or a missing key.
     */
    public ArrayList<String> evaluate(String jsonStr, String arrKey) {
        if (jsonStr == null || jsonStr.isEmpty()) {
            return null;
        }
        try {
            JSONObject object = JSON.parseObject(jsonStr);
            JSONArray jsonArray = object.getJSONArray(arrKey);
            if (jsonArray == null) {
                // Key missing or value is not an array
                return null;
            }
            ArrayList<String> result = new ArrayList<>();
            for (Object o : jsonArray) {
                result.add(o.toString());
            }
            return result;
        } catch (Exception e) {
            // Invalid JSON: return null so one bad row does not fail the job
            return null;
        }
    }
}
Test Function
-- Add the custom jar
add jar /opt/wzk/hive-parse-json-array-1.0-SNAPSHOT-jar-with-dependencies.jar;
-- Create a temporary function
CREATE TEMPORARY FUNCTION wzk_json_array AS "icu.wzk.ParseJsonArray";
-- Basic query test
SELECT username, age, sex, wzk_json_array(json, "ids") AS ids FROM jsont1;
-- Parse the array in the JSON string and explode it into rows
SELECT username, age, sex, ids1
FROM jsont1
LATERAL VIEW explode(wzk_json_array(json, "ids")) t1 AS ids1;
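Note that when the UDF returns null, explode produces no output rows, so the whole source row silently disappears from the join. One alternative policy is to return an empty list instead (or keep null and write LATERAL VIEW OUTER). A sketch of the empty-list variant — `ParseJsonArraySafe` is a name invented here, and plain string handling stands in for fastjson so the snippet runs standalone:

```java
import java.util.ArrayList;
import java.util.List;

public class ParseJsonArraySafe {
    // Same contract as ParseJsonArray.evaluate, but missing/invalid input
    // yields an empty list, so explode() emits zero rows for that record
    // instead of dropping the row entirely.
    public static List<String> evaluate(String jsonStr, String arrKey) {
        List<String> result = new ArrayList<>();
        if (jsonStr == null || arrKey == null || jsonStr.isEmpty()) {
            return result;                    // empty, never null
        }
        // Crude extraction for illustration only; the real UDF uses fastjson
        int keyPos = jsonStr.indexOf("\"" + arrKey + "\"");
        if (keyPos < 0) {
            return result;                    // key missing: still empty
        }
        int open = jsonStr.indexOf('[', keyPos);
        int close = jsonStr.indexOf(']', open);
        if (open < 0 || close < 0) {
            return result;                    // value is not an array: still empty
        }
        String body = jsonStr.substring(open + 1, close).trim();
        if (!body.isEmpty()) {
            for (String part : body.split(",")) {
                result.add(part.trim());
            }
        }
        return result;
    }
}
```

With this policy a record whose json field lacks "ids" simply contributes zero exploded rows, which is usually easier to reason about than rows vanishing.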
Use SerDe to Process
SerDe (Serializer and Deserializer)
SerDe is short for Serializer/Deserializer; it defines how data is read from and written to storage.
SerDe Function
- Deserialization: parses the stored byte stream into rows of a Hive table
- Serialization: converts rows of a Hive table into the storage format
- Supports custom data formats
Create Data
{"id": 1,"ids": [101,102,103],"total_number": 3}
{"id": 2,"ids": [201,202,203,204],"total_number": 4}
Test
CREATE TABLE jsont2 (
    id int,
    ids array<string>,
    total_number int
) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
-- JsonSerDe ships in hive-hcatalog-core; if the class is not found, add that jar first
LOAD DATA LOCAL INPATH '/opt/wzk/json2.dat' INTO TABLE jsont2;
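JsonSerDe expects exactly one complete JSON object per line of the file. A rough pre-load sanity check with only the JDK (the `JsonLineCheck` class and its start/end-brace heuristic are mine, not part of the tutorial, and do not replace real JSON validation):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class JsonLineCheck {
    // Flag any non-empty line that does not start with '{' and end with '}':
    // such lines will load as all-NULL rows under JsonSerDe.
    public static List<String> suspiciousLines(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                .filter(l -> !l.isEmpty())
                .filter(l -> !(l.startsWith("{") && l.endsWith("}")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "{\"id\": 1,\"ids\": [101,102,103],\"total_number\": 3}",
                "{\"id\": 2,\"ids\": [201,");  // truncated line, would break the load
        System.out.println(suspiciousLines(sample));
    }
}
```

In practice you would read the lines with `Files.readAllLines` before running LOAD DATA; the heuristic only catches truncated or wrapped lines, not invalid JSON inside them.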
Final Summary
- For simple, flat JSON fields, use get_json_object or json_tuple
- For nested or array types, use a UDF (plus explode for rows)
- For files of pure JSON lines, JsonSerDe is the simplest option
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| UDF returns all NULL | jsonStr is empty or not valid JSON | Verify the input is valid JSON and the array key exists; inspect sample rows |
| Rows disappear after explode | explode over a NULL array emits no rows | Return an empty array for missing fields, or use LATERAL VIEW OUTER |
| ClassNotFound | jar not distributed to the execution nodes | Build a fat jar with dependencies and re-run add jar in the session |
| All columns NULL after creating the JsonSerDe table | column names/types do not match the JSON keys, or lines are not complete JSON objects | Align column names/types with the JSON keys and keep one complete JSON object per line |