TL;DR

  • Scenario: processing JSON fields in a Hive offline data warehouse; we need to extract arrays and explode them into multiple rows, and also support loading pure JSON files directly into tables
  • Conclusion: for nested/array fields, prefer a UDF + explode; for pure JSON files, prefer JsonSerDe; for simple fields, use json_tuple / get_json_object
  • Output: a reusable Java UDF (JSON array → array) plus three groups of directly runnable Hive SQL templates

JSON Data Processing

Use UDF to Process

We write a custom UDF that extracts an array from a JSON string:

  • Input: JSON string, array key
  • Output: String array
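
Before wiring anything into Hive, the input/output mapping above can be sketched in plain Java. This is a regex-based stand-in with no Hive or fastjson dependency, so it only handles flat arrays (no nested brackets or commas inside string elements); the real UDF below uses a proper JSON parser. The class and method names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the mapping the UDF will implement:
// ("{\"ids\":[101,102]}", "ids") -> ["101", "102"]
public class JsonArraySketch {
    public static List<String> parseArray(String jsonStr, String arrKey) {
        List<String> result = new ArrayList<>();
        if (jsonStr == null || jsonStr.isEmpty()) {
            return result; // empty array, not null, keeps downstream explode predictable
        }
        // Match "arrKey": [ ... ] and capture the bracket contents
        Pattern p = Pattern.compile(
                "\"" + Pattern.quote(arrKey) + "\"\\s*:\\s*\\[([^\\]]*)\\]");
        Matcher m = p.matcher(jsonStr);
        if (!m.find()) {
            return result; // key missing: empty array
        }
        for (String item : m.group(1).split(",")) {
            String v = item.trim().replaceAll("^\"|\"$", ""); // strip surrounding quotes
            if (!v.isEmpty()) {
                result.add(v);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [101, 102, 103]
        System.out.println(parseArray("{\"id\":1,\"ids\":[101,102,103]}", "ids"));
    }
}
```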

UDF (User Defined Function)

A UDF is a user-defined function that extends the functionality of a big data processing system. Through a UDF, users can implement specific business logic for data transformation or calculation.

Extensibility

Although big data processing tools such as Hive and Spark SQL provide rich built-in functions, real business scenarios often have special needs that built-in functions cannot meet. The UDF mechanism lets users implement custom logic for their business needs, greatly extending data processing capabilities.

Flexibility

Depending on the engine, UDFs can be written in multiple programming languages, such as Java, Python, Scala, and JavaScript.

Main Types

  1. Regular UDF: single-row input, single-row output
  2. UDAF: multi-row input, single-row output (aggregate functions)
  3. UDTF: single-row input, multi-row output, used with LATERAL VIEW
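
Where each type shows up in a query can be sketched with built-ins (table and column names here are illustrative):

```sql
-- Regular UDF: one value in, one value out per row
SELECT upper(username) FROM jsont1;

-- UDAF: many rows in, one aggregated value out
SELECT sex, count(*) AS cnt FROM jsont1 GROUP BY sex;

-- UDTF: one row in, many rows out, used with LATERAL VIEW
SELECT username, n
FROM jsont1 LATERAL VIEW explode(array(1, 2, 3)) t AS n;
```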

Write Code

Import Dependencies

<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>2.3.7</version>
  <scope>provided</scope>
</dependency>
<!-- The UDF below also imports fastjson; package it into the fat jar -->
<dependency>
  <groupId>com.alibaba</groupId>
  <artifactId>fastjson</artifactId>
  <version>1.2.83</version>
</dependency>

Implement the UDF

package icu.wzk;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.apache.hadoop.hive.ql.exec.UDF;
import java.util.ArrayList;

public class ParseJsonArray extends UDF {
    /**
     * Extracts the array under arrKey from a JSON string.
     * Returns null for empty input or malformed JSON; a missing key
     * yields an empty array so such rows stay distinguishable.
     */
    public ArrayList<String> evaluate(String jsonStr, String arrKey) {
        if (jsonStr == null || jsonStr.isEmpty()) {
            return null;
        }
        try {
            JSONObject object = JSON.parseObject(jsonStr);
            JSONArray jsonArray = object.getJSONArray(arrKey);
            ArrayList<String> result = new ArrayList<>();
            if (jsonArray == null) {
                // Key missing: return an empty array instead of hitting a NullPointerException
                return result;
            }
            for (Object o : jsonArray) {
                result.add(o.toString());
            }
            return result;
        } catch (Exception e) {
            // Malformed JSON
            return null;
        }
    }
}

Test Function

-- Add the custom jar
add jar /opt/wzk/hive-parse-json-array-1.0-SNAPSHOT-jar-with-dependencies.jar;

-- Create a temporary function
CREATE TEMPORARY FUNCTION wzk_json_array AS "icu.wzk.ParseJsonArray";

-- Run a test query
SELECT username, age, sex, wzk_json_array(json, "ids") AS ids FROM jsont1;

-- Parse the array in the json string and explode it into rows
SELECT username, age, sex, ids1
FROM jsont1
LATERAL VIEW explode(wzk_json_array(json, "ids")) t1 AS ids1;
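
One caveat worth knowing: plain LATERAL VIEW explode produces no output rows when the UDF returns NULL or an empty array, so those input rows disappear. If they should survive, LATERAL VIEW OUTER keeps them with a NULL element:

```sql
-- Rows whose JSON lacks "ids" are kept, with ids1 = NULL
SELECT username, age, sex, ids1
FROM jsont1
LATERAL VIEW OUTER explode(wzk_json_array(json, "ids")) t1 AS ids1;
```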

Use SerDe to Process

SerDe (Serializer and Deserializer)

SerDe is short for Serializer/Deserializer; it defines how Hive reads and writes data.

SerDe Function

  • Deserialization: Parse byte stream in storage into row data in Hive table
  • Serialization: Convert row data in Hive table into storage format
  • Support custom data formats

Create Data

{"id": 1,"ids": [101,102,103],"total_number": 3}
{"id": 2,"ids": [201,202,203,204],"total_number": 4}

Test

-- The JsonSerDe class ships in hive-hcatalog-core; add that jar first if it is not on the classpath
-- add jar /path/to/hive-hcatalog-core.jar;
CREATE TABLE jsont2(
  id int,
  ids array<string>,
  total_number int
) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

load data local inpath '/opt/wzk/json2.dat' into table jsont2;
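
A quick sanity check after loading (assuming the jsont2 table above):

```sql
-- Array elements are addressable by index; size() checks the parsed length
SELECT id, ids[0], size(ids), total_number FROM jsont2;
```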

Final Summary

  • For simple, flat JSON fields, use get_json_object or json_tuple
  • For nested or array fields, use a UDF
  • For files that are pure JSON lines, JsonSerDe is simpler
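
For the simple-field case, a sketch of both built-ins; jsont2_raw is a hypothetical table with a single json string column shaped like the sample lines above:

```sql
-- get_json_object: one field at a time, with JSONPath-style paths
SELECT get_json_object(json, '$.id') AS id,
       get_json_object(json, '$.ids[0]') AS first_id
FROM jsont2_raw;

-- json_tuple: several top-level fields in one pass (no JSONPath support)
SELECT j.id, j.total_number
FROM jsont2_raw
LATERAL VIEW json_tuple(json, 'id', 'total_number') j AS id, total_number;
```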

Error Quick Reference

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| UDF returns all NULL | jsonStr is empty or invalid JSON | First check that jsonStr is non-empty and well-formed |
| Rows disappear after explode | explode produces no rows when the UDF returns NULL | Return an empty array for missing fields, or use LATERAL VIEW OUTER |
| ClassNotFound | jar not distributed to execution nodes | Use a fat jar with dependencies |
| All NULL after creating the JsonSerDe table | Field names/types don't match the JSON | Match column names/types to the JSON keys, and keep each line a complete JSON object |