TL;DR
- Scenario: processing JSON fields in a Hive offline data warehouse; we need to extract arrays and explode them into multiple rows, and also support loading pure JSON files directly into tables
- Conclusion: for nested/array fields, prefer a UDF + explode; for pure JSON files, prefer JsonSerDe; for simple flat fields, use json_tuple / get_json_object
- Output: a reusable Java UDF (JSON array → string array) plus three groups of directly runnable Hive SQL templates
JSON Data Processing
Use UDF to Process
Write a custom UDF to parse the array inside a JSON string:
- Input: the JSON string and the array's key
- Output: a string array
UDF (User Defined Function)
A UDF is a user-defined function used to extend the functionality of big data processing systems. With UDFs, users can implement business-specific logic for data transformation or calculation.
Extensibility
Although big data engines such as Hive and Spark SQL ship with rich built-in functions, real business scenarios often involve special requirements that built-in functions cannot meet. The UDF mechanism lets users plug in custom logic, greatly extending data processing capabilities.
Flexibility
Depending on the engine, UDFs can be written in several languages, e.g. Java, Python, Scala, or JavaScript; Hive's standard UDF interface, used below, is Java-based.
Main Types
- Regular UDF: one row in, one row out
- UDAF: many rows in, one row out (aggregate functions)
- UDTF: one row in, many rows out; typically used with LATERAL VIEW
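The one-row-in, many-rows-out behavior of a UDTF like explode() can be modeled as flatMap over a collection. A minimal sketch (the `ExplodeAnalogy` class and sample values are illustrative, not part of the original project):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ExplodeAnalogy {
    // "Explode" a column of arrays: every array element becomes its own row
    public static List<Integer> explode(List<List<Integer>> rows) {
        return rows.stream()
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Integer>> rows = Arrays.asList(
                Arrays.asList(101, 102, 103),       // row 1: ids array
                Arrays.asList(201, 202, 203, 204)); // row 2: ids array
        System.out.println(explode(rows).size());   // 7 output rows
    }
}
```

Two input rows with 3 and 4 elements yield 7 output rows, which is exactly what LATERAL VIEW explode produces over an array column.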
Write Code
Import Dependencies
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>2.3.7</version>
<scope>provided</scope>
</dependency>
Write Code
package icu.wzk;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.apache.hadoop.hive.ql.exec.UDF;

import java.util.ArrayList;

public class ParseJsonArray extends UDF {

    /**
     * Parse the array under arrKey in a JSON string.
     * Returns null for empty input, invalid JSON, or a missing key.
     */
    public ArrayList<String> evaluate(String jsonStr, String arrKey) {
        if (jsonStr == null || jsonStr.isEmpty()) {
            return null;
        }
        try {
            JSONObject object = JSON.parseObject(jsonStr);
            JSONArray jsonArray = object.getJSONArray(arrKey);
            if (jsonArray == null) {
                // Key missing or value is not an array
                return null;
            }
            ArrayList<String> result = new ArrayList<>();
            for (Object o : jsonArray) {
                result.add(o.toString());
            }
            return result;
        } catch (Exception e) {
            // Invalid JSON: return null so one bad row does not fail the job
            return null;
        }
    }
}
Test Function
-- Add the custom jar
add jar /opt/wzk/hive-parse-json-array-1.0-SNAPSHOT-jar-with-dependencies.jar;
-- Create a temporary function
CREATE TEMPORARY FUNCTION wzk_json_array AS "icu.wzk.ParseJsonArray";
-- Basic query test
SELECT username, age, sex, wzk_json_array(json, "ids") AS ids FROM jsont1;
-- Parse the array in the JSON string and explode it into rows
SELECT username, age, sex, ids1
FROM jsont1
LATERAL VIEW explode(wzk_json_array(json, "ids")) t1 AS ids1;
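Note that when the UDF returns null, explode produces no output rows, so the whole source row silently disappears from the join. One alternative policy is to return an empty list instead (or keep null and write LATERAL VIEW OUTER). A sketch of the empty-list variant — `ParseJsonArraySafe` is a name invented here, and plain string handling stands in for fastjson so the snippet runs standalone:

```java
import java.util.ArrayList;
import java.util.List;

public class ParseJsonArraySafe {
    // Same contract as ParseJsonArray.evaluate, but missing/invalid input
    // yields an empty list, so explode() emits zero rows for that record
    // instead of dropping the row entirely.
    public static List<String> evaluate(String jsonStr, String arrKey) {
        List<String> result = new ArrayList<>();
        if (jsonStr == null || arrKey == null || jsonStr.isEmpty()) {
            return result;                    // empty, never null
        }
        // Crude extraction for illustration only; the real UDF uses fastjson
        int keyPos = jsonStr.indexOf("\"" + arrKey + "\"");
        if (keyPos < 0) {
            return result;                    // key missing: still empty
        }
        int open = jsonStr.indexOf('[', keyPos);
        int close = jsonStr.indexOf(']', open);
        if (open < 0 || close < 0) {
            return result;                    // value is not an array: still empty
        }
        String body = jsonStr.substring(open + 1, close).trim();
        if (!body.isEmpty()) {
            for (String part : body.split(",")) {
                result.add(part.trim());
            }
        }
        return result;
    }
}
```

With this policy a record whose json field lacks "ids" simply contributes zero exploded rows, which is usually easier to reason about than rows vanishing.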
Use SerDe to Process
SerDe (Serializer and Deserializer)
SerDe is short for Serializer/Deserializer; it defines how data is read from and written to storage.
SerDe Function
- Deserialization: parses the stored byte stream into rows of a Hive table
- Serialization: converts rows of a Hive table into the storage format
- Supports custom data formats
Create Data
{"id": 1,"ids": [101,102,103],"total_number": 3}
{"id": 2,"ids": [201,202,203,204],"total_number": 4}
Test
CREATE TABLE jsont2 (
    id int,
    ids array<string>,
    total_number int
) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
-- JsonSerDe ships in hive-hcatalog-core; if the class is not found, add that jar first
LOAD DATA LOCAL INPATH '/opt/wzk/json2.dat' INTO TABLE jsont2;
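JsonSerDe expects exactly one complete JSON object per line of the file. A rough pre-load sanity check with only the JDK (the `JsonLineCheck` class and its start/end-brace heuristic are mine, not part of the tutorial, and do not replace real JSON validation):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class JsonLineCheck {
    // Flag any non-empty line that does not start with '{' and end with '}':
    // such lines will load as all-NULL rows under JsonSerDe.
    public static List<String> suspiciousLines(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                .filter(l -> !l.isEmpty())
                .filter(l -> !(l.startsWith("{") && l.endsWith("}")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "{\"id\": 1,\"ids\": [101,102,103],\"total_number\": 3}",
                "{\"id\": 2,\"ids\": [201,");  // truncated line, would break the load
        System.out.println(suspiciousLines(sample));
    }
}
```

In practice you would read the lines with `Files.readAllLines` before running LOAD DATA; the heuristic only catches truncated or wrapped lines, not invalid JSON inside them.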
Final Summary
- For simple, flat JSON fields, use get_json_object or json_tuple
- For nested or array types, use a UDF (plus explode for rows)
- For files of pure JSON lines, JsonSerDe is the simplest option
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| UDF returns all NULL | jsonStr is empty or not valid JSON | Verify the input is valid JSON and the array key exists; inspect sample rows |
| Rows disappear after explode | explode over a NULL array emits no rows | Return an empty array for missing fields, or use LATERAL VIEW OUTER |
| ClassNotFound | jar not distributed to the execution nodes | Build a fat jar with dependencies and re-run add jar in the session |
| All columns NULL after creating the JsonSerDe table | column names/types do not match the JSON keys, or lines are not complete JSON objects | Align column names/types with the JSON keys and keep one complete JSON object per line |