TL;DR
- Scenario: in an offline advertising data warehouse, after the daily ADS-layer Hive load, the data must be exported to MySQL reliably for querying and display.
- Conclusion: Shell scripts + Hive partition loading + a parameterized DataX JSON job form a simple, reusable offline export solution.
- Output: ADS loading steps, the DataX export job, the MySQL landing table, and a checklist for diagnosing and fixing common errors.
Import Data
Load ADS Layer
ad_show_page
sh /opt/wzk/hive/ads_load_ad_show_page.sh 2020-07-21
View data in Hive:
hive
use ads;
select * from ads_ad_show_place limit 5;
Load the remaining days:
sh /opt/wzk/hive/ads_load_ad_show_page.sh 2020-07-22
sh /opt/wzk/hive/ads_load_ad_show_page.sh 2020-07-23
sh /opt/wzk/hive/ads_load_ad_show_page.sh 2020-07-24
After all days are loaded, check the total row count in Hive:
select count(*) from ads_ad_show_place;
ad_show_page_window
sh /opt/wzk/hive/ads_load_ad_show_page_window.sh 2020-07-21
View the data in Hive:
hive
use ads;
select * from ads_ad_show_place_window limit 5;
Load the remaining days:
sh /opt/wzk/hive/ads_load_ad_show_page_window.sh 2020-07-22
sh /opt/wzk/hive/ads_load_ad_show_page_window.sh 2020-07-23
sh /opt/wzk/hive/ads_load_ad_show_page_window.sh 2020-07-24
Check the total row count in Hive:
select count(*) from ads_ad_show_place_window;
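The loader scripts themselves are not shown in this article. Below is a hedged sketch of what a script like these typically contains; the DWS source table (dws.dws_ad_show) and the aggregation columns are assumptions, not the actual script, and the date-defaulting logic mirrors the export script shown later in this article.

```shell
#!/bin/bash
# Hypothetical sketch of an ADS loader such as ads_load_ad_show_page.sh.
# Default do_date to yesterday when no date argument is given.
if [ -n "$1" ]; then
  do_date=$1
else
  do_date=$(date -d "-1 day" +%F)
fi
echo "Loading ADS partition dt=$do_date"

# Aggregate from an assumed DWS table into the ADS partition; the real
# source table and grouping columns depend on the warehouse design.
sql="
insert overwrite table ads.ads_ad_show_place partition (dt='$do_date')
select ad_action, hour, place, product_id, count(1) as cnt
from dws.dws_ad_show
where dt='$do_date'
group by ad_action, hour, place, product_id;
"

# Run Hive only when the client is available, so the sketch is safe to dry-run.
if command -v hive >/dev/null 2>&1; then
  hive -e "$sql"
else
  echo "hive client not found; skipping execution"
fi
```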
Export Data
Execution Steps
- Create the target table in MySQL
- Write the DataX configuration file (JSON)
- Run DataX with the JSON configuration file and verify the result
- Wrap the command in an execution script (Shell)
- Test the Shell script
MySQL
drop table if exists dwads.ads_ad_show_place;
create table dwads.ads_ad_show_place(
ad_action tinyint,
hour varchar(2),
place varchar(20),
product_id int,
cnt int,
dt varchar(10)
);
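The DataX job below deletes and reinserts by partition, so its delete predicate is on dt. If the table grows large, an index on dt keeps that preSql delete from scanning the whole table. This index is an optional addition, not part of the original setup:

```sql
-- Optional: index the partition-date column so the per-partition
-- delete issued by DataX preSql does not scan the full table.
create index idx_ads_dt on dwads.ads_ad_show_place (dt);
```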
DataX
Configuration File
vim /opt/wzk/datax/ads_ad_show_place.json
Write the following content:
{
  "job": {
    "setting": {
      "speed": {
        "channel": 1
      }
    },
    "content": [
      {
        "reader": {
          "name": "hdfsreader",
          "parameter": {
            "path": "/user/hive/warehouse/ads.db/ads_ad_show_place/dt=$do_date/*",
            "defaultFS": "hdfs://h121.wzk.icu:9000",
            "column": [
              { "index": 0, "type": "string" },
              { "index": 1, "type": "string" },
              { "index": 2, "type": "string" },
              { "index": 3, "type": "string" },
              { "index": 4, "type": "string" },
              { "type": "string", "value": "$do_date" }
            ],
            "fileType": "text",
            "encoding": "UTF-8",
            "fieldDelimiter": ","
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "writeMode": "insert",
            "username": "hive",
            "password": "hive@wzk.icu",
            "column": [
              "ad_action",
              "hour",
              "place",
              "product_id",
              "cnt",
              "dt"
            ],
            "preSql": [
              "delete from ads_ad_show_place where dt='$do_date'"
            ],
            "connection": [
              {
                "jdbcUrl": "jdbc:mysql://h122.wzk.icu:3306/dwads?useUnicode=true&characterEncoding=utf-8",
                "table": [
                  "ads_ad_show_place"
                ]
              }
            ]
          }
        }
      }
    ]
  }
}
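One failure mode worth catching early is an invalid job file, for example a preSql string broken across lines. A small helper for validating the file before submitting it; it assumes python3 or python is on the PATH:

```shell
#!/bin/bash
# Prefer python3, fall back to plain python.
PY=$(command -v python3 || command -v python)

check_datax_json() {
  # Exit 0 only when the file parses as valid JSON.
  "$PY" -m json.tool "$1" >/dev/null 2>&1
}

# Check the job file written above; report either way so the script exits cleanly.
check_datax_json /opt/wzk/datax/ads_ad_show_place.json \
  && echo "JSON OK" \
  || echo "invalid JSON or file missing"
```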
DataX Introduction
DataX is an offline data synchronization tool open-sourced by Alibaba, built to move data between heterogeneous storage systems. It is designed to be efficient, stable, and easily extensible, adapting to a wide range of complex data synchronization requirements.
Core Features
Supports multiple data sources:
- Relational databases: MySQL, PostgreSQL, Oracle, SQL Server, DB2, etc.
- NoSQL databases: MongoDB, HBase, etc.
- Big data storage systems: Hive, MaxCompute (ODPS), HDFS.
- Others: FTP files, ElasticSearch, etc.
High performance:
- Based on multi-threaded parallel architecture, fully utilizes multi-core performance of machines.
- Supports sharded transmission, improving data transmission throughput.
Flexibility and ease of use:
- Configuration file based: Use JSON format configuration files to define tasks, simple and intuitive.
- Supports task scheduling, can integrate with scheduling tools to implement scheduled tasks.
- Compatible with multiple data formats and transmission protocols.
Extensibility:
- Plugin-based architecture, developers can support new data sources by writing Reader and Writer plugins.
Open source and community support:
- Based on Apache 2.0 open source license, developers can freely use and modify.
- Has active community and rich documentation support.
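The task-scheduling point above is usually realized by pairing DataX with cron or a dedicated scheduler. A minimal cron sketch, where the log path is an assumption; the export script (shown below in this article) defaults do_date to the previous day when no argument is given:

```shell
# Hypothetical crontab entry: export yesterday's partition at 02:30 every day.
# 30 2 * * * /bin/bash /opt/wzk/hive/ads_ad_show_place.sh >> /opt/wzk/logs/datax_export.log 2>&1
```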
Components
Reader:
- Responsible for reading data from data sources.
- Examples: MySQLReader, HdfsReader.
Writer:
- Responsible for writing data to target storage.
- Examples: MySQLWriter, MySQLWriter.
Framework:
- DataX’s core scheduling engine, responsible for coordinating Reader and Writer.
- Provides error handling, data validation, performance optimization, etc.
Transform:
- Used for processing and transforming transmitted data (optional).
- For example, data format conversion, field addition/deletion/modification, etc.
Workflow
Initialization:
- Load user-configured JSON file, parse Reader and Writer configuration.
- Prepare task context.
Read data:
- Reader reads source data and outputs in batches.
Transform data:
- Optional step, Transform module processes data.
Write data:
- Writer receives data from Reader and writes to target storage.
Task management and monitoring:
- DataX provides real-time task running logs and statistics, including speed, success rate, error messages, etc.
Execute Export
vim /opt/wzk/hive/ads_ad_show_place.sh
Write the following content:
#!/bin/bash
source /etc/profile

# Directory holding the DataX job files
JSON=/opt/wzk/datax

# Default to yesterday when no date argument is given
if [ -n "$1" ]; then
  do_date=$1
else
  do_date=$(date -d "-1 day" +%F)
fi

# -p passes -Ddo_date so DataX substitutes $do_date inside the JSON job file
python $DATAX_HOME/bin/datax.py -p "-Ddo_date=$do_date" $JSON/ads_ad_show_place.json
Run the script for a given date:
sh /opt/wzk/hive/ads_ad_show_place.sh 2020-07-21
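After the export, it is worth cross-checking that the MySQL row counts match the Hive count(*) results from earlier. For example:

```sql
-- Row count per exported partition; should match the Hive counts above.
select dt, count(*) as cnt
from dwads.ads_ad_show_place
group by dt
order by dt;
```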
Error Troubleshooting
| Symptom | Root Cause | Fix |
|---|---|---|
| DataX reports a JSON parse error immediately at startup | A string in preSql spans multiple lines, making the JSON invalid | Validate the config with python -m json.tool or an online JSON validator; write delete from ads_ad_show_place where dt='$do_date' as a single-line string |
| DataX task succeeds but MySQL has no data | The HDFS partition path does not exist, or the date variable was not substituted | Check that /user/hive/warehouse/.../dt=$do_date/* actually contains files; confirm the -Ddo_date parameter was passed and the Hive partition directory matches the table name |
| Column count mismatch during export | The number of columns read from HDFS does not match the MySQL target table | Compare the reader.column and writer.column counts; keep 5 source columns plus 1 constant dt column, strictly aligned with the 6 target columns |
| Running the script reports datax.py not found | Environment variables not loaded, or $DATAX_HOME not set | Check echo $DATAX_HOME, add DATAX_HOME to /etc/profile, and re-source it |
| Duplicate rows in MySQL after export | No per-partition delete before writing, or preSql did not take effect | Check for duplicate records with the same dt; keep the preSql delete where dt='$do_date', and confirm the table has a dt column |
| Hive query returns no results | The ADS load script did not run successfully, or the queried table name is wrong | Run show tables first, then check the script logs; unify table and script names, avoiding a mix of ad_show_page and ads_ad_show_place |
| Readers cannot reproduce the steps | Headings, script names, and screenshots in the article are not consistent with the text | Read through headings, script names, SQL table names, and JSON paths and unify the naming; use shell code blocks for Shell and sql code blocks for SQL |
| Export script runs but exports the wrong date | With no argument the script defaults to the previous day, which differs from the example date | Add echo $do_date to the script to print the date actually used, then rerun with the date specified explicitly |
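Several rows of the table reduce to the same habit: check the inputs before blaming DataX. A small pre-flight sketch combining those checks; the warnings and paths are illustrative, not from the original scripts:

```shell
#!/bin/bash
# Hypothetical pre-flight check assembling the fixes from the table above.
do_date=${1:-$(date -d "-1 day" +%F)}
echo "do_date=$do_date"   # print the date actually used (last table row)

# Environment check (datax.py-not-found row).
[ -n "$DATAX_HOME" ] || echo "WARN: DATAX_HOME is not set; run: source /etc/profile"

# Partition check (empty-MySQL row); skipped quietly when no HDFS client is present.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -test -d "/user/hive/warehouse/ads.db/ads_ad_show_place/dt=$do_date" ||
    echo "WARN: Hive partition dt=$do_date not found on HDFS"
fi
```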