TL;DR

  • Scenario: in an offline advertising data warehouse, the ADS-layer Hive data loaded each day must be exported reliably to MySQL for querying and display.
  • Conclusion: Shell + a Hive partition-loading script + a parameterized DataX JSON config makes a simple, reusable offline export solution.
  • Output: covers ADS loading, DataX export, MySQL landing, and a checklist of common errors and fixes.

Import Data

Load ADS Layer

ad_show_page

sh /opt/wzk/hive/ads_load_ad_show_page.sh 2020-07-21

View data in Hive:

hive

use ads;
select * from ads_ad_show_place limit 5;

Continue importing other data:

sh /opt/wzk/hive/ads_load_ad_show_page.sh 2020-07-22
sh /opt/wzk/hive/ads_load_ad_show_page.sh 2020-07-23
sh /opt/wzk/hive/ads_load_ad_show_page.sh 2020-07-24

Check the final row count in Hive:

select count(*) from ads_ad_show_place;

ad_show_page_window

sh /opt/wzk/hive/ads_load_ad_show_page_window.sh 2020-07-21

View the data in Hive:

hive

use ads;
select * from ads_ad_show_place_window limit 5;

Continue loading other data:

sh /opt/wzk/hive/ads_load_ad_show_page_window.sh 2020-07-22
sh /opt/wzk/hive/ads_load_ad_show_page_window.sh 2020-07-23
sh /opt/wzk/hive/ads_load_ad_show_page_window.sh 2020-07-24

Check the total row count in Hive:

select count(*) from ads_ad_show_place_window;

Export Data

Execution Steps

  • Create corresponding table in MySQL
  • Create configuration file (JSON)
  • Run DataX with the JSON configuration file and verify the result
  • Write execution script (Shell)
  • Test Shell script

MySQL

drop table if exists dwads.ads_ad_show_place;
create table dwads.ads_ad_show_place(
  ad_action tinyint,
  hour varchar(2),
  place varchar(20),
  product_id int,
  cnt int,
  dt varchar(10)
);

DataX

Configuration File

vim /opt/wzk/datax/ads_ad_show_place.json

Write the following content:

{
  "job":{
    "setting":{
      "speed":{
        "channel":1
      }
    },
    "content":[
      {
        "reader":{
          "name":"hdfsreader",
          "parameter":{
            "path":"/user/hive/warehouse/ads.db/ads_ad_show_place/dt=$do_date/*",
            "defaultFS":"hdfs://h121.wzk.icu:9000",
            "column":[
              {
                "index":0,
                "type":"string"
              },
              {
                "index":1,
                "type":"string"
              },
              {
                "index":2,
                "type":"string"
              },
              {
                "index":3,
                "type":"string"
              },
              {
                "index":4,
                "type":"string"
              },
              {
                "type":"string",
                "value":"$do_date"
              }
            ],
            "fileType":"text",
            "encoding":"UTF-8",
            "fieldDelimiter":","
          }
        },
        "writer":{
          "name":"mysqlwriter",
          "parameter":{
            "writeMode":"insert",
            "username":"hive",
            "password":"hive@wzk.icu",
            "column":[
              "ad_action",
              "hour",
              "place",
              "product_id",
              "cnt",
              "dt"
            ],
            "preSql":[
              "delete from ads_ad_show_place where dt='$do_date'"
            ],
            "connection":[
              {
                "jdbcUrl":"jdbc:mysql://h122.wzk.icu:3306/dwads?useUnicode=true&characterEncoding=utf-8",
                "table":[
                  "ads_ad_show_place"
                ]
              }
            ]
          }
        }
      }
    ]
  }
}
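Because the HDFS reader's column list (5 indexed columns plus one $do_date constant) must line up with the MySQL writer's 6 columns, it is worth sanity-checking the config before running it. The sketch below inlines a trimmed copy of the config above so the check is self-contained; in practice you would json.load the actual file at /opt/wzk/datax/ads_ad_show_place.json.

```python
import json

# Trimmed copy of the DataX job above: only the column lists matter here.
CONFIG = """
{
  "job": {"content": [{
    "reader": {"parameter": {"column": [
      {"index": 0, "type": "string"},
      {"index": 1, "type": "string"},
      {"index": 2, "type": "string"},
      {"index": 3, "type": "string"},
      {"index": 4, "type": "string"},
      {"type": "string", "value": "$do_date"}
    ]}},
    "writer": {"parameter": {"column":
      ["ad_action", "hour", "place", "product_id", "cnt", "dt"]}}
  }]}
}
"""

content = json.loads(CONFIG)["job"]["content"][0]
n_read = len(content["reader"]["parameter"]["column"])
n_write = len(content["writer"]["parameter"]["column"])

# Source columns + the dt constant must equal the target column count.
assert n_read == n_write == 6, (n_read, n_write)
print("reader/writer column counts match:", n_read)
```

json.loads will also catch the multi-line preSql problem listed in the troubleshooting table below a malformed file fails to parse at all.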

DataX Introduction

DataX is an offline data synchronization tool open-sourced by Alibaba, built to move data between heterogeneous storage systems. It is designed to be efficient, stable, and easy to extend, and it adapts to a wide range of data synchronization requirements.

Core Features

Supports multiple data sources:

  • Relational databases: MySQL, PostgreSQL, Oracle, SQL Server, DB2, etc.
  • NoSQL databases: MongoDB, HBase, etc.
  • Big data storage systems: Hive, MaxCompute (ODPS), HDFS.
  • Others: FTP files, ElasticSearch, etc.

High performance:

  • Based on multi-threaded parallel architecture, fully utilizes multi-core performance of machines.
  • Supports sharded transmission, improving data transmission throughput.

Flexibility and ease of use:

  • Configuration file based: Use JSON format configuration files to define tasks, simple and intuitive.
  • Supports task scheduling, can integrate with scheduling tools to implement scheduled tasks.
  • Compatible with multiple data formats and transmission protocols.

Extensibility:

  • Plugin-based architecture, developers can support new data sources by writing Reader and Writer plugins.

Open source and community support:

  • Based on Apache 2.0 open source license, developers can freely use and modify.
  • Has active community and rich documentation support.

Components

Reader:

  • Responsible for reading data from data sources.
  • Examples: MySQLReader, HdfsReader.

Writer:

  • Responsible for writing data to target storage.
  • Examples: MySQLWriter, HdfsWriter.

Framework:

  • DataX’s core scheduling engine, responsible for coordinating Reader and Writer.
  • Provides error handling, data validation, performance optimization, etc.

Transform:

  • Used for processing and transforming transmitted data (optional).
  • For example, data format conversion, field addition/deletion/modification, etc.

Workflow

Initialization:

  • Load user-configured JSON file, parse Reader and Writer configuration.
  • Prepare task context.

Read data:

  • Reader reads source data and outputs in batches.

Transform data:

  • Optional step, Transform module processes data.

Write data:

  • Writer receives data from Reader and writes to target storage.

Task management and monitoring:

  • DataX provides real-time task running logs and statistics, including speed, success rate, error messages, etc.
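The Reader → Framework → Writer flow above can be sketched as a toy pipeline. This is purely illustrative: the function names are made up and do not reflect the real DataX plugin API; the point is that the reader emits records in batches and the framework hands them to the writer.

```python
def reader(lines):
    """Toy Reader: parse source lines and yield them in batches of 2."""
    batch = []
    for line in lines:
        batch.append(line.split(","))
        if len(batch) == 2:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch

def writer(batches, sink):
    """Toy Writer: receive batches and 'write' them to the target list."""
    for batch in batches:
        sink.extend(batch)

# The 'framework' step: wire reader output to writer input.
source = ["0,08,beijing", "1,09,shanghai", "0,10,guangzhou"]
sink = []
writer(reader(source), sink)
print(len(sink))  # -> 3
```

The real framework adds channel parallelism, error records, and speed statistics on top of this basic hand-off.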

Execute Export

vim /opt/wzk/hive/ads_ad_show_place.sh

Write the following content:

#!/bin/bash
source /etc/profile
JSON=/opt/wzk/datax
if [ -n "$1" ]; then
  do_date=$1
else
  do_date=$(date -d "-1 day" +%F)
fi
python $DATAX_HOME/bin/datax.py -p "-Ddo_date=$do_date" $JSON/ads_ad_show_place.json

Execute the script:

sh /opt/wzk/hive/ads_ad_show_place.sh 2020-07-21
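The script's fallback logic (run for yesterday when no argument is given, matching date -d "-1 day" +%F) can be mirrored in Python for verification. The helper name resolve_do_date is hypothetical, chosen here for illustration:

```python
from datetime import date, timedelta

def resolve_do_date(arg=None):
    """Return the export date: the argument if given, else yesterday (ISO)."""
    if arg:  # explicit date passed on the command line, e.g. "2020-07-21"
        return arg
    return (date.today() - timedelta(days=1)).isoformat()

print(resolve_do_date("2020-07-21"))  # -> 2020-07-21
print(resolve_do_date())              # yesterday in YYYY-MM-DD form
```

This default is also why a run without arguments can export a different date than the examples, a pitfall listed in the troubleshooting table.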

Error Troubleshooting

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| DataX starts and immediately reports a JSON parsing error | A string in preSql spans multiple lines, making the JSON invalid | Check the config with python -m json.tool or an online JSON validator; write delete from ads_ad_show_place where dt='$do_date' as a single-line string |
| DataX task succeeds but MySQL has no data | The HDFS partition path does not exist, or the date variable was not substituted | Check that /user/hive/warehouse/.../dt=$do_date/* actually contains files, confirm the -Ddo_date parameter was passed, and that the Hive partition directory matches the table name |
| Column count mismatch during export | The number of columns read from HDFS does not match the MySQL target table | Compare reader.column and writer.column counts: 5 source columns plus 1 dt constant must align exactly with the 6 target columns |
| Script reports datax.py not found | Environment variables not loaded or $DATAX_HOME not set | Check echo $DATAX_HOME, add DATAX_HOME to /etc/profile, and re-source it |
| Duplicate rows in MySQL after export | The partition was not deleted before the write, or preSql did not take effect | Check for duplicate records within the same dt, keep preSql as delete ... where dt='$do_date', and confirm the table has a dt column |
| Hive query returns no results | The ADS loading script did not run successfully, or the wrong table name was queried | Run show tables first, then check the script logs; unify table and script names and avoid mixing ad_show_page with ads_ad_show_place |
| Readers cannot reproduce the steps | Naming, code-block language markers, and screenshots in the article are not consistent with the text | Read through headings, script names, SQL table names, and JSON paths to unify naming conventions; use shell code blocks for Shell and sql code blocks for SQL |
| Script runs normally but exports the wrong date | With no argument the script defaults to yesterday, which differs from the example dates | Print the do_date value (add echo $do_date to the script for easy verification) and rerun with an explicit date |
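The "date variable not replaced" failure from the table above can be caught early with a small check: after substituting do_date into the HDFS path template, no $do_date placeholder may remain. The helper render_path below is hypothetical, written only to illustrate the check:

```python
# Path template taken from the DataX reader config in this article.
PATH_TEMPLATE = "/user/hive/warehouse/ads.db/ads_ad_show_place/dt=$do_date/*"

def render_path(template, do_date):
    """Substitute do_date into the template and fail loudly if it is missing."""
    rendered = template.replace("$do_date", do_date)
    if "$do_date" in rendered or not do_date:
        raise ValueError("do_date was not substituted: " + rendered)
    return rendered

print(render_path(PATH_TEMPLATE, "2020-07-21"))
# -> /user/hive/warehouse/ads.db/ads_ad_show_place/dt=2020-07-21/*
```

If DataX reads an empty or unsubstituted path, the job can still "succeed" while writing zero rows, which is exactly the second symptom in the table.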