TL;DR

  • Scenario: in an offline advertising data warehouse, the ADS-layer Hive data loaded each day must be exported reliably to MySQL for querying and display.
  • Conclusion: Shell + a Hive partition-loading script + a parameterized DataX JSON config makes a simple, reusable offline export solution.
  • Output: covers ADS loading, DataX export, MySQL landing, and a checklist of common errors and fixes.

Import Data

Load ADS Layer

ad_show_page

sh /opt/wzk/hive/ads_load_ad_show_page.sh 2020-07-21

View data in Hive:

hive

use ads;
select * from ads_ad_show_place limit 5;

Continue importing other data:

sh /opt/wzk/hive/ads_load_ad_show_page.sh 2020-07-22
sh /opt/wzk/hive/ads_load_ad_show_page.sh 2020-07-23
sh /opt/wzk/hive/ads_load_ad_show_page.sh 2020-07-24

Check the final row count in Hive:

select count(*) from ads_ad_show_place;

ad_show_page_window

sh /opt/wzk/hive/ads_load_ad_show_page_window.sh 2020-07-21

View the data in Hive:

hive

use ads;
select * from ads_ad_show_place_window limit 5;

Continue loading other data:

sh /opt/wzk/hive/ads_load_ad_show_page_window.sh 2020-07-22
sh /opt/wzk/hive/ads_load_ad_show_page_window.sh 2020-07-23
sh /opt/wzk/hive/ads_load_ad_show_page_window.sh 2020-07-24

Check the total row count in Hive:

select count(*) from ads_ad_show_place_window;

Export Data

Execution Steps

  • Create corresponding table in MySQL
  • Create configuration file (JSON)
  • Run DataX with the JSON configuration file and verify the result
  • Write execution script (Shell)
  • Test Shell script

MySQL

drop table if exists dwads.ads_ad_show_place;
create table dwads.ads_ad_show_place(
  ad_action tinyint,
  hour varchar(2),
  place varchar(20),
  product_id int,
  cnt int,
  dt varchar(10)
);

DataX

Configuration File

vim /opt/wzk/datax/ads_ad_show_place.json

Write the following content:

{
  "job":{
    "setting":{
      "speed":{
        "channel":1
      }
    },
    "content":[
      {
        "reader":{
          "name":"hdfsreader",
          "parameter":{
            "path":"/user/hive/warehouse/ads.db/ads_ad_show_place/dt=$do_date/*",
            "defaultFS":"hdfs://h121.wzk.icu:9000",
            "column":[
              {
                "index":0,
                "type":"string"
              },
              {
                "index":1,
                "type":"string"
              },
              {
                "index":2,
                "type":"string"
              },
              {
                "index":3,
                "type":"string"
              },
              {
                "index":4,
                "type":"string"
              },
              {
                "type":"string",
                "value":"$do_date"
              }
            ],
            "fileType":"text",
            "encoding":"UTF-8",
            "fieldDelimiter":","
          }
        },
        "writer":{
          "name":"mysqlwriter",
          "parameter":{
            "writeMode":"insert",
            "username":"hive",
            "password":"hive@wzk.icu",
            "column":[
              "ad_action",
              "hour",
              "place",
              "product_id",
              "cnt",
              "dt"
            ],
            "preSql":[
              "delete from ads_ad_show_place where dt='$do_date'"
            ],
            "connection":[
              {
                "jdbcUrl":"jdbc:mysql://h122.wzk.icu:3306/dwads?useUnicode=true&characterEncoding=utf-8",
                "table":[
                  "ads_ad_show_place"
                ]
              }
            ]
          }
        }
      }
    ]
  }
}
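Because the HDFS reader's column list (5 indexed columns plus one $do_date constant) must line up with the MySQL writer's 6 columns, it is worth sanity-checking the config before running it. The sketch below inlines a trimmed copy of the config above so the check is self-contained; in practice you would json.load the actual file at /opt/wzk/datax/ads_ad_show_place.json.

```python
import json

# Trimmed copy of the DataX job above: only the column lists matter here.
CONFIG = """
{
  "job": {"content": [{
    "reader": {"parameter": {"column": [
      {"index": 0, "type": "string"},
      {"index": 1, "type": "string"},
      {"index": 2, "type": "string"},
      {"index": 3, "type": "string"},
      {"index": 4, "type": "string"},
      {"type": "string", "value": "$do_date"}
    ]}},
    "writer": {"parameter": {"column":
      ["ad_action", "hour", "place", "product_id", "cnt", "dt"]}}
  }]}
}
"""

content = json.loads(CONFIG)["job"]["content"][0]
n_read = len(content["reader"]["parameter"]["column"])
n_write = len(content["writer"]["parameter"]["column"])

# Source columns + the dt constant must equal the target column count.
assert n_read == n_write == 6, (n_read, n_write)
print("reader/writer column counts match:", n_read)
```

json.loads will also catch the multi-line preSql problem listed in the troubleshooting table below a malformed file fails to parse at all.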

DataX Introduction

DataX is an offline data synchronization tool open-sourced by Alibaba, built to move data between heterogeneous storage systems. It is designed to be efficient, stable, and easy to extend, and it adapts to a wide range of data synchronization requirements.

Core Features

Supports multiple data sources:

  • Relational databases: MySQL, PostgreSQL, Oracle, SQL Server, DB2, etc.
  • NoSQL databases: MongoDB, HBase, etc.
  • Big data storage systems: Hive, MaxCompute (ODPS), HDFS.
  • Others: FTP files, ElasticSearch, etc.

High performance:

  • Based on multi-threaded parallel architecture, fully utilizes multi-core performance of machines.
  • Supports sharded transmission, improving data transmission throughput.

Flexibility and ease of use:

  • Configuration file based: Use JSON format configuration files to define tasks, simple and intuitive.
  • Supports task scheduling, can integrate with scheduling tools to implement scheduled tasks.
  • Compatible with multiple data formats and transmission protocols.

Extensibility:

  • Plugin-based architecture, developers can support new data sources by writing Reader and Writer plugins.

Open source and community support:

  • Based on Apache 2.0 open source license, developers can freely use and modify.
  • Has active community and rich documentation support.

Components

Reader:

  • Responsible for reading data from data sources.
  • Examples: MySQLReader, HdfsReader.

Writer:

  • Responsible for writing data to target storage.
  • Examples: MySQLWriter, HdfsWriter.

Framework:

  • DataX’s core scheduling engine, responsible for coordinating Reader and Writer.
  • Provides error handling, data validation, performance optimization, etc.

Transform:

  • Used for processing and transforming transmitted data (optional).
  • For example, data format conversion, field addition/deletion/modification, etc.

Workflow

Initialization:

  • Load user-configured JSON file, parse Reader and Writer configuration.
  • Prepare task context.

Read data:

  • Reader reads source data and outputs in batches.

Transform data:

  • Optional step, Transform module processes data.

Write data:

  • Writer receives data from Reader and writes to target storage.

Task management and monitoring:

  • DataX provides real-time task running logs and statistics, including speed, success rate, error messages, etc.
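The Reader → Framework → Writer flow above can be sketched as a toy pipeline. This is purely illustrative: the function names are made up and do not reflect the real DataX plugin API; the point is that the reader emits records in batches and the framework hands them to the writer.

```python
def reader(lines):
    """Toy Reader: parse source lines and yield them in batches of 2."""
    batch = []
    for line in lines:
        batch.append(line.split(","))
        if len(batch) == 2:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch

def writer(batches, sink):
    """Toy Writer: receive batches and 'write' them to the target list."""
    for batch in batches:
        sink.extend(batch)

# The 'framework' step: wire reader output to writer input.
source = ["0,08,beijing", "1,09,shanghai", "0,10,guangzhou"]
sink = []
writer(reader(source), sink)
print(len(sink))  # -> 3
```

The real framework adds channel parallelism, error records, and speed statistics on top of this basic hand-off.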

Execute Export

vim /opt/wzk/hive/ads_ad_show_place.sh

Write the following content:

#!/bin/bash
source /etc/profile
JSON=/opt/wzk/datax
if [ -n "$1" ]; then
  do_date=$1
else
  do_date=$(date -d "-1 day" +%F)
fi
python $DATAX_HOME/bin/datax.py -p "-Ddo_date=$do_date" $JSON/ads_ad_show_place.json

Execute the script:

sh /opt/wzk/hive/ads_ad_show_place.sh 2020-07-21
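The script's fallback logic (run for yesterday when no argument is given, matching date -d "-1 day" +%F) can be mirrored in Python for verification. The helper name resolve_do_date is hypothetical, chosen here for illustration:

```python
from datetime import date, timedelta

def resolve_do_date(arg=None):
    """Return the export date: the argument if given, else yesterday (ISO)."""
    if arg:  # explicit date passed on the command line, e.g. "2020-07-21"
        return arg
    return (date.today() - timedelta(days=1)).isoformat()

print(resolve_do_date("2020-07-21"))  # -> 2020-07-21
print(resolve_do_date())              # yesterday in YYYY-MM-DD form
```

This default is also why a run without arguments can export a different date than the examples, a pitfall listed in the troubleshooting table.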

Error Troubleshooting

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| DataX starts and immediately reports a JSON parsing error | A string in preSql spans multiple lines, making the JSON invalid | Check the config with python -m json.tool or an online JSON validator; write delete from ads_ad_show_place where dt='$do_date' as a single-line string |
| DataX task succeeds but MySQL has no data | The HDFS partition path does not exist, or the date variable was not substituted | Check that /user/hive/warehouse/.../dt=$do_date/* actually contains files, confirm the -Ddo_date parameter was passed, and that the Hive partition directory matches the table name |
| Column count mismatch during export | The number of columns read from HDFS does not match the MySQL target table | Compare reader.column and writer.column counts: 5 source columns plus 1 dt constant must align exactly with the 6 target columns |
| Script reports datax.py not found | Environment variables not loaded or $DATAX_HOME not set | Check echo $DATAX_HOME, add DATAX_HOME to /etc/profile, and re-source it |
| Duplicate rows in MySQL after export | The partition was not deleted before the write, or preSql did not take effect | Check for duplicate records within the same dt, keep preSql as delete ... where dt='$do_date', and confirm the table has a dt column |
| Hive query returns no results | The ADS loading script did not run successfully, or the wrong table name was queried | Run show tables first, then check the script logs; unify table and script names and avoid mixing ad_show_page with ads_ad_show_place |
| Readers cannot reproduce the steps | Naming, code-block language markers, and screenshots in the article are not consistent with the text | Read through headings, script names, SQL table names, and JSON paths to unify naming conventions; use shell code blocks for Shell and sql code blocks for SQL |
| Script runs normally but exports the wrong date | With no argument the script defaults to yesterday, which differs from the example dates | Print the do_date value (add echo $do_date to the script for easy verification) and rerun with an explicit date |
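The "date variable not replaced" failure from the table above can be caught early with a small check: after substituting do_date into the HDFS path template, no $do_date placeholder may remain. The helper render_path below is hypothetical, written only to illustrate the check:

```python
# Path template taken from the DataX reader config in this article.
PATH_TEMPLATE = "/user/hive/warehouse/ads.db/ads_ad_show_place/dt=$do_date/*"

def render_path(template, do_date):
    """Substitute do_date into the template and fail loudly if it is missing."""
    rendered = template.replace("$do_date", do_date)
    if "$do_date" in rendered or not do_date:
        raise ValueError("do_date was not substituted: " + rendered)
    return rendered

print(render_path(PATH_TEMPLATE, "2020-07-21"))
# -> /user/hive/warehouse/ads.db/ads_ad_show_place/dt=2020-07-21/*
```

If DataX reads an empty or unsubstituted path, the job can still "succeed" while writing zero rows, which is exactly the second symptom in the table.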