Article Overview

This article presents a complete solution for exporting Hive ADS-layer tables to MySQL using DataX's hdfsreader + mysqlwriter combination.

Architecture Description

The article describes the overall offline data warehouse process: ODS -> DWD -> DWS -> ADS -> DataX export to MySQL. The ADS layer has 4 tables to export:

  • ads.ads_member_active_count (Active Member Count)
  • ads.ads_member_retention_count (Member Retention Count)
  • ads.ads_member_retention_rate (Member Retention Rate)
  • ads.ads_new_member_cnt (New Member Count)

MySQL Table Creation Statement

-- Create database and table
create database dwads;

-- Active member count table
create table dwads.ads_member_active_count(
  `dt` varchar(10) COMMENT 'Statistics date',
  `day_count` int COMMENT 'Member count for the day',
  `week_count` int COMMENT 'Member count for the week',
  `month_count` int COMMENT 'Member count for the month',
  primary key (dt)
);

-- New member count table
create table dwads.ads_new_member_cnt(
  `dt` varchar(10) COMMENT 'Statistics date',
  `cnt` int,
  primary key (dt)
);

-- Member retention count table
create table dwads.ads_member_retention_count(
  `dt` varchar(10) COMMENT 'Statistics date',
  `add_date` varchar(10) comment 'Addition date',
  `retention_day` int comment 'Retention days up to current date',
  `retention_count` bigint comment 'Retention count',
  primary key (dt, add_date)
);

-- Member retention rate table
create table dwads.ads_member_retention_rate(
  `dt` varchar(10) COMMENT 'Statistics date',
  `add_date` varchar(10) comment 'Addition date',
  `retention_day` int comment 'Retention days up to current date',
  `retention_count` bigint comment 'Retention count',
  `new_mid_count` bigint comment 'New member count for that day',
  `retention_ratio` decimal(10,2) comment 'Retention rate',
  primary key (dt, add_date)
);
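As a cross-check on the retention-rate DDL, `retention_ratio` is `retention_count` divided by `new_mid_count`. A minimal Python sketch, assuming the rate is stored as a percentage quantized to two decimals to match `decimal(10,2)` (the article does not pin down fraction vs. percentage):

```python
from decimal import Decimal, ROUND_HALF_UP

def retention_ratio(retention_count, new_mid_count):
    """Percentage of retained members, rounded to decimal(10,2) precision."""
    if new_mid_count == 0:
        return Decimal("0.00")  # guard against division by zero on empty cohorts
    pct = Decimal(retention_count) * 100 / Decimal(new_mid_count)
    return pct.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(retention_ratio(3, 8))  # → 37.50
```

Computing the percentage upstream in Hive and exporting it as text keeps the mysqlwriter column mapping purely positional.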

DataX JSON Configuration Example

{
  "job": {
    "setting": {
      "speed": {
        "channel": 1
      }
    },
    "content": [{
      "reader": {
        "name": "hdfsreader",
        "parameter": {
          "path": "/user/hive/warehouse/ads.db/ads_member_active_count/dt=$do_date/*",
          "defaultFS": "hdfs://h121.wzk.icu:9000",
          "column": [
            {"type": "string", "value": "$do_date"},
            {"index": 0, "type": "string"},
            {"index": 1, "type": "string"},
            {"index": 2, "type": "string"}
          ],
          "fileType": "text",
          "encoding": "UTF-8",
          "fieldDelimiter": ","
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "writeMode": "replace",
          "username": "hive",
          "password": "hive@wzk.icu",
          "column": ["dt","day_count","week_count","month_count"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://h122.wzk.icu:3306/dwads?useUnicode=true&characterEncoding=utf-8",
            "table": ["ads_member_active_count"]
          }]
        }
      }
    }]
  }
}
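The four job files differ only in table name, column list, and HDFS path, so they can be generated instead of hand-copied. A sketch under that assumption (`build_job` is an illustrative helper, not a DataX API):

```python
import json

def build_job(table, columns, hdfs_path):
    """Build a DataX hdfsreader -> mysqlwriter job for one ADS table.

    columns[0] is the partition column dt, supplied as the constant
    $do_date; the remaining columns are read positionally from the file.
    """
    return {
        "job": {
            "setting": {"speed": {"channel": 1}},
            "content": [{
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "path": hdfs_path,
                        "defaultFS": "hdfs://h121.wzk.icu:9000",
                        # constant dt column first, then one entry per data column
                        "column": [{"type": "string", "value": "$do_date"}]
                                  + [{"index": i, "type": "string"}
                                     for i in range(len(columns) - 1)],
                        "fileType": "text",
                        "encoding": "UTF-8",
                        "fieldDelimiter": ","
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "replace",
                        "username": "hive",
                        "password": "hive@wzk.icu",
                        "column": columns,
                        "connection": [{
                            "jdbcUrl": "jdbc:mysql://h122.wzk.icu:3306/dwads"
                                       "?useUnicode=true&characterEncoding=utf-8",
                            "table": [table]
                        }]
                    }
                }
            }]
        }
    }

cfg = build_job("ads_new_member_cnt", ["dt", "cnt"],
                "/user/hive/warehouse/ads.db/ads_new_member_cnt/dt=$do_date/*")
print(json.dumps(cfg, ensure_ascii=False, indent=2))
```

Writing each result to `export_<table>.json` keeps the reader column indexes and writer column order in sync by construction, which removes the most common mismatch error.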

Execution Command

python datax.py -p "-Ddo_date=2020-07-21" /opt/wzk/datax/export_member_active_count.json

Shell Script Template

#!/bin/bash
JSON=/opt/wzk/datax
source /etc/profile
if [ -n "$1" ]; then
  do_date=$1
else
  do_date=$(date -d "-1 day" +%F)
fi
python $DATAX_HOME/bin/datax.py -p "-Ddo_date=$do_date" $JSON/export_member_active_count.json
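The same template drives all four exports. The sketch below only assembles the command lines, mirroring the script's default-date logic (table list and `export_<table>.json` file names are assumed from this article):

```python
import datetime

TABLES = ["member_active_count", "new_member_cnt",
          "member_retention_count", "member_retention_rate"]

def resolve_do_date(arg=None):
    # same default as the shell script: yesterday in YYYY-MM-DD
    if arg:
        return arg
    return (datetime.date.today() - datetime.timedelta(days=1)).isoformat()

def build_commands(do_date, json_dir="/opt/wzk/datax"):
    # one datax.py invocation per ADS table
    return [
        f'python $DATAX_HOME/bin/datax.py -p "-Ddo_date={do_date}" '
        f"{json_dir}/export_{t}.json"
        for t in TABLES
    ]

for cmd in build_commands(resolve_do_date("2020-07-21")):
    print(cmd)
```

Running the four jobs sequentially from one wrapper keeps them on a single schedule entry; with channel = 1 per job there is little to gain from parallelizing them.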

Error Troubleshooting Table

  • Symptom: DataX runs successfully but MySQL has no data. Root cause: the HDFS path does not match the partition directory. Fix: check the reader-side record count in the DataX log, verify the HDFS directory exists, and confirm the -Ddo_date parameter was passed.
  • Symptom: FileNotFoundException. Root cause: wrong path or missing partition directory. Fix: verify the path with hdfs dfs -ls and ensure the Hive table has data in the corresponding partition.
  • Symptom: Unknown database/table. Root cause: the MySQL database or table was not created. Fix: create the database and table first, and keep the database name in jdbcUrl consistent.
  • Symptom: data truncation error. Root cause: column type or order mismatch. Fix: verify that the reader column indexes match the writer column order.
  • Symptom: replace mode overwrites rows unexpectedly. Root cause: unreasonable primary key design. Fix: design a composite key for multi-dimensional detail tables, or use insert mode.
  • Symptom: JSON parsing failed. Root cause: invalid JSON. Fix: write jdbcUrl as a single-line string and validate the file with jq.
  • Symptom: Communications link failure. Root cause: network, port, or permission issue. Fix: test the port with telnet and check the account, password, and grants.
  • Symptom: datax.py: command not found. Root cause: DATAX_HOME is not in effect. Fix: confirm the environment variable or use an absolute path.
  • Symptom: data is read but dt is wrong. Root cause: the parameter was not passed or the variable name is inconsistent. Fix: check that -Ddo_date matches the variable name used in the JSON.
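For the "runs successfully but no data" symptom, a quick sanity check is to compare the reader-side record count from the DataX job summary against a MySQL row count. A hedged sketch that parses the count from typical summary output (the summary line wording "读出记录总数" is an assumption based on common DataX builds; adjust the pattern to your version):

```python
import re

def records_read(summary):
    """Extract the total-records-read figure from a DataX job summary.

    Assumes a line of the form '读出记录总数 : <n>'; returns None if
    no such line is present.
    """
    m = re.search(r"读出记录总数\s*:\s*(\d+)", summary)
    return int(m.group(1)) if m else None

sample = "任务启动时刻 : ...\n读出记录总数 : 4\n读写失败总数 : 0"
print(records_read(sample))  # → 4
```

If this count is 0, the problem is on the reader side (path or partition), not in MySQL; if it is nonzero but MySQL is empty, check the jdbcUrl database name and writeMode.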

Important Notes

  1. Version Compatibility: DataX 3.x plugin parameters may vary slightly between distributions.
  2. Storage Format: this example uses fileType "text" for plain text files; hdfsreader also supports orc, rc, seq, and csv, so set fileType to match the Hive table's storage format.
  3. Kerberos Scenario: authentication and HA-related parameters require additional configuration.
  4. writeMode Selection: replace relies on a primary key or unique key, so the table structure must be designed with that in mind.