Big Data 239 - Flume to HDFS to Hive: Advertising Business

Overview

Building on an advertising-business scenario, this article shows how to use Flume to import logs into HDFS for an offline data warehouse, and then complete ODS- and DWD-layer processing with Hive scripts.

Overall Architecture

The overall architecture of the advertising business covers the complete chain: data collection, transmission, storage, and processing.

Flume Agent Configuration

Flume is a distributed, reliable, and extensible system for collecting, aggregating, and transmitting large volumes of log data.

Agent Core Components

Each Flume Agent consists of three parts:

  • Source: receives data from external producers and writes it into the Channel
  • Channel: temporarily buffers events between the Source and the Sink
  • Sink: delivers events to an external storage system (such as HDFS)
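As a concrete illustration, the three components can be wired together in a single configuration file. The agent name a1 matches the startup command in this article, but the paths, capacities, and HDFS URI below are illustrative assumptions, not the contents of the actual flume-log2hdfs3.conf:

```properties
# Minimal sketch of an agent named a1 (paths and hostnames are illustrative)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail log files landing under a collection directory
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/wzk/logs/event/.*\.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between Source and Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Sink: roll files into date-partitioned HDFS directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:9000/user/data/event/dt=%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
```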

Startup Command

flume-ng agent --conf-file /opt/wzk/flume-conf/flume-log2hdfs3.conf --name a1 -Dflume.root.logger=INFO,console

Note: the original command contained the typo -Dflume.roog.logger; the correct option is -Dflume.root.logger.

Scalability and Fault Tolerance

Flume Agents support distributed deployment: multiple Agents can be chained so that data flows from node to node. Fault tolerance relies on the Channel's transactional queue mechanism, where events stay in the Channel until the Sink has successfully delivered them.
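Chaining Agents across nodes is typically done by pairing an Avro Sink on the upstream node with an Avro Source on the downstream node. A minimal sketch, with hypothetical agent names, hostname, and port:

```properties
# agent1 (upstream node): forward events over Avro (names are illustrative)
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector-node
agent1.sinks.k1.port = 4141
agent1.sinks.k1.channel = c1

# agent2 (downstream node): receive events over Avro, then write to HDFS
agent2.sources.r1.type = avro
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4141
agent2.sources.r1.channels = c1
```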

Data Preparation and Loading

Prepare Data

Prepare the event data files and upload them to the configured directory; Flume then parses them according to its configuration.

Observe Results

After execution, you can view the generated data files in HDFS.
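A quick way to inspect the result is to list the landed files directly. The HDFS path below is an assumption; substitute the hdfs.path configured in your Sink:

```
# List the files Flume wrote for one date (path is illustrative)
hdfs dfs -ls /user/data/event/dt=2020-07-21

# Peek at the first few records of the landed data
hdfs dfs -cat /user/data/event/dt=2020-07-21/* | head -n 5
```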

Script Execution Order

The advertising-business scripts are executed in the following order:

  1. ods_load_event_log.sh
  2. dwd_load_event_log.sh
  3. dwd_load_ad_log.sh
  4. ads_load_ad_show.sh
  5. ads_load_ad_show_rate.sh
  6. ads_load_ad_show_page.sh
  7. ads_load_ad_show_page_window.sh
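The order above can be wrapped in a small driver script. This is a sketch assuming the /opt/wzk/hive layout used in this article; it only prints each command (dry run), so remove the echo to actually execute the scripts:

```shell
#!/bin/sh
# Dry-run driver: print the seven scripts in dependency order for one date.
# Remove the echo to execute them for real.
run_all() {
  dt="$1"
  for s in ods_load_event_log.sh \
           dwd_load_event_log.sh \
           dwd_load_ad_log.sh \
           ads_load_ad_show.sh \
           ads_load_ad_show_rate.sh \
           ads_load_ad_show_page.sh \
           ads_load_ad_show_page_window.sh; do
    echo "sh /opt/wzk/hive/$s $dt"
  done
}

run_all 2020-07-21
```

Running each downstream script only after its upstream has succeeded keeps the ODS → DWD → ADS dependency chain intact.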

ODS Layer Loading

Execute ODS layer loading script:

sh /opt/wzk/hive/ods_load_event_log.sh 2020-07-21

Verify data in Hive:

hive
use ods;
select * from ods_log_event limit 5;
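Before querying, it can help to confirm the partition actually landed. A hypothetical check, assuming the partition column is named dt:

```sql
-- Confirm the loaded date shows up as a partition (dt is an assumed name)
show partitions ods.ods_log_event;

-- Row count for the specific date
select count(*) from ods.ods_log_event where dt = '2020-07-21';
```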

You can run the script repeatedly with different date parameters to batch-load more data.
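One way to batch the loads is a simple date loop. The sketch below prints the command for each date (GNU date is assumed for the date arithmetic); replace the echo with direct execution to actually run the loads:

```shell
#!/bin/sh
# Backfill sketch: print the ODS load command for N consecutive days
# starting from a given date. Requires GNU date for "+ N day" arithmetic.
backfill() {
  start="$1"
  days="$2"
  i=0
  while [ "$i" -lt "$days" ]; do
    dt=$(date -d "$start + $i day" +%F)
    echo "sh /opt/wzk/hive/ods_load_event_log.sh $dt"
    i=$((i + 1))
  done
}

backfill 2020-07-21 5
```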

DWD Layer Loading

event_log Loading

sh /opt/wzk/hive/dwd_load_event_log.sh 2020-07-21

Verify data:

hive
use dwd;
select * from dwd_event_log limit 5;

ad_log Loading

sh /opt/wzk/hive/dwd_load_ad_log.sh 2020-07-21

Verify data count:

select count(*) from dwd_ad;
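If the total looks low, a per-partition breakdown quickly shows which dates are missing or incomplete. A hypothetical check, assuming the partition column is named dt:

```sql
-- Row counts per partition: gaps or outliers point at dates that
-- failed to load (partition column name dt is an assumption)
select dt, count(*) as cnt
from dwd.dwd_ad
group by dt
order by dt;
```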

Troubleshooting Guide

  • Symptom: Flume starts but no data enters HDFS
    Root cause: the Source is not listening to the file/directory, or the format does not match
    Diagnosis: check the Flume console logs first, then the Source's listening path and whether files have landed
    Fix: verify the collection directory, file permissions, and filename rules; confirm the Source configuration matches the input data

  • Symptom: the Flume command fails immediately with an error
    Root cause: typo in a startup parameter
    Diagnosis: check the parameter names in the startup command
    Fix: correct the -Dflume.root.logger parameter

  • Symptom: HDFS does not generate the target file
    Root cause: Sink misconfiguration, or the HDFS path lacks permissions
    Diagnosis: check the Flume Sink logs and verify the HDFS target directory
    Fix: verify the HDFS URI, directory permissions, Sink path template, and NameNode accessibility

  • Symptom: a Hive query on the ODS table returns nothing
    Root cause: the ODS load script did not run successfully, or the partition was not written
    Diagnosis: run show partitions and check the script logs
    Fix: confirm the script's date parameter, source path, and Hive database/table names; re-run the affected date if needed

  • Symptom: a Hive query on the DWD table returns nothing
    Root cause: no data in ODS, ETL SQL that filters too strictly, or failed field parsing
    Diagnosis: check the ODS partitions first, then the DWD load logs
    Fix: make sure ODS has data first, then review the DWD SQL where conditions and field-splitting logic

  • Symptom: select count(*) from dwd_ad; returns an abnormally low count
    Root cause: the upstream ad_log was not fully imported, or not all dates were executed
    Diagnosis: compare the list of executed dates against the raw data volume in HDFS
    Fix: re-run the missing dates; verify that the ad_log script ran and the target partitions all landed

  • Symptom: data is duplicated after re-running a script for the same day
    Root cause: the load logic appends instead of overwriting
    Diagnosis: check whether the Hive SQL uses insert overwrite or insert into
    Fix: for partitioned tables, prefer overwrite, or clean the target partition before re-running

  • Symptom: Flume runs but with poor throughput or high latency
    Root cause: Channel capacity too small, or a disk I/O or network bottleneck
    Diagnosis: check the Channel backlog, Sink flush behavior, and machine resources
    Fix: tune the batch size, channel capacity, and transactionCapacity; switch to a file channel if needed

  • Symptom: log files are uploaded but Flume stops collecting
    Root cause: behavioral differences between the Taildir, Exec, and Spooldir Sources
    Diagnosis: confirm the specific Source type and whether it supports re-consuming files
    Fix: identify the Source type and handle file landing, renaming, and historical-file import accordingly

  • Symptom: SQL execution fails with a partition-related error
    Root cause: incorrect partition field, date parameter, or dynamic-partition settings
    Diagnosis: check the CREATE TABLE statement, the script parameters, and the Hive parameters
    Fix: verify the partition field name and date format; add the Hive dynamic-partition settings if required
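For the duplicate-data case, idempotent loads come from overwriting the target partition rather than appending to it. A sketch with illustrative column names (the tables match this article, but the columns are assumptions):

```sql
-- insert overwrite replaces the dt partition, so re-running the same
-- date does not duplicate rows (column names are illustrative)
insert overwrite table dwd.dwd_event_log partition (dt = '2020-07-21')
select device_id, event_name, event_time
from ods.ods_log_event
where dt = '2020-07-21';
```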

Summary

This article covered the Source, Channel, and Sink structure of a Flume Agent, log file upload, the Flume startup command, HDFS write verification, and the core script execution order. It serves as a practical reference for how Flume, HDFS, and Hive work together in the Hadoop ecosystem.