Big Data 239 - Flume to HDFS to Hive: Advertising Business

Overview

Building on an advertising-business scenario, this article shows how to use Flume to import logs into HDFS for an offline data warehouse, and then complete ODS- and DWD-layer processing with Hive scripts.

Overall Architecture

The overall architecture of the advertising business covers the complete chain: data collection, transmission, storage, and processing.

Flume Agent Configuration

Flume is a distributed, reliable, and extensible system for collecting, aggregating, and transmitting large volumes of log data.

Agent Core Components

Each Flume Agent consists of three parts:

  • Source: receives data from external producers and writes it into the Channel
  • Channel: temporarily buffers events between the Source and the Sink
  • Sink: delivers events to an external storage system (such as HDFS)
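As a concrete illustration, the three components can be wired together in a single configuration file. The agent name a1 matches the startup command in this article, but the paths, capacities, and HDFS URI below are illustrative assumptions, not the contents of the actual flume-log2hdfs3.conf:

```properties
# Minimal sketch of an agent named a1 (paths and hostnames are illustrative)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail log files landing under a collection directory
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/wzk/logs/event/.*\.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between Source and Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Sink: roll files into date-partitioned HDFS directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:9000/user/data/event/dt=%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
```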

Startup Command

flume-ng agent --conf-file /opt/wzk/flume-conf/flume-log2hdfs3.conf --name a1 -Dflume.root.logger=INFO,console

Note: the original command contained the typo -Dflume.roog.logger; the correct option is -Dflume.root.logger.

Scalability and Fault Tolerance

Flume Agents support distributed deployment: multiple Agents can be chained so that data flows from node to node. Fault tolerance relies on the Channel's transactional queue mechanism, where events stay in the Channel until the Sink has successfully delivered them.
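Chaining Agents across nodes is typically done by pairing an Avro Sink on the upstream node with an Avro Source on the downstream node. A minimal sketch, with hypothetical agent names, hostname, and port:

```properties
# agent1 (upstream node): forward events over Avro (names are illustrative)
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector-node
agent1.sinks.k1.port = 4141
agent1.sinks.k1.channel = c1

# agent2 (downstream node): receive events over Avro, then write to HDFS
agent2.sources.r1.type = avro
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4141
agent2.sources.r1.channels = c1
```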

Data Preparation and Loading

Prepare Data

Prepare the event data files and upload them to the configured directory; Flume then parses them according to its configuration.

Observe Results

After execution, you can view the generated data files in HDFS.
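A quick way to inspect the result is to list the landed files directly. The HDFS path below is an assumption; substitute the hdfs.path configured in your Sink:

```
# List the files Flume wrote for one date (path is illustrative)
hdfs dfs -ls /user/data/event/dt=2020-07-21

# Peek at the first few records of the landed data
hdfs dfs -cat /user/data/event/dt=2020-07-21/* | head -n 5
```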

Script Execution Order

The advertising-business scripts are executed in the following order:

  1. ods_load_event_log.sh
  2. dwd_load_event_log.sh
  3. dwd_load_ad_log.sh
  4. ads_load_ad_show.sh
  5. ads_load_ad_show_rate.sh
  6. ads_load_ad_show_page.sh
  7. ads_load_ad_show_page_window.sh
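The order above can be wrapped in a small driver script. This is a sketch assuming the /opt/wzk/hive layout used in this article; it only prints each command (dry run), so remove the echo to actually execute the scripts:

```shell
#!/bin/sh
# Dry-run driver: print the seven scripts in dependency order for one date.
# Remove the echo to execute them for real.
run_all() {
  dt="$1"
  for s in ods_load_event_log.sh \
           dwd_load_event_log.sh \
           dwd_load_ad_log.sh \
           ads_load_ad_show.sh \
           ads_load_ad_show_rate.sh \
           ads_load_ad_show_page.sh \
           ads_load_ad_show_page_window.sh; do
    echo "sh /opt/wzk/hive/$s $dt"
  done
}

run_all 2020-07-21
```

Running each downstream script only after its upstream has succeeded keeps the ODS → DWD → ADS dependency chain intact.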

ODS Layer Loading

Execute ODS layer loading script:

sh /opt/wzk/hive/ods_load_event_log.sh 2020-07-21

Verify data in Hive:

hive
use ods;
select * from ods_log_event limit 5;
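Before querying, it can help to confirm the partition actually landed. A hypothetical check, assuming the partition column is named dt:

```sql
-- Confirm the loaded date shows up as a partition (dt is an assumed name)
show partitions ods.ods_log_event;

-- Row count for the specific date
select count(*) from ods.ods_log_event where dt = '2020-07-21';
```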

You can run the script repeatedly with different date parameters to batch-load more data.
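One way to batch the loads is a simple date loop. The sketch below prints the command for each date (GNU date is assumed for the date arithmetic); replace the echo with direct execution to actually run the loads:

```shell
#!/bin/sh
# Backfill sketch: print the ODS load command for N consecutive days
# starting from a given date. Requires GNU date for "+ N day" arithmetic.
backfill() {
  start="$1"
  days="$2"
  i=0
  while [ "$i" -lt "$days" ]; do
    dt=$(date -d "$start + $i day" +%F)
    echo "sh /opt/wzk/hive/ods_load_event_log.sh $dt"
    i=$((i + 1))
  done
}

backfill 2020-07-21 5
```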

DWD Layer Loading

event_log Loading

sh /opt/wzk/hive/dwd_load_event_log.sh 2020-07-21

Verify data:

hive
use dwd;
select * from dwd_event_log limit 5;

ad_log Loading

sh /opt/wzk/hive/dwd_load_ad_log.sh 2020-07-21

Verify data count:

select count(*) from dwd_ad;
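If the total looks low, a per-partition breakdown quickly shows which dates are missing or incomplete. A hypothetical check, assuming the partition column is named dt:

```sql
-- Row counts per partition: gaps or outliers point at dates that
-- failed to load (partition column name dt is an assumption)
select dt, count(*) as cnt
from dwd.dwd_ad
group by dt
order by dt;
```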

Troubleshooting Guide

  • Symptom: Flume starts but no data enters HDFS
    Root cause: the Source is not listening to the file/directory, or the format does not match
    Diagnosis: check the Flume console logs first, then the Source's listening path and whether files have landed
    Fix: verify the collection directory, file permissions, and filename rules; confirm the Source configuration matches the input data

  • Symptom: the Flume command fails immediately with an error
    Root cause: typo in a startup parameter
    Diagnosis: check the parameter names in the startup command
    Fix: correct the -Dflume.root.logger parameter

  • Symptom: HDFS does not generate the target file
    Root cause: Sink misconfiguration, or the HDFS path lacks permissions
    Diagnosis: check the Flume Sink logs and verify the HDFS target directory
    Fix: verify the HDFS URI, directory permissions, Sink path template, and NameNode accessibility

  • Symptom: a Hive query on the ODS table returns nothing
    Root cause: the ODS load script did not run successfully, or the partition was not written
    Diagnosis: run show partitions and check the script logs
    Fix: confirm the script's date parameter, source path, and Hive database/table names; re-run the affected date if needed

  • Symptom: a Hive query on the DWD table returns nothing
    Root cause: no data in ODS, ETL SQL that filters too strictly, or failed field parsing
    Diagnosis: check the ODS partitions first, then the DWD load logs
    Fix: make sure ODS has data first, then review the DWD SQL where conditions and field-splitting logic

  • Symptom: select count(*) from dwd_ad; returns an abnormally low count
    Root cause: the upstream ad_log was not fully imported, or not all dates were executed
    Diagnosis: compare the list of executed dates against the raw data volume in HDFS
    Fix: re-run the missing dates; verify that the ad_log script ran and the target partitions all landed

  • Symptom: data is duplicated after re-running a script for the same day
    Root cause: the load logic appends instead of overwriting
    Diagnosis: check whether the Hive SQL uses insert overwrite or insert into
    Fix: for partitioned tables, prefer overwrite, or clean the target partition before re-running

  • Symptom: Flume runs but with poor throughput or high latency
    Root cause: Channel capacity too small, or a disk I/O or network bottleneck
    Diagnosis: check the Channel backlog, Sink flush behavior, and machine resources
    Fix: tune the batch size, channel capacity, and transactionCapacity; switch to a file channel if needed

  • Symptom: log files are uploaded but Flume stops collecting
    Root cause: behavioral differences between the Taildir, Exec, and Spooldir Sources
    Diagnosis: confirm the specific Source type and whether it supports re-consuming files
    Fix: identify the Source type and handle file landing, renaming, and historical-file import accordingly

  • Symptom: SQL execution fails with a partition-related error
    Root cause: incorrect partition field, date parameter, or dynamic-partition settings
    Diagnosis: check the CREATE TABLE statement, the script parameters, and the Hive parameters
    Fix: verify the partition field name and date format; add the Hive dynamic-partition settings if required
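For the duplicate-data case, idempotent loads come from overwriting the target partition rather than appending to it. A sketch with illustrative column names (the tables match this article, but the columns are assumptions):

```sql
-- insert overwrite replaces the dt partition, so re-running the same
-- date does not duplicate rows (column names are illustrative)
insert overwrite table dwd.dwd_event_log partition (dt = '2020-07-21')
select device_id, event_name, event_time
from ods.ods_log_event
where dt = '2020-07-21';
```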

Summary

This article covered the Source, Channel, and Sink structure of a Flume Agent, log file upload, the Flume startup command, HDFS write verification, and the core script execution order. It serves as a practical reference for how Flume, HDFS, and Hive work together in the Hadoop ecosystem.