Big Data 239 - Flume to HDFS to Hive: Advertising Business
Overview
Using the advertising business as the example, this article shows how to import logs into HDFS with Flume for an offline data warehouse, and then complete ODS- and DWD-layer processing with Hive scripts.
Overall Architecture
The overall architecture diagram of the advertising business shows the complete chain of data collection, transmission, storage, and processing.
Flume Agent Configuration
Flume is a distributed, reliable, and extensible system for collecting, aggregating, and transmitting large volumes of log data.
Agent Core Components
Each Flume Agent consists of three parts:
- Source: Used to receive data
- Channel: Used to temporarily store data between Source and Sink
- Sink: Used to transmit data to external storage systems (such as HDFS)
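The three components are wired together in a properties file. The sketch below is a minimal, hypothetical configuration (the agent name a1, the Taildir source, and all paths are assumptions; the article's actual flume-log2hdfs3.conf may differ):

```properties
# Hypothetical minimal agent "a1": Taildir source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail log files as they land (watch path is an assumption)
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/wzk/logs/event/.*\.log
a1.sources.r1.channels = c1

# Channel: in-memory buffer between Source and Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Sink: write events into date-partitioned HDFS directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/data/logs/event/dt=%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
```

The date escape in hdfs.path produces one directory per day, which lines up naturally with the daily partitions the ODS layer loads later.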
Startup Command
flume-ng agent --conf-file /opt/wzk/flume-conf/flume-log2hdfs3.conf --name a1 -Dflume.root.logger=INFO,console
Note: the original command contained the typo -Dflume.roog.logger; the correct parameter is -Dflume.root.logger, as shown above.
Scalability and Fault Tolerance
Flume Agents support distributed deployment, and data can be relayed between nodes by chaining multiple Agents. Fault tolerance relies on the transactional handoff between Channel and Sink: an event stays in the Channel until the Sink has successfully delivered it downstream.
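If events must survive an agent crash or restart, the in-memory channel can be swapped for a file channel, which persists events to local disk until the Sink commits them. A hypothetical fragment (both directories are assumptions):

```properties
# Hypothetical file channel: buffered events survive an agent restart
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/wzk/flume/checkpoint
a1.channels.c1.dataDirs = /opt/wzk/flume/data
```

The trade-off is throughput: the file channel is slower than the memory channel but does not lose buffered events on failure.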
Data Preparation and Loading
Prepare Data
Prepare the event log files and upload them to the directory the Flume Source is watching; Flume will then parse and forward them according to its configuration.
Observe Results
After execution, you can view the generated data files in HDFS.
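Assuming the Sink writes to a date-partitioned directory (the path below is hypothetical and must be replaced with your Sink's hdfs.path), the output can be inspected with the HDFS CLI:

```shell
# List the files the Sink produced for one date (path is an assumption)
hdfs dfs -ls /user/data/logs/event/dt=2020-07-21

# Peek at the first few records to confirm the format
hdfs dfs -cat /user/data/logs/event/dt=2020-07-21/* | head -n 5
```

If the directory is empty, check the Flume console logs before debugging downstream; the troubleshooting table below covers the common causes.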
Script Execution Order
Script execution order in the advertising business:
1. ods_load_event_log.sh
2. dwd_load_event_log.sh
3. dwd_load_ad_log.sh
4. ads_load_ad_show.sh
5. ads_load_ad_show_rate.sh
6. ads_load_ad_show_page.sh
7. ads_load_ad_show_page_window.sh
ODS Layer Loading
Execute ODS layer loading script:
sh /opt/wzk/hive/ods_load_event_log.sh 2020-07-21
Verify data in Hive:
hive
use ods;
select * from ods_log_event limit 5;
To load more data, run the script once per date in a batch.
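One way to batch over dates is a small driver loop. This sketch uses the script path from the article, but the loop logic is mine and it only echoes the commands it would run (GNU date and bash are assumed); swap echo for the real call once verified:

```shell
#!/usr/bin/env bash
# Sketch: generate one ODS load command per date in [start, end]
start="2020-07-21"
end="2020-07-23"

# Build the list of dates, one day at a time
dates=()
d="$start"
while [ "$(date -d "$d" +%s)" -le "$(date -d "$end" +%s)" ]; do
  dates+=("$d")
  d="$(date -d "$d + 1 day" +%F)"
done

for dt in "${dates[@]}"; do
  # Replace echo with the real invocation to actually load:
  echo "sh /opt/wzk/hive/ods_load_event_log.sh $dt"
done
```

The same loop works for the DWD and ADS scripts; only the script name changes.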
DWD Layer Loading
event_log Loading
sh /opt/wzk/hive/dwd_load_event_log.sh 2020-07-21
Verify data:
hive
use dwd;
select * from dwd_event_log limit 5;
ad_log Loading
sh /opt/wzk/hive/dwd_load_ad_log.sh 2020-07-21
Verify data count:
select count(*) from dwd_ad;
Troubleshooting Guide
| Symptom | Root Cause | Diagnosis | Fix |
|---|---|---|---|
| Flume starts but no data enters HDFS | Source not listening to file/directory or format mismatch | Check Flume console logs first, then check Source listening path and whether files are landed | Verify collection directory, file permissions, filename rules, confirm Source configuration matches input data |
| Flume command execution reports error directly | Startup parameter typo | Check parameter names in startup command | Fix -Dflume.root.logger parameter |
| HDFS doesn’t generate target file | Sink configuration error or HDFS path lacks permissions | Check Flume Sink logs, verify HDFS target directory | Verify HDFS URI, directory permissions, Sink path template and NameNode accessibility |
| Hive query ODS table is empty | ODS load script didn’t execute successfully, or partition not written | Execute show partitions, check script logs | Confirm script date parameter, source path, Hive database/table name, re-run specified date if needed |
| Hive query DWD table is empty | No data in ODS, ETL SQL filters too strictly, or field parsing failed | Check ODS partitions first, then check DWD load logs | Ensure ODS has data first, then check DWD SQL where conditions, field splitting logic |
| `select count(*) from dwd_ad;` returns an abnormally low count | Upstream ad_log not fully imported, or not all dates executed | Compare the list of executed dates with the raw data volume in HDFS | Re-run the missing dates; verify the ad_log script and target partitions all landed |
| Data duplicates after running script on same day | Load logic uses append instead of overwrite | Check if Hive SQL uses insert overwrite or insert into | For partition scenarios, prefer overwrite, or clean target partition before re-running |
| Flume runs but poor performance/high latency | Channel capacity too small or disk IO/network bottleneck | Check Channel backlog, Sink flush, machine resources | Adjust batch size, channel capacity, transactionCapacity, switch to file channel if needed |
| Log files uploaded but Flume stops collecting | Behavior differences between Source types (Taildir/Exec/Spooldir) | Confirm the specific Source type and whether it supports re-consuming files | Identify the Source type and handle file delivery, renaming, and historical-file import accordingly |
| SQL execution reports partition-related error | Partition field, date parameter, or dynamic partition settings incorrect | Check CREATE TABLE statement, script parameters, Hive parameters | Verify partition field name and date format, add Hive dynamic partition related settings |
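For the duplicate-data row above: idempotent re-runs require the load SQL to replace the target partition rather than append to it. A hypothetical HiveQL shape (table and column names are assumptions, not the article's actual scripts):

```sql
-- Hypothetical: INSERT OVERWRITE replaces the partition on re-run,
-- so loading the same date twice never duplicates rows
insert overwrite table dwd.dwd_ad partition (dt = '2020-07-21')
select ad_id, device_id, action
from ods.ods_ad_log
where dt = '2020-07-21';
```

With `insert into` instead, every re-run of the same date appends another full copy of that day's data.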
Summary
This article covers the basic Source/Channel/Sink structure of a Flume Agent, log file upload, the Flume startup command, HDFS write verification, and the core script execution order. It should help you understand how Flume, HDFS, and Hive work together in the Hadoop ecosystem.