Tag: Flume

17 articles

Offline Data Warehouse Advertising Business: Flume Import HDFS + ODS/DWD

Use a Flume agent to collect event logs and write them to HDFS, then use Hive scripts to load the ODS and DWD layers by date.

Big Data 237 - Offline Data Warehouse Hive Advertising Practice: ODS to DWD Event Parsing

This article covers parsing, cleaning, and detail-level modeling from ODS to DWD in an offline data warehouse, based on advertising events in tracking logs.

Big Data 235 - Offline Data Warehouse Practice: Flume, HDFS and Hive for ODS, DWD, DWS and ADS

This article demonstrates a complete offline data warehouse pipeline from log collection to member metric analysis.

Hive ODS Layer JSON Parsing: UDF Array Extraction, explode/json_tuple

JSON data processing in a Hive offline data warehouse, covering the three most common needs: 1) Extract array fields from JSON strings and expand them with explode in SQL

Hive ODS Layer Practice: External Table Partition Loading and JSON Parsing

Engineering implementation of the ODS (Operational Data Store) layer in an offline data warehouse: the minimal closed loop of Hive external table + daily.

Big Data 228 - Flume Taildir + Custom Interceptor: Extract JSON Timestamps, Mark Headers & Partition HDFS by Event Time

Apache Flume offline log collection implementation using Taildir Source and a custom Interceptor to extract JSON timestamps, mark headers, and route HDFS partitions by ev...

Flume 1.9.0 Custom Interceptor: TAILDIR Multi-directory Collection, Filter by Logtime

Use the TAILDIR Source to monitor multiple directories (start/event), and use filegroups headers to mark each source with a logtype

Big Data 226 - Flume Optimization for Offline Data Warehouse: batchSize, Channels, Compression, Interceptors & OOM Tuning

Flume 1.9.0 tuning guide for offline data warehouse log collection to HDFS, covering batch parameters, channel capacity and transaction sizing, JVM heap tuning...

Offline Data Warehouse Member Metrics Practice

Scenario: Use startup logs and event logs in an offline data warehouse to count new members, active members (DAU/WAU/MAU), and retention.

Big Data 223 - How to Build an Offline Data Warehouse: Tracking, Metrics and Thematic Analysis

Complete guide to building offline data warehouses: data tracking, metric system design, theme analysis, and standardization practices for e-commerce teams.

Offline Data Warehouse Architecture Selection and Cluster Design

Offline Data Warehouse (Offline DW) overall architecture design and implementation method: Framework selection comparison between Apache community version and.

Big Data 221 - Offline Data Warehouse Layering: ODS, DWD, DWS, DIM and ADS Architecture

Scenario: The more data marts individual departments build, the more definitions diverge and interfaces disconnect, creating data silos and driving up data query costs.

Offline Data Warehouse Modeling Practice

In data warehouse architecture, the fact table is the core table structure that stores the metric values (facts) of a business process.

Flume Collect Hive Logs to HDFS

Use the Flume exec source to track Hive log files in real time, buffer through a memory channel, and configure the HDFS sink to write with time-based partitioning, implementing automatic log dat...

Flume Dual Sink: Write Logs to Both HDFS and Local File

This is article 20 in the Big Data series. It demonstrates Flume's replication mode with a dual-sink architecture, in which the same data is written to both HDFS and the local filesystem.

Apache Flume Architecture and Core Concepts

Introduction to Apache Flume: its positioning, core components (Source, Channel, Sink), event model, common data flow topologies, and installation and configuration.

Flume Hello World: NetCat Source + Memory Channel + Logger Sink

Flume's simplest Hello World case: a netcat source listens on a port, a memory channel buffers events, and a logger sink prints to the console, demonstrating the complete Source→...