Tag: flume

17 articles

Offline Data Warehouse Advertising Business: Flume Import...

Uses a Flume Agent to collect event logs and write them to HDFS, then uses Hive scripts to load the ODS and DWD layers by date. Content covers Flume Agent's Source, Channel, Sink basic structu...

12/2/2024

Offline Data Warehouse Hive Advertising Business Practice...

Advertising business practice for a Hive offline data warehouse: using a typical pipeline of Flume + Hive + UDF + Parquet, it demonstrates how to map raw event...

11/29/2024

Offline Data Warehouse Practice: Flume+HDFS+Hive Building...

Demonstrates a complete pipeline from log collection to member metric analysis, covering Flume Taildir monitoring, HDFS partition storage, Hive external table loading, ODS/DWD/DWS/ADS layered proce...

11/27/2024

Hive ODS Layer JSON Parsing: UDF Array Extraction, explod...

JSON data processing in a Hive offline data warehouse, covering the three most common needs: 1) Extract array fields from JSON strings and expand them with explode in SQL; 2)...

11/21/2024

Hive ODS Layer Practice: External Table Partition Loading...

Engineering implementation of the ODS (Operational Data Store) layer in an offline data warehouse: the minimal closed loop of Hive external table + daily...

11/20/2024

Flume Taildir + Custom Interceptor: Extract JSON Timestam...

An engineering implementation of an offline log collection chain with Apache Flume: use Taildir Source to monitor multiple directories and multiple...

11/19/2024

Flume 1.9.0 Custom Interceptor: TAILDIR Multi-directory C...

Use TAILDIR Source to monitor multiple directories (start/event) and filegroups headers to mark the different sources with logtype; then use custom...

11/18/2024

Flume Optimization for Offline Data Warehouse: batchSize,...

Engineering optimization of Flume 1.9.0 in offline data warehouse scenarios (log collection → HDFS): gives actionable value ranges and trade-off principles for key...

11/16/2024

Offline Data Warehouse Member Metrics Practice

Offline data warehouse member metrics practice with Flume, covering new members, active members, and retention metrics.

11/15/2024

How to Build an Offline Data Warehouse: Tracking → Metric...

Complete guide to building offline data warehouses: data tracking, metric system design, theme analysis, and standardization practices for e-commerce teams.

11/14/2024

Offline Data Warehouse Architecture Selection and Cluster...

Overall architecture design and implementation approach for an Offline Data Warehouse (Offline DW): a framework selection comparison between the Apache community version and...

11/14/2024

Offline Data Warehouse Layering: ODS/DWD/DWS/DIM/ADS Arch...

When implementing an Offline Data Warehouse in an enterprise, the two most common problems are: data silos caused by data mart expansion, and duplicate...

11/13/2024

Offline Data Warehouse Modeling Practice

Offline data warehouse modeling practice, covering fact tables, dimension tables, fact types, and snowflake/galaxy models.

11/13/2024

Flume Collect Hive Logs to HDFS

Use the Flume exec source to track Hive log files in real time, buffer events in a memory channel, and configure the HDFS sink to write with time-based partitioning, so that log data lands in HDFS automatically.

7/17/2024

Flume Dual Sink: Write Logs to Both HDFS and Local File

Using Flume's replication mode (Replicating Channel Selector) and a three-Agent cascade architecture, write the same log data to both HDFS and a local file, meeting both offline analysis and re...

7/17/2024

Apache Flume Architecture and Core Concepts

Introduction to Apache Flume: its positioning, core components (Source, Channel, Sink), event model, common data flow topologies, and installation and configuration.

7/13/2024

Flume Hello World: NetCat Source + Memory Channel + Logge...

Flume's simplest Hello World case: a netcat source listens on a port, a memory channel buffers events, and a logger sink prints them to the console, demonstrating the complete Source→Channel→Sink data flow.

7/13/2024