Tag: flume
17 articles
Offline Data Warehouse Advertising Business: Flume Import...
Use a Flume Agent to collect event logs and write them to HDFS, then use Hive scripts to load the ODS and DWD layers by date. Covers the Flume Agent's Source, Channel, Sink basic structu...
Offline Data Warehouse Hive Advertising Business Practice...
An advertising business practice for a Hive offline data warehouse: using the typical Flume + Hive + UDF + Parquet pipeline, it demonstrates how to map raw event...
Offline Data Warehouse Practice: Flume+HDFS+Hive Building...
Demonstrates a complete pipeline from log collection to member metric analysis, covering Flume Taildir monitoring, HDFS partition storage, Hive external table loading, ODS/DWD/DWS/ADS layered proce...
Hive ODS Layer JSON Parsing: UDF Array Extraction, explod...
JSON data processing in a Hive offline data warehouse, covering the three most common needs: 1) extract array fields from JSON strings and expand them with explode in SQL; 2)...
Hive ODS Layer Practice: External Table Partition Loading...
Engineering implementation of the ODS (Operational Data Store) layer in an offline data warehouse: the minimal closed loop of Hive external table + daily...
Flume Taildir + Custom Interceptor: Extract JSON Timestam...
An engineering implementation of an offline log collection pipeline with Apache Flume: use Taildir Source to monitor multiple directories and multiple...
Flume 1.9.0 Custom Interceptor: TAILDIR Multi-directory C...
Use TAILDIR Source to monitor multiple directories (start/event) and use filegroups headers to mark each source with a logtype; then use custom...
Flume Optimization for Offline Data Warehouse: batchSize,...
Engineering optimization of Flume 1.9.0 in offline data warehouse (log collection → HDFS) scenarios, giving actionable value ranges and trade-off principles for key...
Offline Data Warehouse Member Metrics Practice
Offline data warehouse member metrics practice with Flume, covering new members, active members, and retention metrics.
How to Build an Offline Data Warehouse: Tracking → Metric...
Complete guide to building offline data warehouses: data tracking, metric system design, theme analysis, and standardization practices for e-commerce teams.
Offline Data Warehouse Architecture Selection and Cluster...
Overall architecture design and implementation approach for an Offline Data Warehouse (Offline DW): a framework selection comparison between the Apache community version and...
Offline Data Warehouse Layering: ODS/DWD/DWS/DIM/ADS Arch...
When implementing an Offline Data Warehouse in an enterprise, the two most common problems are: data silos caused by data mart expansion, and duplicate...
Offline Data Warehouse Modeling Practice
Offline data warehouse modeling practice, covering fact tables, dimension tables, fact types, and snowflake/galaxy models.
Flume: Collect Hive Logs to HDFS
Use the Flume exec source to track Hive log files in real time, buffer events in a memory channel, and configure the HDFS sink to write with time-based partitioning so that log data lands in HDFS automatically.
Flume Dual Sink: Write Logs to Both HDFS and Local File
Using Flume's replication mode (Replicating Channel Selector) and a three-Agent cascade architecture, write the same log data to both HDFS and a local file, meeting both offline analysis and re...
Apache Flume Architecture and Core Concepts
An introduction to Apache Flume's positioning, core components (Source, Channel, Sink), event model, common data flow topologies, and installation and configuration.
Flume Hello World: NetCat Source + Memory Channel + Logge...
Flume's simplest Hello World case: a netcat source listens on a port, a memory channel buffers events, and a logger sink prints them to the console, demonstrating the complete Source→Channel→Sink data flow.