Tag: hdfs

29 articles

Offline Data Warehouse: E-commerce Core Transaction Incre...

Using DataX (MySQLReader + HDFSWriter) to extract daily incremental data from MySQL order tables, order detail tables, and product information tables into...

Offline Data Warehouse Advertising Business Hive ADS Prac...

Complete solution for exporting Hive ADS layer data to MySQL using DataX. Covers ADS loading, DataX configuration, MySQL table creation, Shell script parameterized execution, and common error diagn...

Offline Data Warehouse Advertising Business: Flume Import...

Use a Flume Agent to collect event logs and write them to HDFS, then use Hive scripts to load the ODS and DWD layers by date. Covers the Flume Agent's Source, Channel, Sink basic structu...

Offline Data Warehouse Advertising Business Hive Analysis...

Hourly statistics for advertising impressions, clicks, and purchases on a Hive offline data warehouse, completing CTR, CVR and advertising effect...

Offline Data Warehouse Hive Advertising Business Practice...

Advertising business practice on a Hive offline data warehouse: using a typical Flume + Hive + UDF + Parquet pipeline, demonstrates how to map raw event...

Offline Data Warehouse Member Metrics Verification, DataX...

Offline data warehouse practice based on Hadoop + Hive + HDFS + DataX + MySQL, covering member metrics testing (active/new/retention), HDFS export, DataX sync to MySQL, and advertising business ODS...

Offline Data Warehouse Practice: Flume+HDFS+Hive Building...

Demonstrates a complete pipeline from log collection to member metric analysis, covering Flume Taildir monitoring, HDFS partition storage, Hive external table loading, ODS/DWD/DWS/ADS layered proce...

Offline Data Warehouse Hive ADS Export MySQL DataX Practi...

A practical path for exporting Hive ADS layer tables to MySQL in an offline data warehouse. Presents the typical DataX solution: hdfsreader -> mysqlwriter. Focuses on DataX JSON configuration and common erro...

Offline Data Warehouse Retention Rate Implementation: DWS...

Implementing 'Member Retention' in an offline data warehouse: the DWS layer uses the dws_member_retention_day table to join the new member and startup detail tables to...

Offline Data Warehouse Hive New Member & Retention: DWS D...

Calculates 'new members' daily in an offline data warehouse, providing a consistently defined data foundation for the subsequent 'member retention' metric. Uses the 'full member table (with first day dt)' as deduplica...

Offline Data Warehouse Hive Practice: DWD to DWS Daily/We...

This article introduces using Hive to build an offline data warehouse for calculating active members (daily/weekly/monthly). Covers the complete flow from DWD...

Hive ODS Layer JSON Parsing: UDF Array Extraction, explod...

JSON data processing in a Hive offline data warehouse, covering the three most common needs: 1) Extract array fields from JSON strings and expand them with explode in SQL; 2)...

Hive ODS Layer Practice: External Table Partition Loading...

Engineering implementation of the ODS (Operational Data Store) layer in an offline data warehouse: the minimal closed loop of Hive external table + daily...

Flume Taildir + Custom Interceptor: Extract JSON Timestam...

An engineering implementation of an offline log collection chain with Apache Flume: use Taildir Source to monitor multiple directories and multiple...

Flume 1.9.0 Custom Interceptor: TAILDIR Multi-directory C...

Using TAILDIR Source to monitor multiple directories (start/event), using filegroups headers to mark different sources with logtype; then using custom...

Flume Optimization for Offline Data Warehouse: batchSize,...

Engineering optimization of Flume 1.9.0 in offline data warehouse (log collection → HDFS) scenarios, giving actionable value ranges and trade-off principles for key...

Offline Data Warehouse Member Metrics Practice

Offline data warehouse member metrics practice with Flume, covering new members, active members, and retention metrics.

How to Build an Offline Data Warehouse: Tracking → Metric...

Complete guide to building offline data warehouses: data tracking, metric system design, theme analysis, and standardization practices for e-commerce teams.

Offline Data Warehouse Architecture Selection and Cluster...

Overall architecture design and implementation of an Offline Data Warehouse (Offline DW): a framework selection comparison between the Apache community version and...

Offline Data Warehouse Layering: ODS/DWD/DWS/DIM/ADS Arch...

When implementing an Offline Data Warehouse in an enterprise, the two most common problems are: data silos caused by data mart expansion, and duplicate...

Offline Data Warehouse Modeling Practice

Offline data warehouse modeling practice, covering fact tables, dimension tables, fact types, and snowflake/galaxy models.

Flume Collect Hive Logs to HDFS

Use the Flume exec source to track Hive log files in real time, buffer events in a memory channel, and configure the HDFS sink to write with time-based partitioning, so that log data lands in HDFS automatically.

Flume Dual Sink: Write Logs to Both HDFS and Local File

Using Flume's replication mode (Replicating Channel Selector) and a three-Agent cascade architecture, write the same log data to both HDFS and a local file, meeting both offline analysis and re...

HDFS Java Client Practice: Upload/Download Files, Directo...

File operations with the Hadoop HDFS Java Client API: Maven dependency configuration, the FileSystem/Path/Configuration core classes, and implementing file upload, download, delete, list scan and progress ba...

HDFS Distributed File System Read/Write Principle

Deep dive into HDFS architecture: the NameNode, DataNode, and Client roles, the Block storage mechanism, the file read/write process (Pipeline write and nearest read), and basic HDFS commands.

HDFS CLI Practice Complete Command Guide

Complete HDFS CLI practice: hadoop fs common commands including directory operations, file upload/download, permission management, with three-node cluster live demo.

Hadoop Cluster WordCount Distributed Computing Practice

Complete WordCount execution on a Hadoop cluster: upload files to HDFS, submit a MapReduce job, view its running status through the YARN UI, and verify true distributed computing.

Hadoop Cluster Startup and Web UI Verification

Complete startup process for a three-node Hadoop cluster: format the NameNode, start HDFS and YARN, and verify cluster status via the Web UI, including start-dfs.sh and start-yarn.sh usage.

Hadoop Cluster XML Configuration Details

Detailed explanation of Hadoop cluster three-node XML configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, including NameNode, DataNode, ResourceManager configuration ...