Tag: HDFS
29 articles
Big Data 243 - Offline Data Warehouse: E-commerce Core Transaction Incremental Import
Scenario: three core e-commerce transaction tables are incrementally imported into the offline data warehouse ODS layer every day, partitioned by dt. Conclusion: DataX uses MySQLReader + HDFSWriter.
Big Data 240 - Offline Data Warehouse Advertising Hive ADS Practice: DataX Export to MySQL
Complete solution for exporting Hive ADS layer data to MySQL using DataX.
Offline Data Warehouse Advertising Business: Flume Import HDFS + ODS/DWD
Uses a Flume Agent to collect event logs and write them to HDFS, then uses Hive scripts to load the ODS and DWD layers by date.
Offline Data Warehouse Advertising Business Hive Analysis: CTR/CVR/Top100
action: user behavior (0 = impression, 1 = click after impression, 2 = purchase); duration: stay duration; shopid: merchant ID; eventtype: "ad"; adtype: format type (1 = JPG, 2 = PNG)
Big Data 237 - Offline Data Warehouse Hive Advertising Practice: ODS to DWD Event Parsing
This article covers parsing, cleaning, and detail modeling from ODS to DWD in an offline data warehouse, based on advertising events in tracking logs.
Big Data 236 - Offline Data Warehouse Member Metrics Verification, DataX Export and Advertising Pipeline
A hands-on verification of the member theme and the advertising business pipeline, based on Hadoop + Hive + HDFS + DataX + MySQL.
Big Data 235 - Offline Data Warehouse Practice: Flume, HDFS and Hive for ODS, DWD, DWS and ADS
This article demonstrates a complete offline data warehouse pipeline from log collection to member metric analysis.
Offline Data Warehouse Hive ADS Export MySQL DataX Practice
How to export Hive ADS layer tables to MySQL in an offline data warehouse, using the typical DataX solution: hdfsreader -> mysqlwriter.
Big Data 233 - Offline Data Warehouse Retention Rate: DWS Modeling & ADS Hive Aggregation
Implementation of 'Member Retention' in offline data warehouse: DWS layer uses dwsmemberretention_day table to join new member and startup detail tables to.
Big Data 232 - Hive New Member & Retention
The offline data warehouse calculates 'new members' daily, providing a consistently defined data foundation for the subsequent 'member retention' metric.
Offline Data Warehouse Hive Practice: DWD to DWS Daily/Weekly/Monthly Active Member
This article shows how to use Hive in an offline data warehouse to calculate active members (daily/weekly/monthly).
Hive ODS Layer JSON Parsing: UDF Array Extraction, explode/json_tuple
JSON data processing in a Hive offline data warehouse, covering the three most common needs: 1) Extract array fields from JSON strings and explode-expand in SQL
Hive ODS Layer Practice: External Table Partition Loading and JSON Parsing
Engineering implementation of the ODS (Operational Data Store) layer in an offline data warehouse: the minimal closed loop of Hive external table + daily.
Big Data 228 - Flume Taildir + Custom Interceptor: Extract JSON Timestamps, Mark Headers & Partition HDFS by Event Time
Apache Flume offline log collection implementation using Taildir Source and a custom Interceptor to extract JSON timestamps, mark headers, and route HDFS partitions by ev...
Flume 1.9.0 Custom Interceptor: TAILDIR Multi-directory Collection, Filter by Logtime
Uses a TAILDIR Source to monitor multiple directories (start/event) and filegroups headers to tag each source with a logtype
Big Data 226 - Flume Optimization for Offline Data Warehouse: batchSize, Channels, Compression, Interceptors & OOM Tuning
Flume 1.9.0 tuning guide for offline data warehouse log collection to HDFS, covering batch parameters, channel capacity and transaction sizing, JVM heap tuning...
Offline Data Warehouse Member Metrics Practice
Scenario: use startup logs and event logs in the offline data warehouse to count new, active (DAU/WAU/MAU), and retained members.
Big Data 223 - How to Build an Offline Data Warehouse: Tracking, Metrics and Thematic Analysis
Complete guide to building offline data warehouses: data tracking, metric system design, theme analysis, and standardization practices for e-commerce teams.
Offline Data Warehouse Architecture Selection and Cluster Design
Offline Data Warehouse (Offline DW) overall architecture design and implementation method: Framework selection comparison between Apache community version and.
Big Data 221 - Offline Data Warehouse Layering: ODS, DWD, DWS, DIM and ADS Architecture
Scenario: the more data marts individual departments build, the more definitions diverge and interfaces disconnect, creating data silos and driving up data query costs.
Offline Data Warehouse Modeling Practice
In data warehouse architecture, the fact table is the core table structure that stores the metric values (facts) of a business process.
Flume Collect Hive Logs to HDFS
Use a Flume exec source to track Hive log files in real time, buffer via a memory channel, configure an HDFS sink to write with time-based partitioning, implement automatic log dat...
Flume Dual Sink: Write Logs to Both HDFS and Local File
This is article 20 in the Big Data series. Demonstrates Flume replication mode with a dual-sink architecture: the same data is written to both HDFS and the local filesystem.
HDFS Java Client Practice: Upload/Download Files, Directory Operations and API Usage
This is article 9 in the Big Data series. Learn to operate HDFS from Java code and master Hadoop's Java client API.
HDFS Distributed File System Read/Write Principle
Deep dive into HDFS architecture: NameNode, DataNode, Client roles, Block storage mechanism, file read/write process (Pipeline write and nearest read), and HDFS basic com...
HDFS CLI Practice Complete Command Guide
Complete HDFS CLI practice: hadoop fs common commands including directory operations, file upload/download, permission management, with three-node cluster live demo.
Hadoop Cluster WordCount Distributed Computing Practice
Complete WordCount execution on Hadoop cluster: upload files to HDFS, submit MapReduce job, view running status through YARN UI, verify true distributed computing.
Hadoop Cluster Startup and Web UI Verification
Complete startup process for Hadoop three-node cluster: format NameNode, start HDFS and YARN, verify cluster status via Web UI, including start-dfs.sh and start-yarn.
Hadoop Cluster XML Configuration Details
Detailed explanation of Hadoop cluster three-node XML configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.