Tag: hive

39 articles

Hive Slowly Changing Dimension Type 2: Order History Stat...

An offline data warehouse needs to preserve order history state at low cost while supporting daily backtracking and change analysis. This article introduces using ODS...

3/15/2026

Offline Data Warehouse ADS Layer and Airflow Task Schedul...

Apache Airflow is an open-source task scheduling and workflow management platform, primarily used for developing, debugging, and monitoring data pipelines.

12/14/2024

Offline Data Warehouse - Hive Order Zipper Table: Increme...

This article continues the zipper table practice, focusing on order history state incremental refresh. It explains how to use ODS daily incremental table + DWD zipper table to preserve order histor...

12/12/2024

Offline Data Warehouse Dimension Tables: Product Category...

First, determine which tables are fact tables and which are dimension tables: green indicates fact tables, gray indicates dimension tables. Dimension table processing strategies vary by...

12/12/2024

Offline Data Warehouse DWD and DWS Layer: Table Creation ...

The order table is a periodic fact table; zipper tables can be used to retain order status history. Order detail tables are regular fact tables. Order statuses...

12/12/2024

Offline Data Warehouse - Hive Zipper Table Practice: Orde...

This article provides a practical guide to Hive zipper tables for preserving order history states at low cost in offline data warehouses, supporting daily backtracking and change analysis. It cover...

12/11/2024

Offline Data Warehouse - Hive Zipper Table Practice: Init...

This article provides a practical guide to Hive zipper table implementation for offline data warehouse modeling, covering initial loading, daily incremental updates, historical version chain closin...

12/10/2024

Offline Data Warehouse - Hive Zipper Table Getting Starte...

This article systematically covers Slowly Changing Dimensions (SCD), detailing the core differences between SCD Type 0, 1, 2, 3, 4, and 6, and explains the applicable boundaries of snapshot tables ...

12/9/2024

Offline Data Warehouse: Hive ODS Layer Table Creation and...

Sync MySQL data to a specified HDFS directory via DataX, then create ODS external tables in Hive with unified dt string partitioning. Enables fast queries of raw transaction records within 7 days, de...

12/7/2024

Offline Data Warehouse: E-commerce Core Transaction Incre...

Using DataX (MySQLReader + HDFSWriter) to extract daily incremental data from MySQL order tables, order detail tables, and product information tables into...

12/6/2024

Offline Data Warehouse Practice: E-commerce Core Transact...

Focuses on three main metrics (order count, product count, and payment amount), with analysis dimensions broken down by sales region and product type (3-level category).

12/4/2024

Offline Data Warehouse Advertising Business Hive ADS Prac...

Complete solution for exporting Hive ADS layer data to MySQL using DataX. Covers ADS loading, DataX configuration, MySQL table creation, Shell script parameterized execution, and common error diagn...

12/3/2024

Offline Data Warehouse Advertising Business: Flume Import...

Uses a Flume Agent to collect event logs and write them to HDFS, then uses Hive scripts to complete ODS and DWD layer data loading by date. Content covers Flume Agent's Source, Channel, Sink basic structu...

12/2/2024

Offline Data Warehouse Advertising Business Hive Analysis...

Implements hourly statistics for advertising impressions, clicks, and purchases on a Hive offline data warehouse, completing CTR, CVR and advertising effect...

11/30/2024

Offline Data Warehouse Hive Advertising Business Practice...

A Hive offline data warehouse practice for the advertising business that uses a typical Flume + Hive + UDF + Parquet pipeline to demonstrate how to map raw event...

11/29/2024

Offline Data Warehouse Member Metrics Verification, DataX...

Offline data warehouse practice based on Hadoop + Hive + HDFS + DataX + MySQL, covering member metrics testing (active/new/retention), HDFS export, DataX sync to MySQL, and advertising business ODS...

11/28/2024

Offline Data Warehouse Practice: Flume+HDFS+Hive Building...

Demonstrates a complete pipeline from log collection to member metric analysis, covering Flume Taildir monitoring, HDFS partition storage, Hive external table loading, ODS/DWD/DWS/ADS layered proce...

11/27/2024

Offline Data Warehouse Hive ADS Export MySQL DataX Practi...

An implementation path for exporting Hive ADS layer tables to MySQL in an offline data warehouse. Presents the typical DataX solution, hdfsreader -> mysqlwriter, and focuses on DataX JSON configuration and common erro...

11/26/2024

Offline Data Warehouse Retention Rate Implementation: DWS...

Implementation of 'Member Retention' in an offline data warehouse: the DWS layer uses the dws_member_retention_day table to join the new member and startup detail tables to...

11/25/2024

Offline Data Warehouse Hive New Member & Retention: DWS D...

The offline data warehouse calculates 'new members' daily and provides a consistently defined data foundation for the subsequent 'member retention' metric. Uses the 'full member table (with first day dt)' as deduplica...

11/23/2024

Offline Data Warehouse Hive Practice: DWD to DWS Daily/We...

This article introduces using Hive to build an offline data warehouse for calculating active members (daily/weekly/monthly). Covers the complete flow from DWD...

11/22/2024

Hive ODS Layer JSON Parsing: UDF Array Extraction, explod...

JSON data processing in Hive offline data warehouse, covering three most common needs: 1) Extract array fields from JSON strings and explode-expand in SQL; 2)...

11/21/2024

Hive ODS Layer Practice: External Table Partition Loading...

Engineering implementation of the ODS (Operational Data Store) layer in an offline data warehouse: the minimal closed loop of Hive external table + daily...

11/20/2024

Flume Taildir + Custom Interceptor: Extract JSON Timestam...

An engineering implementation of an offline log collection chain with Apache Flume: use Taildir Source to monitor multiple directories and multiple...

11/19/2024

Flume 1.9.0 Custom Interceptor: TAILDIR Multi-directory C...

Using TAILDIR Source to monitor multiple directories (start/event), using filegroups headers to mark different sources with logtype; then using custom...

11/18/2024

Flume Optimization for Offline Data Warehouse: batchSize,...

Flume 1.9.0 engineering optimization in offline data warehouse (log collection→HDFS) scenarios, giving actionable value ranges and trade-off principles for key...

11/16/2024

Offline Data Warehouse Member Metrics Practice

Offline data warehouse member metrics practice with Flume, covering new members, active members, and retention metrics.

11/15/2024

How to Build an Offline Data Warehouse: Tracking → Metric...

Complete guide to building offline data warehouses: data tracking, metric system design, theme analysis, and standardization practices for e-commerce teams.

11/14/2024

Offline Data Warehouse Architecture Selection and Cluster...

Offline Data Warehouse (Offline DW) overall architecture design and implementation method: Framework selection comparison between Apache community version and...

11/14/2024

Offline Data Warehouse Layering: ODS/DWD/DWS/DIM/ADS Arch...

When implementing an Offline Data Warehouse in an enterprise, the two most common problems are: data silos caused by data mart expansion, and duplicate...

11/13/2024

Offline Data Warehouse Modeling Practice

Offline data warehouse modeling practice, covering fact tables, dimension tables, fact types, and snowflake/galaxy models.

11/13/2024

SparkSQL Statements: DataFrame Operations, SQL Queries & ...

Comprehensive guide to SparkSQL core usage including DataFrame API operations, SQL query syntax, lateral view explode, and Hive integration via enableHiveSupport for metadata and table operations.

11/9/2024

Apache Tez Practice: Hive on Tez Installation & Configura...

Apache Tez (example version Tez 0.9.x) as an alternative execution engine to MapReduce on Hadoop2/YARN, providing a DAG (Directed Acyclic Graph) execution model for...

10/28/2024

Sqoop and Hive Integration: MySQL ↔ Hive Bidirectional Da...

Demonstrates using Sqoop to import MySQL data directly into a Hive table and to export Hive data back to MySQL, covering the usage of key parameters such as --hive-import and --create-hive-table.

7/24/2024

Hive Metastore Three Modes and Remote Deployment

Detailed explanation of Hive Metastore's embedded, local, and remote deployment modes, with complete steps to configure a high-availability remote Metastore on a three-node cluster.

7/10/2024

HiveServer2 Configuration and Beeline Remote Connection

Introduces the HiveServer2 architecture and role, then configures the Hadoop proxy user and WebHDFS to implement cross-node JDBC remote access to Hive via the Beeline client.

7/10/2024

Hive DDL and DML Operations

Systematic explanation of Hive DDL (database/table creation, internal and external tables) and DML (data loading, insertion, query) operations, with complete HiveQL examples and configuration optim...

7/8/2024

Hive HQL Advanced: Data Import/Export and Query Practice

Deep dive into Hive's multiple data import methods (LOAD/INSERT/External Table/Sqoop), data export methods, and practical usage of HQL query operations like aggregation, filtering, and sorting.

7/8/2024

Hive Introduction: Architecture and Cluster Installation

Introduction to Hive data warehouse core concepts, architecture components, and pros and cons, with detailed steps to install and configure Hive 2.3.9 on a three-node Hadoop cluster.

7/4/2024