Tag: hive
39 articles
Hive Slowly Changing Dimension Type 2: Order History Stat...
An offline data warehouse needs to preserve order history states at low cost while supporting daily rollback and change analysis. This article introduces using ODS...
Offline Data Warehouse ADS Layer and Airflow Task Schedul...
Apache Airflow is an open-source task scheduling and workflow management platform, primarily used for developing, debugging, and monitoring data pipelines.
Offline Data Warehouse - Hive Order Zipper Table: Increme...
This article continues the zipper table practice, focusing on incremental refresh of order history states. It explains how to use an ODS daily incremental table plus a DWD zipper table to preserve order histor...
Offline Data Warehouse Dimension Tables: Product Category...
First, distinguish fact tables from dimension tables: green indicates fact tables, gray indicates dimension tables. Dimension table processing strategies vary by...
Offline Data Warehouse DWD and DWS Layer: Table Creation ...
The order table is a periodic fact table; zipper tables can be used to retain order status history. Order detail tables are regular fact tables. Order statuses...
Offline Data Warehouse - Hive Zipper Table Practice: Orde...
This article provides a practical guide to Hive zipper tables for preserving order history states at low cost in offline data warehouses, supporting daily backtracking and change analysis. It cover...
Offline Data Warehouse - Hive Zipper Table Practice: Init...
This article provides a practical guide to Hive zipper table implementation for offline data warehouse modeling, covering initial loading, daily incremental updates, historical version chain closin...
Offline Data Warehouse - Hive Zipper Table Getting Starte...
This article systematically covers Slowly Changing Dimensions (SCD), detailing the core differences between SCD Type 0, 1, 2, 3, 4, and 6, and explains the applicable boundaries of snapshot tables ...
Offline Data Warehouse: Hive ODS Layer Table Creation and...
Syncs MySQL data to a specified HDFS directory via DataX, then creates ODS external tables in Hive with unified dt string partitioning. Enables fast queries of raw transaction records within 7 days, de...
Offline Data Warehouse: E-commerce Core Transaction Incre...
Using DataX (MySQLReader + HDFSWriter) to extract daily incremental data from MySQL order tables, order detail tables, and product information tables into...
Offline Data Warehouse Practice: E-commerce Core Transact...
Focuses on three main metrics (order count, product count, and payment amount), broken down by sales region and product type (3-level category).
Offline Data Warehouse Advertising Business Hive ADS Prac...
Complete solution for exporting Hive ADS layer data to MySQL using DataX. Covers ADS loading, DataX configuration, MySQL table creation, Shell script parameterized execution, and common error diagn...
Offline Data Warehouse Advertising Business: Flume Import...
Uses a Flume Agent to collect event logs and write them to HDFS, then uses Hive scripts to load the ODS and DWD layers by date. Content covers Flume Agent's Source, Channel, Sink basic structu...
Offline Data Warehouse Advertising Business Hive Analysis...
Implements hourly statistics for advertising impressions, clicks, and purchases on a Hive offline data warehouse, completing CTR, CVR and advertising effect...
Offline Data Warehouse Hive Advertising Business Practice...
A Hive offline data warehouse practice for the advertising business that uses a typical Flume + Hive + UDF + Parquet pipeline to demonstrate how to map raw event...
Offline Data Warehouse Member Metrics Verification, DataX...
Offline data warehouse practice based on Hadoop + Hive + HDFS + DataX + MySQL, covering member metrics testing (active/new/retention), HDFS export, DataX sync to MySQL, and advertising business ODS...
Offline Data Warehouse Practice: Flume+HDFS+Hive Building...
Demonstrates a complete pipeline from log collection to member metric analysis, covering Flume Taildir monitoring, HDFS partition storage, Hive external table loading, ODS/DWD/DWS/ADS layered proce...
Offline Data Warehouse Hive ADS Export MySQL DataX Practi...
An implementation path for exporting Hive ADS layer tables to MySQL in an offline data warehouse. Presents the typical DataX solution: hdfsreader -> mysqlwriter. Focuses on DataX JSON configuration and common erro...
Offline Data Warehouse Retention Rate Implementation: DWS...
Implementation of 'Member Retention' in offline data warehouse: DWS layer uses dws_member_retention_day table to join new member and startup detail tables to...
Offline Data Warehouse Hive New Member & Retention: DWS D...
The offline data warehouse calculates 'new members' daily and provides a consistently defined data foundation for subsequent 'member retention'. Uses the 'full member table (with first day dt)' as deduplica...
Offline Data Warehouse Hive Practice: DWD to DWS Daily/We...
This article introduces using Hive to build an offline data warehouse for calculating active members (daily/weekly/monthly). Covers the complete flow from DWD...
Hive ODS Layer JSON Parsing: UDF Array Extraction, explod...
JSON data processing in a Hive offline data warehouse, covering the three most common needs: 1) Extract array fields from JSON strings and explode-expand in SQL; 2)...
Hive ODS Layer Practice: External Table Partition Loading...
Engineering implementation of the ODS (Operational Data Store) layer in an offline data warehouse: the minimal closed loop of Hive external table + daily...
Flume Taildir + Custom Interceptor: Extract JSON Timestam...
An engineering implementation of an offline log collection chain with Apache Flume: use Taildir Source to monitor multiple directories and multiple...
Flume 1.9.0 Custom Interceptor: TAILDIR Multi-directory C...
Using TAILDIR Source to monitor multiple directories (start/event), using filegroups headers to mark different sources with logtype; then using custom...
Flume Optimization for Offline Data Warehouse: batchSize,...
Engineering optimization of Flume 1.9.0 in offline data warehouse (log collection→HDFS) scenarios: gives actionable value ranges and trade-off principles for key...
Offline Data Warehouse Member Metrics Practice
Offline data warehouse member metrics practice with Flume, covering new members, active members, and retention metrics.
How to Build an Offline Data Warehouse: Tracking → Metric...
Complete guide to building offline data warehouses: data tracking, metric system design, theme analysis, and standardization practices for e-commerce teams.
Offline Data Warehouse Architecture Selection and Cluster...
Offline Data Warehouse (Offline DW) overall architecture design and implementation method: Framework selection comparison between Apache community version and...
Offline Data Warehouse Layering: ODS/DWD/DWS/DIM/ADS Arch...
When implementing an Offline Data Warehouse in an enterprise, the two most common problems are: data silos caused by data mart expansion, and duplicate...
Offline Data Warehouse Modeling Practice
Offline data warehouse modeling practice, covering fact tables, dimension tables, fact types, and snowflake/galaxy models.
SparkSQL Statements: DataFrame Operations, SQL Queries & ...
Comprehensive guide to SparkSQL core usage including DataFrame API operations, SQL query syntax, lateral view explode, and Hive integration via enableHiveSupport for metadata and table operations.
Apache Tez Practice: Hive on Tez Installation & Configura...
Apache Tez (example version Tez 0.9.x) as an execution engine alternative to MapReduce on Hadoop2/YARN, providing a DAG (Directed Acyclic Graph) execution model for...
Sqoop and Hive Integration: MySQL ↔ Hive Bidirectional Da...
Demonstrates using Sqoop to import MySQL data directly into a Hive table and to export Hive data back to MySQL, covering the usage of key parameters such as --hive-import and --create-hive-table.
Hive Metastore Three Modes and Remote Deployment
Detailed explanation of Hive Metastore's embedded, local, and remote deployment modes, with complete steps to configure a high-availability remote Metastore on a three-node cluster.
HiveServer2 Configuration and Beeline Remote Connection
Introduces HiveServer2's architecture and role, then configures a Hadoop proxy user and WebHDFS to enable cross-node remote JDBC access to Hive via the Beeline client.
Hive DDL and DML Operations
Systematic explanation of Hive DDL (database/table creation, internal and external tables) and DML (data loading, insertion, query) operations, with complete HiveQL examples and configuration optim...
Hive HQL Advanced: Data Import/Export and Query Practice
Deep dive into Hive's multiple data import methods (LOAD/INSERT/External Table/Sqoop), data export methods, and practical usage of HQL query operations like aggregation, filtering, and sorting.
Hive Introduction: Architecture and Cluster Installation
Introduction to Hive data warehouse core concepts, architecture components, and pros/cons, with detailed steps to install and configure Hive 2.3.9 on a three-node Hadoop cluster.