Tag: data-warehouse

45 articles

Canal Deployment: Installation, Service Startup and Commo...

Canal is an open-source data synchronization tool from Alibaba that parses and synchronizes MySQL incremental logs. It simulates the MySQL slave...

Canal Working Principle: Workflow and MySQL Binlog Introd...

Canal is an open-source tool for incremental subscription and consumption of MySQL binlogs, primarily used for data synchronization and distributed...

MySQL Binlog Deep Dive: Storage Directory, Change Records...

MySQL's Binary Log (binlog) is a log file that records all change operations performed on the database (excluding SELECT and SHOW queries). It is...

Canal Data Sync: Introduction, Background, Principles and...

Alibaba B2B's cross-region business between domestic sellers and overseas buyers drove the need for data synchronization between Hangzhou and US data centers.

Real-time Data Warehouse: Background, Architecture, Requi...

Real-time data processing capability has become a key competitive factor for enterprises. Initially, each new requirement spawned a separate real-time task,...

Apache Griffin Configuration: pom.xml, sparkProperties an...

Apache Griffin is an open-source data quality management framework designed to help organizations monitor and improve data quality in big data environments.

Big Data 258 - Griffin with Livy: Architecture, Installat...

Livy is a REST interface for Apache Spark designed to simplify Spark job submission and management, especially in big data processing scenarios. Its main...

Big Data 257 - Data Quality Monitoring: Monitoring Method...

Apache Griffin is an open-source big data quality solution that supports both batch and streaming data quality detection. It can measure data assets from...

Big Data 256 - Atlas Installation: Service Startup, Web A...

Metadata (MetaData) in the narrow sense refers to data that describes data. Broadly, all information beyond business data used to maintain system operation can...

Big Data 255 - Atlas Data Warehouse Metadata Management: ...

Atlas is a metadata framework for the Hadoop platform: a set of scalable core governance services enabling enterprises to effectively meet compliance...

Airflow Core Trade Task Scheduling Integration for Offlin...

Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks. Originally developed by Airbnb...

Airflow Core Concepts: DAG, Operators, Tasks and Python S...

Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks. Originally developed by Airbnb...

Airflow Crontab Scheduling: Introduction, Task Integratio...

Linux systems use the cron (crond) service for scheduled tasks, which is enabled by default. Linux also provides the crontab command for user-level task...

Offline Data Warehouse ADS Layer and Airflow Task Schedul...

Apache Airflow is an open-source task scheduling and workflow management platform, primarily used for developing, debugging, and monitoring data pipelines.

Apache Airflow Installation and Deployment for Offline Da...

Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks. Originally developed by Airbnb...

Offline Data Warehouse - Hive Order Zipper Table: Increme...

This article continues the zipper table practice, focusing on order history state incremental refresh. It explains how to use ODS daily incremental table + DWD zipper table to preserve order histor...

Offline Data Warehouse Dimension Tables: Product Category...

First, determine which tables are fact tables and which are dimension tables: green indicates fact tables, gray indicates dimension tables. Dimension table processing strategies vary by...

Offline Data Warehouse DWD and DWS Layer: Table Creation ...

The order table is a periodic fact table; zipper tables can be used to retain order status history. Order detail tables are regular fact tables. Order statuses...

Offline Data Warehouse - Hive Zipper Table Practice: Orde...

This article provides a practical guide to Hive zipper tables for preserving order history states at low cost in offline data warehouses, supporting daily backtracking and change analysis. It cover...

Offline Data Warehouse - Hive Zipper Table Practice: Init...

This article provides a practical guide to Hive zipper table implementation for offline data warehouse modeling, covering initial loading, daily incremental updates, historical version chain closin...

Offline Data Warehouse - Hive Zipper Table Getting Starte...

This article systematically covers Slowly Changing Dimensions (SCD), detailing the core differences between SCD Type 0, 1, 2, 3, 4, and 6, and explains the applicable boundaries of snapshot tables ...

Offline Data Warehouse: Hive ODS Layer Table Creation and...

Sync MySQL data to a specified HDFS directory via DataX, then create ODS external tables in Hive with unified dt string partitioning. This enables fast queries of raw transaction records within 7 days, de...

Offline Data Warehouse: E-commerce Core Transaction Incre...

Using DataX (MySQLReader + HDFSWriter) to extract daily incremental data from MySQL order tables, order detail tables, and product information tables into...

Offline Data Warehouse Practice: E-commerce Core Transact...

Focuses on three main metrics (order count, product count, and payment amount), broken down by sales region and product type (3-level category).

Offline Data Warehouse Advertising Business Hive ADS Prac...

A complete solution for exporting Hive ADS layer data to MySQL using DataX. Covers ADS loading, DataX configuration, MySQL table creation, parameterized Shell script execution, and common error diagn...

Offline Data Warehouse Advertising Business: Flume Import...

Uses a Flume Agent to collect event logs and write them to HDFS, then uses Hive scripts to load the ODS and DWD layers by date. Covers the Flume Agent's Source, Channel, Sink basic structu...

Offline Data Warehouse Advertising Business Hive Analysis...

Implements hourly statistics for advertising impressions, clicks, and purchases in a Hive offline data warehouse, completing CTR, CVR and advertising effect...

Offline Data Warehouse Hive Advertising Business Practice...

An advertising business practice for a Hive offline data warehouse that uses a typical Flume + Hive + UDF + Parquet pipeline to demonstrate how to map raw event...

Offline Data Warehouse Member Metrics Verification, DataX...

Offline data warehouse practice based on Hadoop + Hive + HDFS + DataX + MySQL, covering member metrics testing (active/new/retention), HDFS export, DataX sync to MySQL, and advertising business ODS...

Offline Data Warehouse Practice: Flume+HDFS+Hive Building...

Demonstrates a complete pipeline from log collection to member metric analysis, covering Flume Taildir monitoring, HDFS partition storage, Hive external table loading, ODS/DWD/DWS/ADS layered proce...

Offline Data Warehouse Hive ADS Export MySQL DataX Practi...

An implementation path for exporting Hive ADS layer tables to MySQL in an offline data warehouse. Gives a typical DataX solution: hdfsreader -> mysqlwriter. Focuses on DataX JSON configuration and common erro...

Offline Data Warehouse Retention Rate Implementation: DWS...

Implementation of 'Member Retention' in an offline data warehouse: the DWS layer uses the dws_member_retention_day table to join the new member and startup detail tables to...

Offline Data Warehouse Hive New Member & Retention: DWS D...

The offline data warehouse calculates 'new members' daily and provides a consistently defined data foundation for the subsequent 'member retention' metric. Uses the 'full member table (with first day dt)' as deduplica...

Offline Data Warehouse Hive Practice: DWD to DWS Daily/We...

This article introduces using Hive to build an offline data warehouse for calculating active members (daily/weekly/monthly). Covers the complete flow from DWD...

Hive ODS Layer JSON Parsing: UDF Array Extraction, explod...

JSON data processing in a Hive offline data warehouse, covering the three most common needs: 1) Extract array fields from JSON strings and expand them with explode in SQL; 2)...

Hive ODS Layer Practice: External Table Partition Loading...

Engineering implementation of the ODS (Operational Data Store) layer in an offline data warehouse: the minimal closed loop of Hive external table + daily...

Flume Taildir + Custom Interceptor: Extract JSON Timestam...

An engineering implementation of an offline log collection chain with Apache Flume: use Taildir Source to monitor multiple directories and multiple...

Flume 1.9.0 Custom Interceptor: TAILDIR Multi-directory C...

Using TAILDIR Source to monitor multiple directories (start/event), using filegroups headers to mark different sources with logtype; then using custom...

Flume Optimization for Offline Data Warehouse: batchSize,...

Engineering optimization of Flume 1.9.0 in offline data warehouse (log collection→HDFS) scenarios: gives actionable value ranges and trade-off principles for key...

Offline Data Warehouse Member Metrics Practice

Offline data warehouse member metrics practice with Flume, covering new members, active members, and retention metrics.

How to Build an Offline Data Warehouse: Tracking → Metric...

Complete guide to building offline data warehouses: data tracking, metric system design, theme analysis, and standardization practices for e-commerce teams.

Offline Data Warehouse Architecture Selection and Cluster...

Overall architecture design and implementation method for an Offline Data Warehouse (Offline DW): framework selection comparison between the Apache community version and...

Offline Data Warehouse Layering: ODS/DWD/DWS/DIM/ADS Arch...

When implementing an Offline Data Warehouse in an enterprise, the two most common problems are: data silos caused by data mart expansion, and duplicate...

Offline Data Warehouse Modeling Practice

Offline data warehouse modeling practice, covering fact tables, dimension tables, fact types, and snowflake/galaxy models.

Data Warehouse Introduction: Four Characteristics, OLTP v...

2026 engineering practice, covering core concepts and implementation concerns for data warehouses: starting from enterprise data silos, explaining four...