Tag: Data Warehouse

45 articles

Big Data 265 - Canal Deployment

Canal is an open-source tool from Alibaba that parses MySQL incremental logs (binlog) to enable data synchronization.

Big Data 263 - Canal Working Principle: Workflow and MySQL Binlog Basics

Canal is an open-source tool for MySQL database binlog incremental subscription and consumption.

MySQL Binlog Deep Dive: Storage Directory, Change Records and Format

MySQL's Binary Log (binlog) is a log file type in MySQL that records all change operations performed on the database (excluding SELECT and SHOW queries).
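As a hedged sketch, enabling the binlog typically involves a `my.cnf` fragment like the following; the file name, server id, and retention are placeholder values, and ROW format is the mode Canal-style parsers need:

```ini
[mysqld]
# Enable binary logging; "mysql-bin" is the base name of the log files
log-bin=mysql-bin
# ROW format records the actual changed rows (rather than the SQL statements),
# which incremental-log parsers such as Canal require
binlog_format=ROW
# A unique server id is required when binlog is enabled (placeholder value)
server-id=1
```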

Canal Data Sync: Introduction, Background, Principles and Architecture

Alibaba B2B's cross-region business between domestic sellers and overseas buyers drove the need for data synchronization between Hangzhou and US data centers.

Big Data 260 - Real-Time Data Warehouse: Background, Architecture and Requirements

Real-time data processing capability has become a key competitive factor for enterprises.

Big Data 259 - Griffin Configuration

Apache Griffin is an open-source data quality management framework designed to help organizations monitor and improve data quality in big data environments.

Big Data 258 - Griffin with Livy: Architecture, Installation, and Usage

Livy is a REST interface for Apache Spark designed to simplify Spark job submission and management, especially in big data processing scenarios.

Big Data 257 - Data Quality Monitoring

Apache Griffin is an open-source big data quality solution that supports both batch and streaming data quality detection.

Big Data 256 - Atlas Installation

Metadata (MetaData), in the narrow sense, refers to data that describes other data.

Big Data 255 - Atlas Data Warehouse Metadata Management

Metadata, in its narrowest sense, refers to data that describes other data.

Airflow Core Transaction Task Scheduling Integration for Offline Data Warehouse

Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks.

Big Data 253 - Airflow Core Concepts

Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks.

Big Data 252 - Airflow Crontab Scheduling

Linux systems use the cron (crond) service for scheduled tasks, which is enabled by default.
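A minimal sketch of how the five crontab fields map to a schedule, assuming a simple fixed-value expression (ranges, steps, and lists are not handled here):

```python
# The five crontab fields, in order. "0 2 * * *" means minute 0, hour 2,
# every day of month, every month, every day of week: daily at 02:00.
FIELDS = ["minute", "hour", "day_of_month", "month", "day_of_week"]

def parse_cron(expr):
    """Split a crontab expression into a field -> value dict ('*' = any)."""
    values = expr.split()
    if len(values) != len(FIELDS):
        raise ValueError("expected 5 cron fields")
    return dict(zip(FIELDS, values))

schedule = parse_cron("0 2 * * *")
print(schedule["hour"])  # 2
```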

Offline Data Warehouse ADS Layer and Airflow Task Scheduling

Apache Airflow is an open-source task scheduling and workflow management platform, primarily used for developing, debugging, and monitoring data pipelines.

Big Data 251 - Airflow Installation

Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks.

Big Data 247 - Offline Data Warehouse - Hive Order Zipper Table: Incremental Refresh Implementation

This article continues the zipper table practice, focusing on order history state incremental refresh.

Big Data 248 - Offline Data Warehouse: Dimension Tables

First, distinguish fact tables from dimension tables: in the model diagram, green marks fact tables and gray marks dimension tables.

Big Data 249 - Offline Data Warehouse DWD and DWS Layer

The order table is a periodic fact table; zipper tables can be used to retain order status history. Order detail tables are regular fact tables.

Big Data 247 - Offline Data Warehouse - Hive Zipper Table Practice: Order History State Incremental Refresh

This article provides a practical guide to Hive zipper tables for preserving order history states at low cost in offline data warehouses, supporting daily backtracking an...

Big Data 246 - Offline Data Warehouse - Hive Zipper Table Practice: Initialization, Incremental Update, Rollback Script

userinfo (partitioned table): userid, mobile, regdate; holds daily changed data (modified + new) and historical data (first day). userhis (zipper table): two additional fiel...

Big Data 245 - Offline Data Warehouse - Hive Zipper Table Getting Started: SCD Types, Table Creation and Loading

Slowly Changing Dimensions (SCD) refer to dimension attributes that change slowly over time in the real world (slow is relative to fact tables, where data changes faster...
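The zipper-table (SCD Type 2) refresh can be sketched in plain Python; this is a minimal illustration, not the articles' Hive implementation, and the row layout (`id`, `attrs`, `start_date`, `end_date`) and sentinel date are assumptions:

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # sentinel end date marking the current version

def zipper_refresh(history, daily_changes, load_date):
    """SCD Type 2 ('zipper table') incremental refresh sketch.

    history: list of dicts with keys id, attrs, start_date, end_date
    daily_changes: dict id -> attrs (today's modified + new rows)
    Returns the new history list.
    """
    result = []
    for row in history:
        if row["end_date"] == OPEN_END and row["id"] in daily_changes:
            # close the previously current version at the load date
            result.append(dict(row, end_date=load_date))
        else:
            result.append(row)
    for rid, attrs in daily_changes.items():
        # open a new current version for every changed or new id
        result.append({"id": rid, "attrs": attrs,
                       "start_date": load_date, "end_date": OPEN_END})
    return result
```

In Hive this same union of closed rows, unchanged rows, and newly opened rows typically overwrites the zipper table in one pass; conventions differ on whether the closed version ends at the load date or the day before.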

Big Data 244 - Offline Data Warehouse: Hive ODS Layer

Sync MySQL data to specified HDFS directory via DataX, then create ODS external tables in Hive with unified dt string partitioning.

Big Data 243 - Offline Data Warehouse: E-commerce Core Transaction Incremental Import

Scenario: three core e-commerce transaction tables are loaded daily and incrementally into the offline data warehouse ODS layer, partitioned by dt. Conclusion: DataX uses MySQLReader + HDFSWriter.
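A hedged sketch of what such a DataX job file can look like; table names, columns, connection details, and the `${dt}` partition variable are placeholders, and some required options are abbreviated:

```json
{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "user", "password": "****",
          "connection": [{
            "querySql": ["select * from orders where date(create_time) = '${dt}'"],
            "jdbcUrl": ["jdbc:mysql://host:3306/trade"]
          }]
        }
      },
      "writer": {
        "name": "hdfswriter",
        "parameter": {
          "defaultFS": "hdfs://namenode:9000",
          "path": "/warehouse/ods/ods_orders/dt=${dt}",
          "fileName": "ods_orders",
          "fileType": "text",
          "writeMode": "append",
          "fieldDelimiter": "\t",
          "column": [{"name": "order_id", "type": "string"}]
        }
      }
    }],
    "setting": {"speed": {"channel": 2}}
  }
}
```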

Big Data 241 - Offline Data Warehouse Practice: E-commerce Core Transaction Data Model & MySQL Source Table Design

Focusing on three main metrics: order count, product count, payment amount, breakdown analysis dimensions by sales region and product type (3-level category).

Big Data 240 - Offline Data Warehouse Advertising Hive ADS Practice: DataX Export to MySQL

Complete solution for exporting Hive ADS layer data to MySQL using DataX.

Offline Data Warehouse Advertising Business: Flume Import HDFS + ODS/DWD

Use a Flume Agent to collect event logs and write them to HDFS, then use Hive scripts to load the ODS and DWD layers by date.

Offline Data Warehouse Advertising Business Hive Analysis: CTR/CVR/Top100

Tracking log fields: action: user behavior (0 = impression; 1 = click after impression; 2 = purchase); duration: stay duration; shopid: merchant id; eventtype: "ad"; adtype: format type (1 = JPG; 2 = PNG).
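Given that action encoding, the CTR/CVR computation can be sketched in a few lines; this is an illustrative calculation, not the articles' Hive SQL:

```python
from collections import Counter

def ctr_cvr(actions):
    """Compute CTR and CVR from a stream of action codes.

    Assumes 0 = impression, 1 = click, 2 = purchase.
    CTR = clicks / impressions, CVR = purchases / clicks.
    """
    counts = Counter(actions)
    impressions, clicks, purchases = counts[0], counts[1], counts[2]
    ctr = clicks / impressions if impressions else 0.0
    cvr = purchases / clicks if clicks else 0.0
    return ctr, cvr

print(ctr_cvr([0, 0, 0, 0, 1, 1, 2]))  # (0.5, 0.5)
```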

Big Data 237 - Offline Data Warehouse Hive Advertising Practice: ODS to DWD Event Parsing

This article introduces completing parsing, cleaning, and detail modeling from ODS to DWD for offline data warehouse based on advertising events in tracking logs.

Big Data 236 - Offline Data Warehouse Member Metrics Verification, DataX Export and Advertising Pipeline

This is a practical verification article for member theme and advertising business pipeline based on Hadoop + Hive + HDFS + DataX + MySQL.

Big Data 235 - Offline Data Warehouse Practice: Flume, HDFS and Hive for ODS, DWD, DWS and ADS

This article demonstrates a complete offline data warehouse pipeline from log collection to member metric analysis.

Offline Data Warehouse Hive ADS Export MySQL DataX Practice

The implementation path for exporting Hive ADS layer tables to MySQL in an offline data warehouse, with the typical DataX solution: hdfsreader -> mysqlwriter.

Big Data 233 - Offline Data Warehouse Retention Rate: DWS Modeling & ADS Hive Aggregation

Implementation of 'member retention' in the offline data warehouse: the DWS layer uses the dwsmemberretention_day table, joining new-member and startup detail tables to compute retention.
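The day-1 retention join can be sketched as follows; the data shapes (a registration-date map and a set of activity facts) are assumptions standing in for the new-member and startup detail tables:

```python
from datetime import date, timedelta

def day1_retention(new_members, active, base_date):
    """Share of members who registered on base_date and were active the next day.

    new_members: dict member_id -> registration date
    active: set of (member_id, date) activity facts
    """
    cohort = {m for m, d in new_members.items() if d == base_date}
    if not cohort:
        return 0.0
    next_day = base_date + timedelta(days=1)
    retained = {m for m in cohort if (m, next_day) in active}
    return len(retained) / len(cohort)
```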

Big Data 232 - Hive New Member & Retention

The offline data warehouse calculates 'new members' daily, providing a consistently defined data foundation for subsequent 'member retention' analysis.

Offline Data Warehouse Hive Practice: DWD to DWS Daily/Weekly/Monthly Active Member

This article introduces using Hive to build an offline data warehouse for calculating active members (daily/weekly/monthly).
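The daily/weekly/monthly active definitions reduce to distinct-member counts over trailing windows; a minimal sketch, assuming activity facts as (member_id, date) pairs and trailing 7-day and 30-day windows ending on the given day:

```python
from datetime import date, timedelta

def active_members(events, day):
    """Distinct active members for a day, its trailing 7 days, and 30 days.

    events: list of (member_id, date) activity facts.
    Returns (dau, wau, mau) member counts.
    """
    week_start = day - timedelta(days=6)
    month_start = day - timedelta(days=29)
    dau = {m for m, d in events if d == day}
    wau = {m for m, d in events if week_start <= d <= day}
    mau = {m for m, d in events if month_start <= d <= day}
    return len(dau), len(wau), len(mau)
```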

Hive ODS Layer JSON Parsing: UDF Array Extraction, explode/json_tuple

JSON data processing in Hive offline data warehouse, covering three most common needs: 1) Extract array fields from JSON strings and explode-expand in SQL
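The array-extraction pattern that Hive handles with `json_tuple` plus `explode` can be illustrated in plain Python; the log line and field names here are hypothetical:

```python
import json

# A hypothetical tracking-log line: one record holding an array of events.
line = '{"uid": "u01", "events": [{"name": "launch"}, {"name": "click"}]}'

record = json.loads(line)
# "Explode" the array: one output row per element (like Hive's explode on
# the parsed array), carrying along the scalar field the way json_tuple
# extracts top-level values.
rows = [(record["uid"], e["name"]) for e in record["events"]]
print(rows)  # [('u01', 'launch'), ('u01', 'click')]
```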

Hive ODS Layer Practice: External Table Partition Loading and JSON Parsing

Engineering implementation of the ODS (Operational Data Store) layer in an offline data warehouse: the minimal closed loop of Hive external tables + daily partition loading.

Big Data 228 - Flume Taildir + Custom Interceptor: Extract JSON Timestamps, Mark Headers & Partition HDFS by Event Time

Apache Flume offline log collection implementation using Taildir Source and a custom Interceptor to extract JSON timestamps, mark headers, and route HDFS partitions by ev...

Flume 1.9.0 Custom Interceptor: TAILDIR Multi-directory Collection, Filter by Logtime

Using TAILDIR Source to monitor multiple directories (start/event), using filegroups headers to mark different sources with logtype

Big Data 226 - Flume Optimization for Offline Data Warehouse: batchSize, Channels, Compression, Interceptors & OOM Tuning

Flume 1.9.0 tuning guide for offline data warehouse log collection to HDFS, covering batch parameters, channel capacity and transaction sizing, JVM heap tuning...

Offline Data Warehouse Member Metrics Practice

Scenario: use startup and event logs in the offline data warehouse to count new members, active members (DAU/WAU/MAU), and retention.

Big Data 223 - How to Build an Offline Data Warehouse: Tracking, Metrics and Thematic Analysis

Complete guide to building offline data warehouses: data tracking, metric system design, theme analysis, and standardization practices for e-commerce teams.

Offline Data Warehouse Architecture Selection and Cluster Design

Offline Data Warehouse (Offline DW) overall architecture design and implementation: framework selection comparison between the Apache community version and...

Big Data 221 - Offline Data Warehouse Layering: ODS, DWD, DWS, DIM and ADS Architecture

Scenario: as departments build more data marts, definitions grow inconsistent and interfaces disconnected, forming data silos and driving up data query costs.

Offline Data Warehouse Modeling Practice

In data warehouse architecture, Fact Table is the core table structure that stores business process metric values or facts.

Big Data 220 - Data Warehouse Introduction

In 1988, IBM first introduced the concept of "Information Warehouse" when facing increasingly scattered enterprise information systems and growing data silo problems.