Offline Data Warehouse Modeling Practice

Fact Tables and Dimension Tables

In data warehouse architecture, Fact Table is the core table structure that stores business process metric values or facts. It is the central table in star schema or snowflake schema, associated with multiple dimension tables.

Key characteristics of fact tables:

Contains quantified metrics (facts) of business processes, such as sales amount, order quantity, inventory, etc.
Usually contains foreign keys for associating with dimension tables
Data volume is usually very large, possibly containing millions or even billions of records
Records are typically read-only, reflecting historical business activities

Facts in fact tables can be divided into three types:

Additive Facts: Can be aggregated by any dimension (e.g., sales amount)
Semi-additive Facts: Can only be aggregated on some dimensions (e.g., account balance)
Non-additive Facts: Cannot be aggregated (e.g., unit price, percentage)

The granularity of a fact table refers to the business meaning represented by each row of data in the table. Common granularities include:

Transaction Level: Each record represents a transaction (e.g., individual order item)
Snapshot Level: Regularly records status (e.g., daily inventory snapshot)
Accumulated Snapshot: Tracks multiple stages of a business process (e.g., order from creation to delivery)

Common Fact Tables

Common fact tables: Order fact tables Fact table characteristics: Many tables (various fact tables), large data volume Fact tables can be divided by data granularity:

Transaction Fact Tables
Periodic Snapshot Fact Tables
Accumulated Snapshot Fact Tables

Common Dimension Tables

Dimension tables can be viewed as perspectives for analyzing data. Dimension tables contain the characteristics of fact records in the fact data table. Some characteristics provide descriptive information, some specify how to summarize fact data table data to provide useful information for analysts.

Common dimension tables:

Time Dimension
Region Dimension
Product Dimension

Simple Summary

Fact tables are the content of concern (e.g., sales amount, sales volume)
Dimension tables are perspectives for observing transactions

Fact Table Classification

Transaction Fact Tables

Transaction fact tables record transaction-level facts, saving the most atomic data, also called “atomic fact table”. Data in transaction fact tables is generated after transaction events occur, with granularity typically one record per transaction. Once a transaction is committed, fact table data is inserted, data is no longer changed, and its update method is incremental update. The date dimension in transaction fact tables records the date when the transaction occurred, it records the content of transaction activities. For example: Order tables. Through transaction fact tables, aggregated fact tables can also be built to provide users with high-performance analysis.

Periodic Snapshot Fact Tables

Periodic snapshot fact tables record facts at regular, predictable time intervals, such as daily, monthly, yearly, etc. Typical examples include daily sales snapshot tables, daily inventory snapshot tables, etc. It statistics measure statistics within the interval period, such as history-to-date, natural year-to-date, quarter-to-date, etc.

Accumulated Snapshot Fact Tables

Although both accumulated snapshot fact tables and periodic snapshot fact tables are used to store snapshot information of transaction data, there are significant differences in design concepts and application scenarios.

Time Period Characteristics:
- Periodic Snapshot Fact Tables: Record static snapshot data at fixed time intervals (e.g., daily, weekly, monthly). For example, bank account month-end balance snapshots.
- Accumulated Snapshot Fact Tables: Track complete business processes spanning uncertain time periods, usually containing multiple milestone dates. For example, various key node times in order processing flow.
Date Field Design:
- Accumulated Snapshot Fact Tables contain multiple date dimensions: process start date (e.g., order creation date), intermediate process dates (e.g., approval date, shipping date), completion date (e.g., receipt date), last update date (records data supplementation time)
- During initial loading, future dates usually use special surrogate keys (e.g., -1 indicating “not occurred”) as placeholders
Data Update Mechanism:
- Periodic Snapshot Fact Tables are usually read-only, reflecting status at specific points in time
- Accumulated Snapshot Fact Tables support update operations; when business process advances, subsequent date information can be supplemented
Typical Application Scenarios:
- Full order processing flow tracking (create→pay→ship→receive)
- Insurance claim process (report→damage assessment→approval→payment)
- Customer service tickets (create→assign→process→close)

Snowflake Model

The snowflake model is a variant of the star model, with normalized dimension tables, model shape similar to snowflake. Characteristics: Snowflake structure removes data redundancy. Star model has data redundancy, so when querying statistics, only a small number of table joins are needed, query efficiency is high. Star model does not consider dimension table normalization factors, design and implementation are easy. In cases where data redundancy is acceptable, star model is actually used more.

Snowflake Model Structure

Fact Table: The core part of the snowflake model is still the fact table. Fact tables usually store metric data of business events, such as sales amount, order quantity, etc. Fact tables contain foreign keys pointing to dimension tables.

Dimension Tables: In snowflake model, dimension tables are normalized, meaning different attributes in the same dimension are split into multiple sub-tables. For example, the “customer” dimension may contain multiple levels of attributes: basic customer information may be stored in one table, while customer address information may be stored in a separate table.

Advantages and Disadvantages of Snowflake Model

Advantages:

Lower data redundancy: Since dimension tables are normalized, snowflake model reduces data redundancy
Storage space saved: Due to normalized structure reducing duplicate data
Better data consistency: Since each attribute exists only in a specific dimension table

Fact Constellation

The data warehouse consists of multiple themes, containing multiple fact tables, while dimension tables are common and can be shared. This pattern can be viewed as a collection of star models. Therefore, it is called galaxy model or fact constellation model. Characteristics: Common dimension tables

Metadata

Metadata is data about data. Metadata connects source data, data warehouses, and data applications, recording the entire process of data from generation to consumption. Metadata is like a map of all data. With this map, you can know in the data warehouse:

What data exists
Data distribution situation
Data types
Relationships between data
Which data is frequently used, which data is rarely accessed

In big data platforms, metadata runs through the entire process of data flow in the big data platform, mainly including data source metadata, data processing metadata, data theme/specialty database metadata, service layer metadata, application layer metadata, etc.

Businesses usually classify metadata into the following types:

Technical Metadata: Database table structure, data models, ETL programs, SQL programs, etc.
Business Metadata: Business metrics, business codes, business terms, etc.
Management Metadata: Data owners, data quality, data security, etc.