Fact Tables and Dimension Tables
In data warehouse architecture, Fact Table is the core table structure that stores business process metric values or facts. It is the central table in star schema or snowflake schema, associated with multiple dimension tables.
Key characteristics of fact tables:
- Contains quantified metrics (facts) of business processes, such as sales amount, order quantity, inventory, etc.
- Usually contains foreign keys for associating with dimension tables
- Data volume is usually very large, possibly containing millions or even billions of records
- Records are typically read-only, reflecting historical business activities
Facts in fact tables can be divided into three types:
- Additive Facts: Can be aggregated by any dimension (e.g., sales amount)
- Semi-additive Facts: Can only be aggregated on some dimensions (e.g., account balance)
- Non-additive Facts: Cannot be aggregated (e.g., unit price, percentage)
The granularity of a fact table refers to the business meaning represented by each row of data in the table. Common granularities include:
- Transaction Level: Each record represents a transaction (e.g., individual order item)
- Snapshot Level: Regularly records status (e.g., daily inventory snapshot)
- Accumulated Snapshot: Tracks multiple stages of a business process (e.g., order from creation to delivery)
Common Fact Tables
Common fact tables: Order fact tables Fact table characteristics: Many tables (various fact tables), large data volume Fact tables can be divided by data granularity:
- Transaction Fact Tables
- Periodic Snapshot Fact Tables
- Accumulated Snapshot Fact Tables
Common Dimension Tables
Dimension tables can be viewed as perspectives for analyzing data. Dimension tables contain the characteristics of fact records in the fact data table. Some characteristics provide descriptive information, some specify how to summarize fact data table data to provide useful information for analysts.
Common dimension tables:
- Time Dimension
- Region Dimension
- Product Dimension
Simple Summary
- Fact tables are the content of concern (e.g., sales amount, sales volume)
- Dimension tables are perspectives for observing transactions
Fact Table Classification
Transaction Fact Tables
Transaction fact tables record transaction-level facts, saving the most atomic data, also called “atomic fact table”. Data in transaction fact tables is generated after transaction events occur, with granularity typically one record per transaction. Once a transaction is committed, fact table data is inserted, data is no longer changed, and its update method is incremental update. The date dimension in transaction fact tables records the date when the transaction occurred, it records the content of transaction activities. For example: Order tables. Through transaction fact tables, aggregated fact tables can also be built to provide users with high-performance analysis.
Periodic Snapshot Fact Tables
Periodic snapshot fact tables record facts at regular, predictable time intervals, such as daily, monthly, yearly, etc. Typical examples include daily sales snapshot tables, daily inventory snapshot tables, etc. It statistics measure statistics within the interval period, such as history-to-date, natural year-to-date, quarter-to-date, etc.
Accumulated Snapshot Fact Tables
Although both accumulated snapshot fact tables and periodic snapshot fact tables are used to store snapshot information of transaction data, there are significant differences in design concepts and application scenarios.
-
Time Period Characteristics:
- Periodic Snapshot Fact Tables: Record static snapshot data at fixed time intervals (e.g., daily, weekly, monthly). For example, bank account month-end balance snapshots.
- Accumulated Snapshot Fact Tables: Track complete business processes spanning uncertain time periods, usually containing multiple milestone dates. For example, various key node times in order processing flow.
-
Date Field Design:
- Accumulated Snapshot Fact Tables contain multiple date dimensions: process start date (e.g., order creation date), intermediate process dates (e.g., approval date, shipping date), completion date (e.g., receipt date), last update date (records data supplementation time)
- During initial loading, future dates usually use special surrogate keys (e.g., -1 indicating “not occurred”) as placeholders
-
Data Update Mechanism:
- Periodic Snapshot Fact Tables are usually read-only, reflecting status at specific points in time
- Accumulated Snapshot Fact Tables support update operations; when business process advances, subsequent date information can be supplemented
-
Typical Application Scenarios:
- Full order processing flow tracking (create→pay→ship→receive)
- Insurance claim process (report→damage assessment→approval→payment)
- Customer service tickets (create→assign→process→close)
Snowflake Model
The snowflake model is a variant of the star model, with normalized dimension tables, model shape similar to snowflake. Characteristics: Snowflake structure removes data redundancy. Star model has data redundancy, so when querying statistics, only a small number of table joins are needed, query efficiency is high. Star model does not consider dimension table normalization factors, design and implementation are easy. In cases where data redundancy is acceptable, star model is actually used more.
Snowflake Model Structure
Fact Table: The core part of the snowflake model is still the fact table. Fact tables usually store metric data of business events, such as sales amount, order quantity, etc. Fact tables contain foreign keys pointing to dimension tables.
Dimension Tables: In snowflake model, dimension tables are normalized, meaning different attributes in the same dimension are split into multiple sub-tables. For example, the “customer” dimension may contain multiple levels of attributes: basic customer information may be stored in one table, while customer address information may be stored in a separate table.
Advantages and Disadvantages of Snowflake Model
Advantages:
- Lower data redundancy: Since dimension tables are normalized, snowflake model reduces data redundancy
- Storage space saved: Due to normalized structure reducing duplicate data
- Better data consistency: Since each attribute exists only in a specific dimension table
Fact Constellation
The data warehouse consists of multiple themes, containing multiple fact tables, while dimension tables are common and can be shared. This pattern can be viewed as a collection of star models. Therefore, it is called galaxy model or fact constellation model. Characteristics: Common dimension tables
Metadata
Metadata is data about data. Metadata connects source data, data warehouses, and data applications, recording the entire process of data from generation to consumption. Metadata is like a map of all data. With this map, you can know in the data warehouse:
- What data exists
- Data distribution situation
- Data types
- Relationships between data
- Which data is frequently used, which data is rarely accessed
In big data platforms, metadata runs through the entire process of data flow in the big data platform, mainly including data source metadata, data processing metadata, data theme/specialty database metadata, service layer metadata, application layer metadata, etc.
Businesses usually classify metadata into the following types:
- Technical Metadata: Database table structure, data models, ETL programs, SQL programs, etc.
- Business Metadata: Business metrics, business codes, business terms, etc.
- Management Metadata: Data owners, data quality, data security, etc.