Big Data 220 - Data Warehouse Introduction

In 1988, IBM first introduced the concept of “Information Warehouse” when facing increasingly scattered enterprise information systems and growing data silo problems.

In 1991, a significant milestone arrived in the data warehouse field. Bill Inmon, known as the “father of data warehouse,” published the groundbreaking book “Building the Data Warehouse.” This book was the first to systematically define the concept of data warehouse.

Four Characteristics of Data Warehouse

1. Subject Oriented

A subject is an abstract concept, a high-level synthesis and classification of data in enterprise information systems for analysis and utilization. Logically, it corresponds to analysis objects involving a macroscopic analysis area in an enterprise.

Data organization oriented to subjects is one of the core characteristics of data warehouses. It organizes data by business subjects rather than specific applications.

2. Integrated

Data warehouse data sources are very broad, usually including:

  • Internal data sources: Multiple business system databases within the enterprise such as ERP, CRM, SCM, various operational data files, user behavior logs, internal document data
  • External data sources: Data from third-party data vendors, public datasets, social media data, industry reports and market research data

The main differences between operational data and analytical data are data characteristic differences, data integration challenges, and data transformation processes. Data transformation processes include data cleaning, data conversion, data integration, and data loading, using ETL (Extract-Transform-Load) or ELT (Extract-Load-Transform) workflows.

3. Non-Volatile

Data in the data warehouse reflects historical data content over a considerable period of time, a collection of database snapshots at different points in time, and derived data compiled and aggregated based on these snapshots.

Data stability is primarily relative to applications. Users of data warehouses mostly perform data queries or complex mining. Once data enters the data warehouse, it is generally retained for a relatively long time.

4. Time Variant

Data warehouse contains historical data at various granularities. Data in the data warehouse may relate to specific dates, days, months, quarters, or years.

Although data warehouses don’t modify data, this doesn’t mean data warehouse data is forever unchanging. Data in the data warehouse also needs updating to meet decision-making needs:

  • Data warehouse data timeframe is generally much longer than operational data timeframe
  • Business systems store current data, while data warehouse stores historical data
  • Data in data warehouse is appended in time order, all with time attributes

Data Warehouse Functions

  • Integrate enterprise business data to establish a unified data center
  • Generate business reports to understand enterprise operations
  • Provide data support for enterprise operations and decisions
  • Can serve as data sources for various businesses, forming a virtuous cycle of business data mutual feedback
  • Analyze user behavior data, use data mining to reduce investment costs and improve investment effects
  • Develop data products to directly or indirectly generate profits for the enterprise

Database Comparison: OLTP vs OLAP

The difference between databases and data warehouses actually compares OLTP with OLAP.

OLTP (Online Transaction Processing)

Also called transaction-oriented processing systems. Mainly targets daily operations on specific businesses in database systems, usually querying and modifying a small number of records. Users are more concerned with response time of operations, data security, integrity, and number of concurrent users supported.

OLAP (Online Analytical Processing)

Generally targets historical data of certain subjects to support management decisions.

Essential Comparison

FeatureOLTPOLAP
PurposeDaily transaction processingAnalysis and decision
DataCurrent dataHistorical data
OperationsCRUD (Create/Read/Update/Delete)Query mainly
DesignTransaction orientedSubject oriented
NormalizationNormalizedDenormalized

Example: Banking

  • Database (Transaction System): Every transaction a customer makes at the bank is written to the database for accounting
  • Data Warehouse (Analysis System): How many transactions occurred at a bank branch in a month, what is the current deposit balance

Data warehouse exists because databases already exist, to further mine data resources for decision-making. It’s definitely not a so-called large database.