TL;DR

  • Scenario: As department-built data marts multiply, definitions grow inconsistent, interfaces stay disconnected, data silos form, and data query costs explode.
  • Conclusion: Use “layering + consistent dimensions + reusable summary layer” to control complexity; the ER model suits stable, large organizations, while dimensional modeling is better for agile delivery.
  • Output: A set of offline data warehouse layer definitions (ODS→DW→ADS) + modeling method comparison + common failure quick reference.

Data Mart

A Data Warehouse (DW) is a global, theme-oriented data organization, but a global data warehouse is often too large to build in one go. In practice, it is divided by department or business line into local data organizations reflecting individual sub-themes, i.e., Data Marts, sometimes also called departmental data warehouses.

Data Mart: A data collection organized by theme domain to support department-level data analysis and decision-making. For example, in a commodity sales data warehouse, multiple different theme data marts can be established:

  • Commodity Procurement Data Mart
  • Commodity Inventory Data Mart
  • Commodity Sales Data Mart

Because a data mart is only part of a data warehouse, it is much easier to implement and can quickly meet the urgent needs of individual business departments, often achieving considerable early success. However, as data marts continue to grow, the shortcomings of this architecture emerge: marts built independently follow different standards and construction principles, leading to chaotic, inconsistent data across marts and forming numerous data silos.

Data Silo Phenomenon Explanation

As enterprise business continues to expand, organizational structure is usually divided into multiple Business Units (BU). For example, a large retail enterprise may have e-commerce BU, offline retail BU, supply chain BU, etc. Each business unit generates large amounts of business data during operations, including sales records, inventory information, customer data, etc.

Causes of Data Silo Formation

  1. Independent Data Storage Systems:

    • Different departments often use different database systems (e.g., MySQL for e-commerce, Oracle for finance)
    • Inconsistent data formats (e.g., date formats: YYYY-MM-DD vs MM/DD/YYYY)
    • Inconsistent naming standards (e.g., customer ID field may be named cust_id, customerID, or client_no)
  2. Departmental Barriers:

    • Lack of cross-departmental data sharing mechanisms
    • Access restrictions set for data security reasons
    • Data protectionism due to inter-departmental competition
  3. Technical Barriers:

    • Incompatible system architectures (e.g., some using cloud services, some using local servers)
    • Lack of unified data interface standards
    • Technical debt from legacy systems
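The format and naming inconsistencies above can be made concrete with a small sketch. This is a hypothetical illustration, not production code: the field aliases and date formats are the examples from the text, while `unify()` and its mapping tables are made up for demonstration.

```python
# Two departments store the same customer record with different field names
# (cust_id vs client_no) and different date formats (YYYY-MM-DD vs MM/DD/YYYY).
# unify() renames aliased fields and canonicalizes date strings.
from datetime import datetime

FIELD_ALIASES = {"cust_id": "customer_id", "customerID": "customer_id", "client_no": "customer_id"}
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y"]  # the two formats mentioned in the text

def parse_date(value: str) -> str:
    """Try each known source format and emit a single canonical YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

def unify(record: dict) -> dict:
    """Rename aliased fields and canonicalize any *_date field."""
    out = {}
    for key, value in record.items():
        key = FIELD_ALIASES.get(key, key)
        if key.endswith("_date"):
            value = parse_date(value)
        out[key] = value
    return out

ecommerce_row = {"cust_id": "C001", "order_date": "2024-03-15"}
finance_row = {"client_no": "C001", "invoice_date": "03/15/2024"}
print(unify(ecommerce_row))  # {'customer_id': 'C001', 'order_date': '2024-03-15'}
print(unify(finance_row))    # {'customer_id': 'C001', 'invoice_date': '2024-03-15'}
```

Even this toy version shows why ad-hoc normalization scattered across departments diverges: each team would hard-code its own aliases and formats.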

Specific Manifestations of Data Silos

Taking a retail enterprise as an example:

  • The online shopping behavior data held by the e-commerce BU cannot be integrated with member data from offline stores
  • The inventory data from the supply chain department cannot be synchronized to the sales department in real time
  • The activity effect data from the marketing department is difficult to correlate with ROI analysis from the finance department

Negative Impacts of Data Silos

  1. Reduced Decision-Making Efficiency:

    • Management cannot obtain a complete business view
    • Need to manually integrate multiple reports to make decisions
  2. Increased Operating Costs:

    • Repetitive data collection and cleaning work
    • Need to maintain multiple independent data systems
  3. Damaged Customer Experience:

    • Unable to achieve cross-channel personalized services
    • Customers need to provide the same information repeatedly to different departments
  4. Hindered Innovation:

    • Difficult to carry out cross-departmental data analysis projects
    • Unable to fully exploit the potential value of data

Typical Scenario Example

Scenario: Customer Complaint Handling

  • Customer Service Department: Records complaint content, but cannot view the customer’s historical order data
  • After-Sales Department: Handles returns and exchanges, but doesn’t know the customer’s previous consultation records
  • Quality Department: Analyzes product issues, but lacks complete customer feedback data

This results in customer issues requiring multiple transfers, long processing cycles, and decreased customer satisfaction.

Modeling Methods

A data model is a method of organizing and storing data. It emphasizes storing data sensibly from the perspectives of business, data access, and usage. With a data model suited to the underlying storage environment, you gain the following benefits:

  • Performance: A good data model can help us quickly query required data, reducing data I/O throughput
  • Cost: A good data model can greatly reduce unnecessary data redundancy and achieve computational result reuse, greatly reducing storage and computation costs in data systems
  • Efficiency: A good data model can greatly improve the experience of using data, improving data usage efficiency
  • Quality: A good data model standardizes inconsistent calculation definitions and reduces the chance of data calculation errors

Big data systems need data model methods to help better organize and store data to achieve the best balance among performance, cost, efficiency, and quality.

ER Model

Bill Inmon, the father of data warehousing, proposed the ER modeling method: design a 3NF model from an enterprise-wide perspective, using the Entity-Relationship (ER) model to describe enterprise business in a way that conforms to 3NF in normalization theory. The difference between 3NF in a data warehouse and 3NF in an OLTP system is that the former is a subject-oriented abstraction from the enterprise perspective, rather than an abstraction of entity objects for a specific business process. It has the following characteristics:

  • Need comprehensive understanding of entire enterprise business and data
  • Very long implementation cycle
  • Very high requirements for modelers’ capabilities

The starting point for building a data warehouse with the ER model is data integration: from an enterprise-wide perspective, combine and merge data from various systems by theme similarity and process it consistently so that it can serve analysis and decision-making; the result, however, cannot be used directly for analysis. The modeling steps are divided into three phases:

  • High-level Model: A highly abstract model describing main themes and relationships between themes, used to describe the overall business situation of the enterprise
  • Middle-level Model: Based on the high-level model, refine theme data items
  • Physical Model (also called bottom-level model): Based on the middle-level model, consider physical storage, and design physical attributes based on performance and platform characteristics, possibly also do table merging, partition design, etc.

Dimensional Model

The dimensional model was advocated by Ralph Kimball, a master in the data warehouse field. His book “The Data Warehouse Toolkit” is the most popular classic on data warehouse modeling.

Dimensional models are constructed from the needs of analysis and decision-making and serve those needs, focusing on letting users complete analysis requirements faster while keeping good query performance for large-scale complex queries. Its typical representative is the star schema, with the snowflake schema used in some special scenarios.

The design mainly includes the following steps:

  1. Select the business process to analyze. A business process can be a single business event (e.g., a payment or refund), the state of something at a point in time (e.g., the current account balance), or a series of related events forming a workflow.
  2. Select the data granularity. Anticipate the level of detail all analyses will require, and choose the granularity accordingly.
  3. Identify the dimension tables. Given the granularity, design dimension tables whose attributes are used for grouping and filtering during analysis.
  4. Select the facts. Determine the metrics the analysis needs to measure.
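The four steps above can be sketched as a toy star schema in plain Python. This is a minimal illustration under made-up assumptions: business process = payments, granularity = one row per payment, dimensions = date and product, fact = amount; all table contents and names are hypothetical.

```python
# Dimension tables: surrogate key -> descriptive attributes used for
# grouping and filtering (step 3).
dim_date = {1: {"date": "2024-03-01", "month": "2024-03"},
            2: {"date": "2024-04-02", "month": "2024-04"}}
dim_product = {10: {"name": "widget", "category": "tools"},
               11: {"name": "gadget", "category": "tools"}}

# Fact table at payment granularity (step 2): foreign keys to the
# dimensions plus the measure chosen in step 4.
fact_payment = [
    {"date_key": 1, "product_key": 10, "amount": 30.0},
    {"date_key": 1, "product_key": 11, "amount": 20.0},
    {"date_key": 2, "product_key": 10, "amount": 50.0},
]

def revenue_by(dim_attr_fn):
    """Group-and-sum over the fact table, labeling groups via a dimension attribute."""
    totals = {}
    for row in fact_payment:
        label = dim_attr_fn(row)
        totals[label] = totals.get(label, 0.0) + row["amount"]
    return totals

# Slice the same facts by month (via dim_date) or category (via dim_product).
by_month = revenue_by(lambda r: dim_date[r["date_key"]]["month"])
by_category = revenue_by(lambda r: dim_product[r["product_key"]]["category"])
print(by_month)     # {'2024-03': 50.0, '2024-04': 50.0}
print(by_category)  # {'tools': 100.0}
```

The star shape is visible even here: one central fact table, with any dimension attribute usable as a grouping axis without restructuring the facts.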

Modern enterprises have fast-changing business, frequent personnel turnover, and few people with comprehensive business knowledge, which makes ER model design cycles long. The experience of most enterprises implementing data warehouses shows that building ER models on immature, fast-changing business carries very high risk and is rarely suitable. Dimensional modeling, by contrast, has low technical requirements, is quick to learn, supports agile iteration and fast delivery, completes analysis requirements faster, and still performs well on large-scale complex queries.

Data Warehouse Layering

A data warehouse represents a way of managing and using data. It is a complete process and body of theory covering data modeling, ETL (Extract, Transform, Load), job scheduling, and more. Data warehouses are usually layered during construction, and different businesses apply different technical approaches to layering.

The main reason for layering is to gain clearer control over data when managing it. Specifically:

  • Clear Data Structure: Each data layer has its scope, making it easier to locate and understand when using tables
  • Simplify Complex Problems: Decompose a complex task into multiple steps, each layer handling a single problem, which is relatively simple and easy to understand, and facilitates maintaining data accuracy. When data problems occur, you don’t need to fix all data, just fix from where the problem occurred
  • Reduce Redundant Development: Standardized layering and shared intermediate tables greatly reduce duplicated calculations
  • Shield Raw Data Anomalies: Downstream layers are insulated from source-side changes, so data does not need to be re-ingested for every business change
  • Track Data Lineage: What the business ultimately sees is a directly usable table, but it has many sources. If a source table has a problem, lineage lets you quickly and accurately locate it and understand its scope of impact
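The lineage idea can be sketched with a few lines of Python: record which upstream tables each table is built from, then, given a broken source, find every downstream table affected. All table names here are hypothetical.

```python
# table -> list of tables it reads from (a tiny lineage graph)
LINEAGE = {
    "dwd_orders": ["ods_orders"],
    "dim_customer": ["ods_crm"],
    "dws_sales_daily": ["dwd_orders", "dim_customer"],
    "ads_sales_report": ["dws_sales_daily"],
}

def impacted_by(source: str) -> set:
    """Walk the lineage graph downstream from a broken source table."""
    hit, changed = set(), True
    while changed:  # keep sweeping until no new downstream table is found
        changed = False
        for table, parents in LINEAGE.items():
            if table not in hit and (source in parents or hit & set(parents)):
                hit.add(table)
                changed = True
    return hit

print(sorted(impacted_by("ods_orders")))
# ['ads_sales_report', 'dwd_orders', 'dws_sales_daily']
```

Given `ods_orders` breaks, the walk immediately tells you which DWD, DWS, and ADS tables need to be rerun, which is the "locate the problem and understand its scope" benefit described above.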

Common data warehouse layering is generally 3 layers:

  • Data Operation Layer
  • Data Warehouse Layer
  • Application Data Layer (Data Mart Layer)

Of course, based on practitioners’ experience or the business, more layers can be defined, as long as the processing stays clear and data remains easy to query.

ODS

ODS (Operational Data Store), the data staging area. Data from the warehouse’s source systems is usually stored as-is in the ODS layer, also known as the preparation area; it is the source for all subsequent warehouse processing. The main data sources of the ODS layer include:

  • Business Databases: Tools such as DataX or Sqoop can perform scheduled extraction, typically once daily. For real-time applications, Canal can monitor the MySQL binlog and capture changed data as it happens
  • Event Tracking Logs: Online systems write various logs, generally saved as files; Flume can collect them on a schedule
  • Other Data Sources: Data purchased from third parties, or data obtained by web crawlers
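The "store as-is, partitioned by extraction date" convention for ODS can be sketched as follows. This is a hypothetical illustration: `land_to_ods()` stands in for whatever a DataX/Sqoop/Flume job would do, and the directory layout (`dt=` partitions) is an assumption, albeit a common one.

```python
# Land raw rows unchanged into a dt=YYYY-MM-DD partition directory.
# No cleaning happens here: ODS keeps the source shape.
import json
import os
from datetime import date

def land_to_ods(table: str, rows: list, dt: str, root: str = "warehouse/ods") -> str:
    """Write raw rows as newline-delimited JSON into a date partition."""
    part_dir = os.path.join(root, table, f"dt={dt}")
    os.makedirs(part_dir, exist_ok=True)
    path = os.path.join(part_dir, "data.json")
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")  # as-is: values stay strings, no renaming
    return path

rows = [{"order_id": 1, "amount": "30"}, {"order_id": 2, "amount": "20"}]
print(land_to_ods("ods_orders", rows, date.today().isoformat()))
```

Keeping ODS raw like this is what makes the "shield raw-data anomalies" and rerun-isolation properties of the later layers possible: cleaning mistakes in DWD can always be recomputed from the untouched landing files.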

DW

DW (Data Warehouse Layer) includes the DWD, DWS, and DIM layers. It is processed from ODS-layer data and mainly completes data cleansing and integration: building consistent dimensions, constructing reusable detail fact tables for analysis and statistics, and summarizing metrics at common granularities.

  • DWD (Data Warehouse Detail layer): The isolation layer between the business systems and the data warehouse. It is modeled by business process, constructing fine-grained detail fact tables based on the characteristics of each specific business process. Depending on how the enterprise uses its data, important dimension attribute fields can be redundantly stored in the detail fact tables, i.e., wide-table processing.
  • DWS (Data Warehouse Service Layer): Based on DWD basic data, integrate and summarize into service data for analyzing a certain theme domain. Driven by analysis themes, based on upper-layer application and product metric needs, construct common granularity summary metric fact tables.
  • Public Dimension Layer (DIM): Based on dimensional modeling concept, build consistent dimensions
  • TMP Layer: Temporary layer, storing temporary data generated during calculation
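The layer definitions above can be shown end to end with a toy ODS → DWD → DWS flow. All table contents and field names are made up; the point is only the division of labor: DWD types, cleans, and widens the detail (joining in a DIM attribute), and DWS summarizes it at a common granularity.

```python
# ODS: raw rows as landed, amounts still strings, no dimension attributes.
ods_orders = [
    {"order_id": 1, "cust": "C1", "amt": "30.0", "dt": "2024-03-01"},
    {"order_id": 2, "cust": "C2", "amt": "20.0", "dt": "2024-03-01"},
    {"order_id": 3, "cust": "C1", "amt": "50.0", "dt": "2024-03-02"},
]
# DIM: consistent customer dimension shared by every detail table.
dim_customer = {"C1": {"region": "north"}, "C2": {"region": "south"}}

# DWD: typed, cleaned detail with the region attribute redundantly
# attached (the wide-table processing described above).
dwd_order_detail = [
    {**row, "amt": float(row["amt"]), "region": dim_customer[row["cust"]]["region"]}
    for row in ods_orders
]

# DWS: common-granularity summary (revenue per day), reusable by many
# downstream ADS reports instead of each recomputing from detail.
dws_sales_daily = {}
for row in dwd_order_detail:
    dws_sales_daily[row["dt"]] = dws_sales_daily.get(row["dt"], 0.0) + row["amt"]

print(dws_sales_daily)  # {'2024-03-01': 50.0, '2024-03-02': 50.0}
```

An ADS table would then read `dws_sales_daily` directly, doing only light integration and presentation rather than repeating the join and cleaning logic.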

ADS

ADS (Application Data Service): Based on DW data, integrate and summarize into service data of theme domains, used for subsequent business queries, etc.

Simple Summary

Data warehouse layer division is not fixed, can be appropriately trimmed or added based on actual needs. If business is relatively simple and independent, DWD and DWS can be merged.

Error Quick Reference

  • Symptom: The same metric shows inconsistent values across reports
    Root cause: Definitions scattered across data marts and team scripts; inconsistent dimensions; inconsistent filter conditions
    Fix: Align metric definitions (statistics period, deduplication rules, status filters); compare SQL/ETL logic differences across tables; centralize metrics into the DWS common-granularity summary layer; unify DIM dimensions and primary-key mappings; prohibit ADS from changing definitions on its own
  • Symptom: The same cleaning/join is repeated and task chains keep growing
    Root cause: Unclear layer boundaries, with ODS/DWD serving upper layers directly; no reusable intermediate layer
    Fix: Check the DAG for the same join/filter appearing in multiple tasks; check lineage for one source producing duplicate outputs; extract common logic into DWD (detail facts) or DIM (consistent dimensions); let DWS provide common-granularity summaries for reuse
  • Symptom: Queries are slow and scan volumes huge
    Root cause: Detail-layer granularity too fine without partitioning or a wide-table strategy; unreasonable dimension table design
    Fix: Check execution plans for full table scans and excessive shuffle; verify partition fields are hit; in DWD, set granularity by business process and partition the data; widen tables moderately; split hot dimension fields from low-frequency ones
  • Symptom: A business change to a field or logic forces massive upstream reruns
    Root cause: ODS does not do “raw retention + change isolation”; upstream anomalies directly pollute DW
    Fix: Compare ODS raw data against DWD cleaned data; check whether field changes have a compatibility layer; keep ODS to raw landing and minimal normalization; do stable abstraction in DWD; add transitional fields and compatibility views
  • Symptom: Dimension definition chaos: the same customer has different IDs in different systems
    Root cause: Missing master data; no consistent dimensions (DIM) or mapping tables
    Fix: Spot-check ID mapping coverage across order, membership, CRM, and other systems; establish unified surrogate keys and mapping tables; have DIM maintain lifecycle and an SCD (slowly changing dimension) strategy
  • Symptom: ADS tables grow into “another data warehouse”
    Root cause: ADS has taken on modeling and processing responsibilities beyond its role
    Fix: Check whether ADS contains complex joins, cleaning, or definition calculations; limit ADS to light integration and application-facing output; move processing back into DW (DWD/DWS/DIM)
  • Symptom: Temporary TMP tables persist long-term with untraceable definitions
    Root cause: TMP treated as formal output, with no governance or lifecycle
    Fix: Check metadata for TMP lifetime, reference chains, and whether an owner exists; set TTLs and naming standards for TMP; require referenced results to land in DWD/DWS; include TMP in lineage and quality checks
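The SCD (slowly changing dimension) strategy mentioned in the dimension-governance row can be sketched as a minimal Type 2 update: instead of overwriting a customer's attributes, close the current row and open a new dated version. Field names and the high-date sentinel are illustrative assumptions.

```python
# Type 2 SCD: each key may have several versioned rows; exactly one is
# "open" (end_date == HIGH_DATE) at any time.
HIGH_DATE = "9999-12-31"

def scd2_update(dim_rows: list, customer_key: str, new_attrs: dict, change_date: str) -> None:
    """Close the active row for the key (if attributes changed) and append a new version."""
    for row in dim_rows:
        if row["customer_key"] == customer_key and row["end_date"] == HIGH_DATE:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return  # nothing changed; keep the open row as-is
            row["end_date"] = change_date  # close the old version
            break
    dim_rows.append({"customer_key": customer_key, **new_attrs,
                     "start_date": change_date, "end_date": HIGH_DATE})

dim_customer = [{"customer_key": "C1", "region": "north",
                 "start_date": "2024-01-01", "end_date": HIGH_DATE}]
scd2_update(dim_customer, "C1", {"region": "south"}, "2024-06-01")
print(len(dim_customer))  # 2: the old row closed at 2024-06-01, the new one open-ended
```

Joining facts against the version whose `[start_date, end_date)` range covers the fact date then reproduces history correctly, which is exactly what overwriting in place destroys.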