Foundations
Big data warehouse architectures:
- Offline big data architecture: the traditional architecture, using HDFS for storage and Hive, MapReduce, or Spark for offline computation
- Lambda architecture: adds a real-time processing pipeline on top of the offline architecture; the price is maintaining two codebases, one for batch and one for streaming
- Kappa architecture: unifies batch and stream processing in a single codebase, lowering operational cost. This is a large part of why Flink is popular, and the Kappa architecture has become the new trend in data warehouse design
- Computing framework selection: among real-time computing frameworks such as Storm and Flink, Flink is strongly recommended; its unified batch-stream model and active open-source community are gradually displacing Spark
- Data storage selection: consider query efficiency first, then insert/update behavior. Apache Druid is one option, but note its weakness at updating data; it is not recommended for frequently updated datasets. Storage technology must be chosen case by case, and HBase and Redis each fit different scenarios
- Real-time warehouse layering: for unified data management, a real-time warehouse can reuse the offline warehouse's layered data model. Detailed real-time data can be written to Druid for efficient downstream queries, while lightly summarized data serves downstream aggregate analysis
- Data flow: the real-time warehouse can be fed from a Kafka message queue, so the queued data can be written to a data lake or data warehouse for batch analysis, processed in real time, or written downstream to a data mart for business use
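The data-flow bullet above can be sketched in a few lines: one queue feeds both a batch path (raw landing for later analysis) and a real-time path (an incrementally updated mart). This is a minimal stdlib-only simulation; the list standing in for a Kafka topic, and the lake/mart structures, are assumptions, not real infrastructure.

```python
# Sketch of the data-flow idea: one message queue feeds both a batch
# path (data lake / warehouse) and a real-time path (data mart).
# The "topic" is simulated with a plain list.

topic = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 25},
    {"user": "a", "amount": 5},
]

# Batch path: land raw events untransformed for later offline analysis.
data_lake = list(topic)

# Real-time path: maintain an incrementally updated aggregate (the "mart").
data_mart = {}
for event in topic:
    data_mart[event["user"]] = data_mart.get(event["user"], 0) + event["amount"]

print(data_mart)  # {'a': 15, 'b': 25}
```

The key point is that both consumers read the same stream, so the lake and the mart never disagree about what the source data was.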
Data Lake
What is a data lake? A data lake is a centralized repository that stores all of an organization's structured and unstructured data. A data lake can hold any type of data in its native format, with no limit on size. Data lakes were developed mainly to handle very large data volumes, and they excel at unstructured data.
We usually move data into the lake without transforming it. Each data element in the lake is assigned a unique identifier and tagged with metadata, so that it can be located by query later.
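The "identifier and tags" mechanism above can be sketched as follows. This is a toy in-memory model, assuming UUID-based identifiers and free-form tag strings; a real lake would use a catalog service, but the idea is the same: the raw payload is stored untouched, and the metadata makes it findable.

```python
# Sketch: each element stored in the lake gets a unique id plus tags,
# so it can be found later without a predefined schema.
import uuid

lake = []

def put(raw, tags):
    """Store a raw element as-is, with a generated id and searchable tags."""
    element = {"id": str(uuid.uuid4()), "tags": set(tags), "raw": raw}
    lake.append(element)
    return element["id"]

def find_by_tag(tag):
    """Query elements back by tag; the raw payload is untouched."""
    return [e for e in lake if tag in e["tags"]]

put(b"\x89PNG...", ["image", "logo"])
put('{"user": "a"}', ["json", "clickstream"])

print(len(find_by_tag("json")))  # 1
```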
Data Warehouse
What is a data warehouse? A data warehouse is a large-capacity repository built on top of multiple databases. Its purpose is to store large amounts of structured data and support frequent, repeatable analysis. Typically, a data warehouse aggregates data from various structured sources, usually for business analysis. (Note: some data warehouses can also handle unstructured data, but that is not our focus here.)
Differences
What are the main differences between a data lake and a data warehouse? In terms of storage, data in a lake is unstructured: everything is kept in its original form, all data is retained, and transformation happens only at analysis time. Data in a warehouse is usually extracted from transactional systems.
Before being loaded into the warehouse, data is cleaned and transformed. A lake captures semi-structured and unstructured data, while a warehouse captures structured data organized by a schema. Lakes are well suited to deep analysis of unstructured data; data scientists may apply advanced analysis tools for predictive modeling and statistics.
In a lake, the schema is usually defined after the data is stored (schema-on-read), which lowers up-front effort and offers greater flexibility. In a warehouse, the schema is defined before the data is stored (schema-on-write), which requires cleaning and normalizing the data and therefore means less schema flexibility.
In practice we need both. A data warehouse suits the repeatable reporting common in business practice; a data lake is useful for more exploratory, less predefined analysis.
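The schema-on-write vs. schema-on-read contrast can be made concrete with a small sketch. The record format and field names below are assumptions chosen for illustration: the warehouse path validates and transforms before storing (rejecting a malformed row), while the lake path stores everything verbatim and applies a schema only when the query runs.

```python
# Sketch: schema-on-write (warehouse) vs schema-on-read (lake).
import json

raw_records = ['{"user": "a", "amount": "10"}', '{"user": "b"}']  # messy source

# Schema-on-write: clean and validate BEFORE storing; bad rows are rejected.
warehouse = []
for line in raw_records:
    rec = json.loads(line)
    if "user" in rec and "amount" in rec:
        warehouse.append({"user": rec["user"], "amount": int(rec["amount"])})

# Schema-on-read: store everything verbatim; interpret at query time.
lake = list(raw_records)
amounts = [int(json.loads(r).get("amount", 0)) for r in lake]  # schema applied now

print(len(warehouse), sum(amounts))  # 1 10
```

Note the trade-off in miniature: the warehouse dropped the incomplete row up front, while the lake kept it and had to decide how to interpret the missing field at read time.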
Lambda Architecture
Nathan Marz coined the term Lambda Architecture for a generic, scalable, and fault-tolerant data processing architecture. It aims to process large amounts of data by combining the strengths of batch processing and stream processing.
Layers
From a macro perspective, the processing flow is as follows:
All data entering the system is dispatched to both the batch layer and the speed layer. The batch layer manages the master dataset (an immutable, append-only set of raw data) and precomputes batch views. The serving layer indexes the batch views so they can be queried with low latency. The speed layer handles only recent data, and any incoming query is answered by merging results from the batch views and the real-time views.
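The query-merging step above is the heart of the Lambda read path, and can be sketched in a few lines. Plain dicts stand in for the serving layer's indexed views, and per-user counts are an assumed metric for illustration:

```python
# Sketch of the Lambda query path: a query merges the precomputed
# batch view with the speed layer's real-time view.

batch_view = {"a": 100, "b": 40}   # precomputed over the master dataset
realtime_view = {"a": 3, "c": 7}   # covers only data since the last batch run

def query(key):
    """Merge both views: full historical coverage plus fresh data."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("a"), query("b"), query("c"))  # 103 40 7
```

Each time a batch run completes, its output replaces `batch_view` and the speed layer's state is reset to cover only the data the new batch run has not yet seen.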
Implementation
There are multiple ways to implement a Lambda architecture, because it is agnostic about the underlying technology of each layer. What it does specify is each layer's required access pattern, which narrows the technology choices and helps avoid over-engineering:
- Batch layer: write once, read many times
- Serving layer: random reads, no random writes; batch computation and batch loading
- Speed layer: random reads, random writes; incremental computation
Kappa Architecture
As mentioned earlier, the Lambda architecture has both strengths and weaknesses, and people are divided into supporters and opponents. The Kappa architecture is an alternative proposed by LinkedIn's Jay Kreps, based on his practical experience and his analysis of Lambda's trade-offs.
An obvious problem with the Lambda architecture is that it requires maintaining two codebases, one running on the batch system and one on the real-time system, and the two must produce identical results. This leads designers of such systems to ask: why not improve the stream computing system to handle the whole problem? Why can't a streaming system also process the full dataset? Stream computing is naturally distributed and scales well, so can't we simply raise its parallelism to churn through massive historical data?
Based on this reasoning, Jay proposed Kappa as a simplification of Lambda: the Kappa architecture removes the batch processing system entirely, and data is served quickly through the streaming system alone.
How does a stream computing system recompute over the full dataset? The steps:
- Use Kafka or a similar distributed queue to retain the data for as many days as reprocessing requires
- When a full recomputation is needed, start a new stream computing instance that reads the retained data from the beginning and writes its output to a new result store
- When the new instance has caught up, stop the old stream computing instance and delete the old results
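The three steps above can be sketched with an in-memory simulation. The list stands in for the retained Kafka topic, a fold function stands in for a stream computing instance, and dicts stand in for the result stores; all names are illustrative assumptions:

```python
# Sketch of Kappa reprocessing: replay the retained log from the
# beginning into a new stream instance, then swap result stores.

log = [("a", 1), ("b", 2), ("a", 3)]  # retained queue (e.g. a Kafka topic)

def run_instance(events):
    """A 'stream computing instance': fold events into a result store."""
    store = {}
    for key, value in events:
        store[key] = store.get(key, 0) + value
    return store

old_store = run_instance(log)          # instance currently serving queries
new_store = run_instance(log)          # step 2: new instance replays from offset 0
serving, old_store = new_store, None   # step 3: switch over, drop old results

print(serving)  # {'a': 4, 'b': 2}
```

In a real deployment the new instance would run a *new version* of the processing logic; replaying the same immutable log is what guarantees the old and new results are comparable before the switch.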
Compared with Lambda, the Kappa architecture recomputes historical data only when necessary, and real-time and batch processing share the same code. Some question whether stream processing can sustain the throughput needed for historical data, but this can be addressed by increasing the parallelism of the new instance.
ODS Data
In data warehouse modeling, the raw, unprocessed business-layer data is called ODS (Operational Data Store) data. At internet companies, common ODS data includes business log data (Log) and business DB data. For business DB data, collecting data from relational databases such as MySQL and importing it into Hive is an important part of data warehouse production.
How to Accurately and Efficiently Sync MySQL Data to Hive
A commonly used solution is batch fetch and load: connect directly to MySQL, select the table data, save it locally as an intermediate file, and then load the file into Hive.
Although this solution is simple, problems emerge as the business grows:
- Performance bottleneck: as the business scales, the time spent on Select From MySQL -> Save To LocalFile -> Load To Hive keeps increasing and can no longer meet the downstream warehouse's production deadlines
- Selecting large amounts of data directly from MySQL puts significant load on MySQL, easily causing slow queries and affecting normal business services
- Hive itself lacks SQL primitives such as UPDATE and DELETE (later Hive versions support them, but only with bucketed tables in ORC format), so it cannot properly reflect UPDATEs and DELETEs coming from MySQL
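The fetch-and-load flow above can be sketched end to end. Since a real MySQL and Hive cannot be assumed here, sqlite3 stands in for MySQL, an in-memory CSV for the intermediate local file, and a plain list for the Hive table; the `orders` table and its columns are assumptions for illustration:

```python
# Sketch of batch fetch-and-load:
# Select From MySQL -> Save To LocalFile -> Load To Hive.
import csv
import io
import sqlite3

# "MySQL": a source table with some rows (sqlite3 as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 25)])

# Step 1: SELECT the full table (this full scan is the load on MySQL).
rows = conn.execute("SELECT id, amount FROM orders").fetchall()

# Step 2: dump to an intermediate local file (in-memory CSV here).
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)

# Step 3: LOAD the file into "Hive" by re-parsing it.
buf.seek(0)
hive_table = [(int(i), int(a)) for i, a in csv.reader(buf)]

print(hive_table)  # [(1, 10), (2, 25)]
```

Even in this toy form the weaknesses are visible: step 1 scans the whole source table every run, and nothing in the flow can express "row 2 was updated" or "row 1 was deleted" since the last load.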
To solve these problems thoroughly, we gradually moved to real-time binlog collection and processing. The binlog is MySQL's binary log; it records every data change in MySQL, and MySQL's own master-slave replication is built on the binlog.
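A quick sketch shows why binlog collection handles the UPDATE/DELETE cases that file loads cannot: replaying the stream of change events reconstructs the table's current state. The event tuples below are a simplified assumption for illustration, not MySQL's actual binlog format (in practice a row-based binlog is decoded by a collection tool):

```python
# Sketch: replaying binlog-style change events rebuilds current state,
# including the effect of UPDATEs and DELETEs.

binlog = [
    ("insert", 1, {"id": 1, "amount": 10}),
    ("insert", 2, {"id": 2, "amount": 25}),
    ("update", 1, {"id": 1, "amount": 99}),
    ("delete", 2, None),
]

table = {}
for op, pk, row in binlog:
    if op == "delete":
        table.pop(pk, None)   # remove the row
    else:
        table[pk] = row       # insert and update both upsert by primary key

print(table)  # {1: {'id': 1, 'amount': 99}}
```

A batch file load taken before the last two events would still show row 1 at amount 10 and row 2 present; the replayed log converges to the true current state without ever re-scanning the source table.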