Big Data 257 - Data Quality Monitoring

Why Monitor Data Quality

Data quality monitoring is an ongoing process that ensures data maintains high quality throughout its lifecycle. It typically covers the following aspects (a few of them are illustrated in the sketch after the list):

  • Accuracy: Monitor whether data accurately reflects the real-world state, ensuring data is not corrupted during collection, storage, and transmission
  • Completeness: Check whether datasets contain all required information, ensuring no missing values or blank fields
  • Consistency: Data should remain consistent across different systems. Data quality monitoring ensures data does not conflict across different data sources or platforms
  • Timeliness: Monitor whether data is collected and updated on time. Delayed data affects decision-making accuracy
  • Validity: Ensure data conforms to expected formats or ranges
  • Availability: Monitor whether data is easily accessible and usable
  • Compliance: Ensure data complies with relevant laws, regulations, and company policies
  • Duplication: Monitor for duplicate data records
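
To make a few of these dimensions concrete, the sketch below checks completeness, validity, and duplication on a Hive table with PySpark; the table name, column names, and value range are hypothetical placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").enableHiveSupport().getOrCreate()

# Hypothetical Hive table of orders; the table and column names are placeholders.
df = spark.table("dw.orders")

total = df.count()

# Completeness: how many rows are missing a required field?
missing_user = df.filter(F.col("user_id").isNull()).count()

# Validity: does the amount fall within the expected range?
invalid_amount = df.filter((F.col("amount") < 0) | (F.col("amount") > 1_000_000)).count()

# Duplication: how many rows share the same primary key?
duplicates = total - df.dropDuplicates(["order_id"]).count()

print(f"rows={total}, null user_id={missing_user}, "
      f"invalid amount={invalid_amount}, duplicate order_id={duplicates}")
```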

Data Quality Issues

  1. Data Inconsistency: Early enterprises lacked unified planning; most information systems were built iteratively over time, with different construction timelines and varying data standards across systems
  2. Data Incompleteness: Because enterprise information systems are used in isolation, each business system or module enters data according to its own needs, without unified entry tools or a single data outlet
  3. Data Non-compliance: Without a unified data management platform and a single authoritative data source, there is no management of the complete data lifecycle
  4. Data Redundancy: Different information systems have varying standards for data, coding rules, and validation methods

Monitoring Methods

Design Approach

The design of data quality monitoring breaks down into four modules: data, rules, alerts, and feedback.

  • Data: The data to be monitored, which may live in different storage engines
  • Rules: How to detect anomalies - in general, numeric-threshold checks and period-over-period comparisons are the main approaches
  • Alerts: The action of notifying owners when a rule is violated, for example via WeChat messages, phone calls, or SMS
  • Feedback: The response to an alert; with a feedback mechanism the whole monitoring process forms a closed loop (a minimal rule-and-alert loop is sketched after this list)
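
Here is a minimal sketch of such a rule-and-alert loop, assuming a hypothetical HTTP webhook as the alert channel; the endpoint, metric name, and thresholds are all placeholders.

```python
import requests  # assumes a hypothetical HTTP webhook as the alert channel

def check_rule(metric_value, lower, upper):
    """A simple numeric-threshold rule: the metric must stay inside [lower, upper]."""
    return lower <= metric_value <= upper

def send_alert(message):
    # Placeholder alert channel; in practice this could be WeChat, SMS, or a phone call.
    requests.post("https://alert.example.com/notify", json={"text": message}, timeout=5)

def monitor(metric_name, metric_value, lower, upper):
    if not check_rule(metric_value, lower, upper):
        send_alert(f"[DQ] {metric_name}={metric_value} is outside [{lower}, {upper}]")
        return "alerted"  # an owner's acknowledgement (feedback) closes the loop
    return "ok"

# Example: yesterday's order count fell far below the expected range.
print(monitor("daily_order_count", 1200, 5000, 50000))
```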

Technical Solution

  • Start by focusing on the core monitoring content, such as accuracy, and monitor the core metrics first
  • The monitoring platform should not implement overly complex rule logic; as far as possible, monitor only result data
  • Multiple data sources: the data to be monitored may sit in different storage engines, and there are two general ways to connect them
  • Real-time data monitoring: the main difference from offline monitoring is the scan cycle, so the design can treat offline (batch) monitoring as the primary mode while reserving room for real-time monitoring; a typical offline rule is the period-over-period check sketched below
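
As an example of a period-over-period rule applied to result data, the snippet below flags a metric when it moves more than a configurable percentage versus the previous period; the threshold and sample values are illustrative.

```python
def period_over_period_anomaly(current_value, previous_value, threshold=0.3):
    """Flag the metric if it moved more than `threshold` (30% here) versus the previous period."""
    if previous_value == 0:
        return current_value != 0  # any change from zero is treated as an anomaly
    change = abs(current_value - previous_value) / previous_value
    return change > threshold

# Example: GMV dropped about 40% day over day, exceeding the 30% threshold.
print(period_over_period_anomaly(current_value=600_000, previous_value=1_000_000))  # True
```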

Griffin Architecture

Apache Griffin is an open-source big data quality solution that supports both batch and streaming data quality detection modes. It can measure data assets from different dimensions, thereby improving data accuracy and credibility.

Griffin is mainly divided into three parts: Define, Measure, and Analyze:

  • Define: Mainly responsible for defining the dimensions of data quality statistics, such as the time window of the statistics and the target datasets
  • Measure: Mainly responsible for executing the statistical tasks and generating the results (a simplified accuracy measure is sketched after this list)
  • Analyze: Mainly responsible for storing and displaying the statistical results
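
Conceptually, an accuracy measure compares a source table against a target table and reports the ratio of matched records. The PySpark sketch below reproduces that idea with hypothetical table and column names; it is an illustration, not Griffin's actual implementation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accuracy-measure").enableHiveSupport().getOrCreate()

# Hypothetical source (system of record) and target (replicated) tables.
source = spark.table("ods.orders_src").select("order_id", "amount")
target = spark.table("dw.orders_tgt").select("order_id", "amount")

total = source.count()
# A source row counts as "matched" if an identical row exists in the target.
matched = source.join(target, on=["order_id", "amount"], how="left_semi").count()

accuracy = matched / total if total else 1.0
print(f"total={total}, matched={matched}, accuracy={accuracy:.4f}")
```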

Griffin Key Features

  • Data Quality Assessment: Supports rule-based and model-based quality assessment, allowing definition of rules for completeness, accuracy, consistency, validity, and timeliness
  • Quality Rule Definition and Management: Users can define custom rules, describe data quality requirements in JSON, and check data periodically (an illustrative rule is sketched after this list)
  • Flexible Data Source Support: Supports HDFS, Hive, Kafka, HBase, etc., handling both batch and streaming processing modes
  • Multi-dimensional Data Quality Monitoring: Supports evaluation based on multiple dimensions such as time, location, and data source
  • Visual Interface: View data quality assessment results, reports, warning information, etc.
  • Integration and Compatibility: Highly integrated with Hadoop, Spark, and other big data platforms
  • Automated Repair: Supports automatic repair of some data quality issues, such as filling missing values
  • Extensibility: Provides extension interfaces and plugin mechanisms
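
To illustrate the idea of declarative, JSON-described rules, here is a rule expressed as a JSON-style Python dict; the field names are hypothetical and are not claimed to match Griffin's actual rule schema.

```python
import json

# Illustrative only: a declarative quality rule as a JSON-style dict.
# Field names are hypothetical and do not follow Griffin's actual schema.
completeness_rule = {
    "name": "orders_user_id_completeness",
    "dq.type": "completeness",
    "source": "dw.orders",
    "condition": "user_id IS NOT NULL",
    "schedule": "0 2 * * *",        # run the check daily at 02:00
    "alert.threshold": 0.99,        # alert if completeness drops below 99%
}

print(json.dumps(completeness_rule, indent=2))
```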

Compilation and Installation

Related dependencies:

  • JDK 1.8
  • MySQL 5.6 and above
  • Hadoop (2.6.0 or later)
  • Hive (2.x)
  • Maven
  • Spark (2.2.1)
  • Livy (livy-0.5.0-incubating)
  • Elasticsearch (5.0 or later)

Notes:

  • Spark: computes the batch and real-time metrics
  • Livy: provides a RESTful API through which the Griffin service submits jobs to Apache Spark (see the sketch after this list)
  • Elasticsearch: stores the metric data
  • MySQL: stores the service metadata
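
For reference, submitting a measure job to Spark through Livy's /batches REST endpoint might look roughly like the sketch below; the Livy host, jar path, class name, and job configs are placeholders and assumptions rather than a verified Griffin deployment.

```python
import json
import requests

LIVY_URL = "http://livy-host:8998"  # placeholder Livy endpoint

# Submit a Spark batch job through Livy's /batches REST API.
payload = {
    "file": "hdfs:///griffin/griffin-measure.jar",          # placeholder jar location
    "className": "org.apache.griffin.measure.Application",  # assumed measure entry point
    "args": [
        json.dumps({"spark": {"log.level": "WARN"}}),  # env config (illustrative)
        json.dumps({"name": "accuracy_measure"}),      # DQ job config (illustrative)
    ],
}

resp = requests.post(f"{LIVY_URL}/batches", json=payload, timeout=10)
resp.raise_for_status()
print("submitted batch id:", resp.json().get("id"))
```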