Big Data 257 - Data Quality Monitoring

Why Monitor Data Quality

Data quality monitoring is an ongoing process that ensures data maintains high quality throughout its lifecycle. It typically covers the following aspects (a few of them are illustrated in the sketch after the list):

  • Accuracy: Monitor whether data accurately reflects the real-world state, ensuring data is not corrupted during collection, storage, and transmission
  • Completeness: Check whether datasets contain all required information, ensuring no missing values or blank fields
  • Consistency: Data should remain consistent across different systems. Data quality monitoring ensures data does not conflict across different data sources or platforms
  • Timeliness: Monitor whether data is collected and updated on time. Delayed data affects decision-making accuracy
  • Validity: Ensure data conforms to expected formats or ranges
  • Availability: Monitor whether data is easily accessible and usable
  • Compliance: Ensure data complies with relevant laws, regulations, and company policies
  • Duplication: Monitor for duplicate data records
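
To make a few of these dimensions concrete, the sketch below checks completeness, validity, and duplication on a Hive table with PySpark; the table name, column names, and value range are hypothetical placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").enableHiveSupport().getOrCreate()

# Hypothetical Hive table of orders; the table and column names are placeholders.
df = spark.table("dw.orders")

total = df.count()

# Completeness: how many rows are missing a required field?
missing_user = df.filter(F.col("user_id").isNull()).count()

# Validity: does the amount fall within the expected range?
invalid_amount = df.filter((F.col("amount") < 0) | (F.col("amount") > 1_000_000)).count()

# Duplication: how many rows share the same primary key?
duplicates = total - df.dropDuplicates(["order_id"]).count()

print(f"rows={total}, null user_id={missing_user}, "
      f"invalid amount={invalid_amount}, duplicate order_id={duplicates}")
```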

Data Quality Issues

  1. Data Inconsistency: Early enterprises lacked unified planning; most information systems were built iteratively over time, with different construction timelines and varying data standards across systems
  2. Data Incompleteness: Because enterprise information systems are used in isolation, each business system or module enters data according to its own needs, without unified entry tools or a single data outlet
  3. Data Non-compliance: Without a unified data management platform and a single authoritative data source, there is no management of the complete data lifecycle
  4. Data Redundancy: Different information systems have varying standards for data, coding rules, and validation methods

Monitoring Methods

Design Approach

The design of data quality monitoring breaks down into four modules: data, rules, alerts, and feedback.

  • Data: The data to be monitored, which may live in different storage engines
  • Rules: How to detect anomalies - in general, numeric-threshold checks and period-over-period comparisons are the main approaches
  • Alerts: The action of notifying owners when a rule is violated, for example via WeChat messages, phone calls, or SMS
  • Feedback: The response to an alert; with a feedback mechanism the whole monitoring process forms a closed loop (a minimal rule-and-alert loop is sketched after this list)
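
Here is a minimal sketch of such a rule-and-alert loop, assuming a hypothetical HTTP webhook as the alert channel; the endpoint, metric name, and thresholds are all placeholders.

```python
import requests  # assumes a hypothetical HTTP webhook as the alert channel

def check_rule(metric_value, lower, upper):
    """A simple numeric-threshold rule: the metric must stay inside [lower, upper]."""
    return lower <= metric_value <= upper

def send_alert(message):
    # Placeholder alert channel; in practice this could be WeChat, SMS, or a phone call.
    requests.post("https://alert.example.com/notify", json={"text": message}, timeout=5)

def monitor(metric_name, metric_value, lower, upper):
    if not check_rule(metric_value, lower, upper):
        send_alert(f"[DQ] {metric_name}={metric_value} is outside [{lower}, {upper}]")
        return "alerted"  # an owner's acknowledgement (feedback) closes the loop
    return "ok"

# Example: yesterday's order count fell far below the expected range.
print(monitor("daily_order_count", 1200, 5000, 50000))
```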

Technical Solution

  • Start by focusing on the core monitoring content, such as accuracy, and monitor the core metrics first
  • The monitoring platform should not implement overly complex rule logic; as far as possible, monitor only result data
  • Multiple data sources: the data to be monitored may sit in different storage engines, and there are two general ways to connect them
  • Real-time data monitoring: the main difference from offline monitoring is the scan cycle, so the design can treat offline (batch) monitoring as the primary mode while reserving room for real-time monitoring; a typical offline rule is the period-over-period check sketched below
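
As an example of a period-over-period rule applied to result data, the snippet below flags a metric when it moves more than a configurable percentage versus the previous period; the threshold and sample values are illustrative.

```python
def period_over_period_anomaly(current_value, previous_value, threshold=0.3):
    """Flag the metric if it moved more than `threshold` (30% here) versus the previous period."""
    if previous_value == 0:
        return current_value != 0  # any change from zero is treated as an anomaly
    change = abs(current_value - previous_value) / previous_value
    return change > threshold

# Example: GMV dropped about 40% day over day, exceeding the 30% threshold.
print(period_over_period_anomaly(current_value=600_000, previous_value=1_000_000))  # True
```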

Griffin Architecture

Apache Griffin is an open-source big data quality solution that supports both batch and streaming data quality detection modes. It can measure data assets from different dimensions, thereby improving data accuracy and credibility.

Griffin is mainly divided into three parts: Define, Measure, and Analyze:

  • Define: Mainly responsible for defining the dimensions of data quality statistics, such as the time window of the statistics and the target datasets
  • Measure: Mainly responsible for executing the statistical tasks and generating the results (a simplified accuracy measure is sketched after this list)
  • Analyze: Mainly responsible for storing and displaying the statistical results
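
Conceptually, an accuracy measure compares a source table against a target table and reports the ratio of matched records. The PySpark sketch below reproduces that idea with hypothetical table and column names; it is an illustration, not Griffin's actual implementation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accuracy-measure").enableHiveSupport().getOrCreate()

# Hypothetical source (system of record) and target (replicated) tables.
source = spark.table("ods.orders_src").select("order_id", "amount")
target = spark.table("dw.orders_tgt").select("order_id", "amount")

total = source.count()
# A source row counts as "matched" if an identical row exists in the target.
matched = source.join(target, on=["order_id", "amount"], how="left_semi").count()

accuracy = matched / total if total else 1.0
print(f"total={total}, matched={matched}, accuracy={accuracy:.4f}")
```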

Griffin Key Features

  • Data Quality Assessment: Supports rule-based and model-based quality assessment, allowing definition of rules for completeness, accuracy, consistency, validity, and timeliness
  • Quality Rule Definition and Management: Users can define custom rules, describe data quality requirements in JSON, and check data periodically (an illustrative rule is sketched after this list)
  • Flexible Data Source Support: Supports HDFS, Hive, Kafka, HBase, etc., handling both batch and streaming processing modes
  • Multi-dimensional Data Quality Monitoring: Supports evaluation based on multiple dimensions such as time, location, and data source
  • Visual Interface: View data quality assessment results, reports, warning information, etc.
  • Integration and Compatibility: Highly integrated with Hadoop, Spark, and other big data platforms
  • Automated Repair: Supports automatic repair of some data quality issues, such as filling missing values
  • Extensibility: Provides extension interfaces and plugin mechanisms
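
To illustrate the idea of declarative, JSON-described rules, here is a rule expressed as a JSON-style Python dict; the field names are hypothetical and are not claimed to match Griffin's actual rule schema.

```python
import json

# Illustrative only: a declarative quality rule as a JSON-style dict.
# Field names are hypothetical and do not follow Griffin's actual schema.
completeness_rule = {
    "name": "orders_user_id_completeness",
    "dq.type": "completeness",
    "source": "dw.orders",
    "condition": "user_id IS NOT NULL",
    "schedule": "0 2 * * *",        # run the check daily at 02:00
    "alert.threshold": 0.99,        # alert if completeness drops below 99%
}

print(json.dumps(completeness_rule, indent=2))
```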

Compilation and Installation

Related dependencies:

  • JDK 1.8
  • MySQL 5.6 and above
  • Hadoop (2.6.0 or later)
  • Hive (2.x)
  • Maven
  • Spark (2.2.1)
  • Livy (livy-0.5.0-incubating)
  • Elasticsearch (5.0 or later)

Notes:

  • Spark: computes the batch and real-time metrics
  • Livy: provides a RESTful API through which the Griffin service submits jobs to Apache Spark (see the sketch after this list)
  • Elasticsearch: stores the metric data
  • MySQL: stores the service metadata
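
For reference, submitting a measure job to Spark through Livy's /batches REST endpoint might look roughly like the sketch below; the Livy host, jar path, class name, and job configs are placeholders and assumptions rather than a verified Griffin deployment.

```python
import json
import requests

LIVY_URL = "http://livy-host:8998"  # placeholder Livy endpoint

# Submit a Spark batch job through Livy's /batches REST API.
payload = {
    "file": "hdfs:///griffin/griffin-measure.jar",          # placeholder jar location
    "className": "org.apache.griffin.measure.Application",  # assumed measure entry point
    "args": [
        json.dumps({"spark": {"log.level": "WARN"}}),  # env config (illustrative)
        json.dumps({"name": "accuracy_measure"}),      # DQ job config (illustrative)
    ],
}

resp = requests.post(f"{LIVY_URL}/batches", json=payload, timeout=10)
resp.raise_for_status()
print("submitted batch id:", resp.json().get("id"))
```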