Big Data 255 - Atlas Data Warehouse Metadata Management

Data Warehouse Metadata Management

Metadata, in the narrowest sense, is data that describes other data. More broadly, it covers everything other than the business data that business logic directly reads and processes: all the information needed to maintain the system as a whole, such as table schemas, task lineage relationships, and the permission mappings among users, scripts, and tasks.

The purpose of managing metadata is to enable users to use data more efficiently, and to help platform administrators more effectively perform data maintenance and management work.

A critical function of a metadata management platform is information collection - what information to collect depends on business requirements and the target problems to be solved.

There is no absolute standard for what information should be collected, but for big data development platforms, common metadata includes:

  • Table structure information
  • Storage footprint, read/write records, ownership and permissions, and other statistics
  • Data lineage relationship information
  • Data business attribute information
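
As a rough illustration of how the four categories above might sit together in one record, here is a minimal sketch. The class names, field names, and sample values are all hypothetical; they are not taken from Atlas or any particular platform.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Column:
    name: str
    data_type: str
    comment: str = ""


@dataclass
class TableMetadata:
    # Table structure information
    database: str
    table: str
    columns: List[Column] = field(default_factory=list)
    # Storage, ownership, and statistics
    owner: str = ""
    size_bytes: int = 0
    row_count: int = 0
    # Lineage: fully qualified names of the tables this one is derived from
    upstream_tables: List[str] = field(default_factory=list)
    # Business attributes, for example subject area or sensitivity level
    business_tags: Dict[str, str] = field(default_factory=dict)

    @property
    def qualified_name(self) -> str:
        return f"{self.database}.{self.table}"


# Hypothetical sample record
orders = TableMetadata(
    database="dw",
    table="order_summary",
    columns=[Column("order_id", "bigint"), Column("total_amount", "decimal(18,2)")],
    owner="etl_user",
    upstream_tables=["ods.orders"],
    business_tags={"subject_area": "sales"},
)
print(orders.qualified_name)
```

A real platform would collect far more than this; the point is only that structure, statistics, lineage, and business attributes are separate facets of the same asset.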

Data Lineage

What is lineage information, or Lineage? Simply put, it refers to the upstream and downstream source-to-destination relationships between data - where data comes from and where it goes. If there’s an issue with data, you can trace through the lineage to identify which step caused the problem.

Lineage also lets you derive dependencies between the tasks that produce the data. The scheduling system can use these dependencies for job orchestration, and you can use them to determine which downstream data may be affected by a failed or erroneous task.

Taking Hive tables as an example, analyzing the execution plan of a Hive script makes it possible to pin down lineage relationships at the field level.
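
To make tracing through lineage concrete, the sketch below treats field-level lineage as a directed graph and walks it in both directions. The column names and edges are invented for illustration, and in practice the edges would come from parsing execution plans rather than being written by hand.

```python
from collections import deque

# Hypothetical field-level lineage: upstream column -> columns derived from it
LINEAGE = {
    "ods.orders.amount": ["dw.order_detail.amount"],
    "ods.orders.user_id": ["dw.order_detail.user_id"],
    "dw.order_detail.amount": [
        "ads.daily_sales.total_amount",
        "ads.user_spend.total_amount",
    ],
    "dw.order_detail.user_id": ["ads.user_spend.user_id"],
}


def downstream(node, lineage):
    """Return every column reachable downstream of `node`, in breadth-first order."""
    found = []
    visited = {node}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in visited:
                visited.add(child)
                found.append(child)
                queue.append(child)
    return found


def upstream(node, lineage):
    """Return every column `node` is derived from, by walking the edges in reverse."""
    reverse = {}
    for src, targets in lineage.items():
        for dst in targets:
            reverse.setdefault(dst, []).append(src)
    return downstream(node, reverse)


# Impact analysis: what is affected if ods.orders.amount is wrong?
print(downstream("ods.orders.amount", LINEAGE))
# Root-cause tracing: where does ads.user_spend.total_amount come from?
print(upstream("ads.user_spend.total_amount", LINEAGE))
```

The downstream walk answers the impact question (what a bad upstream field contaminates), while the upstream walk answers the troubleshooting question (which steps fed a suspicious value).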

Atlas Introduction

Atlas is a metadata framework for the Hadoop platform: a set of scalable core governance services that enable enterprises to efficiently and effectively meet compliance requirements in Hadoop and integrate with the entire enterprise data ecosystem.

Apache Atlas provides open metadata management and governance capabilities for organizations to build catalogs of data assets, classify and govern those assets, and provide collaboration capabilities around these data assets for IT teams and data analytics teams.

Apache Atlas is an open-source data governance and metadata management framework that was originally developed by Hortonworks and later became an Apache Software Foundation project.

Atlas Framework Components

Atlas consists of three core parts: metadata collection, storage, and query/display. Additionally, there is an administration console for configuring metadata collection processes, metadata format definitions, and service deployment.

Atlas includes the following components:

  • Core: The core component of Atlas functionality, providing metadata ingestion and export, type system, metadata storage, indexing, and querying
  • Integration: The external integration module for Atlas; external component metadata is managed through this module
  • Metadata Sources: The metadata sources supported by Atlas, provided as plugins; Atlas currently supports extracting and managing metadata from Hive, HBase, Sqoop, Kafka, and Storm
  • Applications: Upper-layer applications for Atlas, used to query metadata types and objects managed by Atlas
  • Graph Engine: Atlas uses a graph model to manage metadata objects

Core Functions

  • Metadata Management: Atlas supports centralized management of data assets in the Hadoop ecosystem (such as HDFS, Hive, HBase, Kafka, and Spark)
  • Data Classification (Tagging): Allows users to tag data assets for classification, search, and access control
  • Data Lineage Analysis: Provides detailed data lineage graphs to track the entire data flow from source to final usage
  • Data Impact Analysis: Through data lineage relationships, analyzes how data modifications affect downstream systems
  • Data Auditing: Records all operations on metadata, including creation, modification, and deletion
  • Extensibility: Atlas provides an extensible meta-model that allows users to define custom metadata models as needed
  • Search and Discovery: Provides advanced search capabilities; users can quickly find relevant data assets by name, tag, or attributes
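
Several of these functions are typically reached over the Atlas REST interface. The sketch below only builds request URLs and does not contact a server. It assumes an Atlas instance at localhost on port 21000 and assumes the v2 basic-search and lineage endpoints with the parameter names shown; check both against the REST documentation for your Atlas version. The helper names and the sample classification `PII` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed address of the Atlas server; adjust to your deployment
ATLAS_BASE_URL = "http://localhost:21000"


def basic_search_url(type_name, classification=None, query=None, limit=25):
    """Build a basic-search URL filtering by entity type, tag, and free text."""
    params = {"typeName": type_name, "limit": limit}
    if classification:
        params["classification"] = classification
    if query:
        params["query"] = query
    return f"{ATLAS_BASE_URL}/api/atlas/v2/search/basic?{urlencode(params)}"


def lineage_url(guid, direction="BOTH", depth=3):
    """Build a lineage URL for one entity; `guid` is the entity's Atlas GUID."""
    params = {"direction": direction, "depth": depth}
    return f"{ATLAS_BASE_URL}/api/atlas/v2/lineage/{guid}?{urlencode(params)}"


# Find Hive tables carrying a (hypothetical) PII classification
print(basic_search_url("hive_table", classification="PII"))
# Request lineage for a placeholder GUID
print(lineage_url("<entity-guid>"))
```

These URLs would then be sent with any HTTP client using the credentials configured for the Atlas server.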

Installation Configuration

Installation Dependencies

  • Maven (must be installed)
  • HBase (binary package)
  • Solr (binary package)
  • Atlas (compiled from source)

The official release ships source code only, with no binary installation package, so Atlas must be compiled from source.

Extraction and Configuration

After extracting the source archive, set the Hadoop version property in pom.xml:

<hadoop.version>2.9.2</hadoop.version>
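
If you prefer to script this change instead of editing the file by hand, the following is a minimal sketch. The function name is hypothetical and the sample POM fragment, including its starting version, is made up for the demonstration. It rewrites only the value inside the property element.

```python
import re


def set_hadoop_version(pom_text, version):
    """Replace the value of <hadoop.version>, leaving the rest of the POM untouched."""
    pattern = r"(<hadoop\.version>)[^<]*(</hadoop\.version>)"
    new_text, count = re.subn(
        pattern,
        lambda m: m.group(1) + version + m.group(2),
        pom_text,
    )
    if count == 0:
        raise ValueError("no <hadoop.version> property found")
    return new_text


# Hypothetical POM fragment used only for this demonstration
sample = "<properties>\n  <hadoop.version>0.0.0</hadoop.version>\n</properties>"
updated = set_hadoop_version(sample, "2.9.2")
print(updated)
```

To apply it to the real file, read pom.xml as text, pass it through the function, and write the result back.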

Compiling Source Code

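cd /opt/software/apache-atlas-sources-1.2.0
# Give Maven a 2 GB heap for the build
export MAVEN_OPTS="-Xms2g -Xmx2g"
# Build the distribution with embedded HBase and Solr, skipping tests
mvn clean -DskipTests package -Pdist,embedded-hbase-solr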

The build needs approximately 600 MB of JAR dependencies. The build output is located in:

cd /opt/software/apache-atlas-sources-1.2.0/distro/target

This directory should contain apache-atlas-1.2.0-bin.tar.gz.