Big Data 258 - Griffin Architecture with Livy

Livy

Introduction

Livy is a REST interface for Apache Spark designed to simplify Spark job submission and management, especially in big data processing scenarios. Its main function is to interact with Spark clusters via REST API, allowing users to submit jobs, execute code snippets, and query job status and results without directly interacting with Spark’s underlying architecture.

Some key features of Livy include:

  • Simplified Spark Job Submission: Users can send Spark jobs to Livy via HTTP requests instead of using the spark-submit command directly
  • Multi-language Support: Livy supports submitting jobs using different programming languages, including Python (via PySpark), Scala, and R
  • Session Management: Livy can manage Spark sessions, allowing multiple users to share sessions on the same cluster and execute interactive code snippets
  • Job Status Management: Livy provides APIs to view job status, logs, and results, facilitating task execution tracking and monitoring
  • Integration: Livy is typically integrated with tools like Jupyter Notebook and Zeppelin, allowing users to execute code on Spark clusters within these environments
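As a sketch of the first point above, a batch job can be submitted with a plain HTTP POST instead of `spark-submit`. The host, jar path, and example class below are placeholder assumptions, not values from this guide (8998 is Livy's default port):

```shell
# Hedged sketch of Livy's batch REST API; host and jar path are assumptions.
LIVY_URL="http://h121.wzk.icu:8998"
PAYLOAD='{"file": "hdfs:///spark/jars/spark-examples.jar",
          "className": "org.apache.spark.examples.SparkPi",
          "args": ["100"]}'
echo "POST $LIVY_URL/batches"
echo "$PAYLOAD"
# With the server running, the job would be submitted with:
# curl -s -X POST -H "Content-Type: application/json" -d "$PAYLOAD" "$LIVY_URL/batches"
```

Livy replies with a JSON description of the batch, including an id that can then be polled through the batches endpoint to track job status.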

Configuration Plan

Our Spark installation spans three cluster nodes, but Livy does not need a cluster deployment. We plan to install it on the master node only:

h121.wzk.icu

Extraction and Configuration

cd /opt/software
unzip livy-0.5.0-incubating-bin.zip

mv livy-0.5.0-incubating-bin/ ../servers/livy-0.5.0

Environment Variables

# Set environment variables
vim /etc/profile

# Remember to refresh after setting
source /etc/profile

Write the following content:

export LIVY_HOME=/opt/servers/livy-0.5.0
export PATH=$PATH:$LIVY_HOME/bin

Modify livy.conf

mv $LIVY_HOME/conf/livy.conf.template $LIVY_HOME/conf/livy.conf
vim $LIVY_HOME/conf/livy.conf

Modify as follows:

# Listen on all interfaces so the server is reachable from other nodes
livy.server.host = 0.0.0.0
# Submit Spark jobs to YARN, with the driver running inside the cluster
livy.spark.master = yarn
livy.spark.deployMode = cluster
# Expose a HiveContext in interactive sessions
livy.repl.enable-hive-context = true

Modify livy-env.sh

mv $LIVY_HOME/conf/livy-env.sh.template $LIVY_HOME/conf/livy-env.sh
vim $LIVY_HOME/conf/livy-env.sh

Write the following content:

# These must match the local Spark and Hadoop install paths
export SPARK_HOME=/opt/servers/spark-2.4.5
export HADOOP_HOME=/opt/servers/hadoop-2.9.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Start Service

cd /opt/servers/livy-0.5.0
mkdir logs
# Redirect output so it lands with the other logs instead of nohup.out
nohup $LIVY_HOME/bin/livy-server > logs/livy.out 2>&1 &
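After startup, a quick sanity check from the Livy node (a sketch; 8998 is the default livy.server.port, and an empty session list is the expected reply from a fresh server):

```shell
LIVY_PORT=8998   # default livy.server.port
# Is anything listening on the port yet?
ss -lnt | grep "$LIVY_PORT" || echo "port $LIVY_PORT not listening yet"
# Query the REST API; a fresh server returns an empty session list
curl -s "http://localhost:$LIVY_PORT/sessions" || true
```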

Griffin

Introduction

Apache Griffin is an open-source data quality solution for big data, originally developed at eBay and later donated to the Apache Software Foundation. It supports quality measurement for both batch and streaming data. Griffin's design goal is to simplify and automate data quality checking and monitoring, enabling efficient quality measurement in big data applications while improving data traceability and reliability.

As big data technology has developed, real-time stream processing systems (such as Apache Kafka and Apache Flink) play an increasingly important role in many application scenarios. Processing these data streams demands not only efficient computation and high throughput, but also data traceability and quality monitoring. Griffin emerged to meet these challenges, providing an efficient and scalable solution focused on data quality monitoring and measurement job management.

Griffin’s goals include:

  • Data Quality Control: Automated data quality checks to ensure accuracy and completeness of stream data
  • Stream Processing Job Management: Effective scheduling and monitoring of stream processing jobs, timely anomaly detection
  • Data Pipeline Transparency: Provide clear visibility into the data flow process for easy tracking and troubleshooting
  • Real-time Data Processing: Combine with stream processing systems like Apache Kafka and Apache Flink to optimize data processing efficiency

Main Functions

Griffin’s functions focus on data quality control and stream processing management. Here are some core functions:

  • Data Quality Monitoring: Griffin can monitor various quality issues in data streams in real-time, such as data loss, duplicates, format errors, etc., and generate reports or warnings. This is particularly important for real-time analysis and decision-making processes requiring high-quality data.
  • Stream Data Governance: It helps manage every aspect of data flow, ensuring data meets quality requirements during processing, reducing errors and biases caused by data issues.
  • Job Scheduling and Monitoring: Schedule stream processing jobs, monitor their status and performance in real-time. It can identify possible bottlenecks in the system to ensure efficient execution of stream processing tasks.
  • Data Traceability: When data problems occur, Griffin provides traceability to data sources, helping users quickly locate the root cause of problems, ensuring data transparency and traceability.

Architecture Design

Griffin’s architecture typically includes the following components:

  • Data Ingestion Layer: Responsible for ingesting data streams from data sources. Common data sources include Apache Kafka, Apache Flink, and other stream processing systems.
  • Data Processing Layer: Performs quality checks and stream processing operations on data. This layer may include data cleaning, validation, and other operations to ensure stream data meets expected quality standards.
  • Monitoring and Alerting Layer: Provides monitoring and alerting mechanisms for stream processing jobs, promptly detecting problems and notifying administrators.
  • Data Storage and Visualization Layer: Stores analysis results and quality reports, displaying them through a visualization interface to help users understand the status of data streams.

Extraction and Configuration

Software extraction (here I am on node h122):

cd /opt/software
unzip griffin-griffin-0.5.0.zip

mv griffin-griffin-0.5.0/ ../servers/griffin-0.5.0/
cd ../servers/griffin-0.5.0

SQL Initialization

Create a database named quartz in MySQL and initialize it with the provided script:

# SQL file is as follows
/opt/servers/griffin-0.5.0/service/src/main/resources/Init_quartz_mysql_innodb.sql

Note: the script needs one small modification: add use quartz; at the top of the file so the DDL runs against the right database.

vim /opt/servers/griffin-0.5.0/service/src/main/resources/Init_quartz_mysql_innodb.sql
# Add at the top of the file:
use quartz;
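The same edit can be made non-interactively with sed. The sketch below demonstrates it on a scratch copy with a stand-in DDL line; on the Griffin node, point INIT_SQL at the real Init_quartz_mysql_innodb.sql path instead:

```shell
# Demonstrated on a throwaway file; the CREATE TABLE line is a stand-in,
# not the real Quartz DDL.
INIT_SQL=$(mktemp)
printf 'CREATE TABLE QRTZ_LOCKS (LOCK_NAME VARCHAR(40));\n' > "$INIT_SQL"
sed -i '1i use quartz;' "$INIT_SQL"   # prepend the database selector
head -n 1 "$INIT_SQL"                 # prints: use quartz;
```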

Create the database:

# Execute inside the MySQL client
create database quartz;

Run the SQL file to create the tables:

# Execute from the shell (not inside the MySQL client)
cd /opt/servers/griffin-0.5.0/service/src/main/resources/
mysql -uhive -phive@wzk.icu < Init_quartz_mysql_innodb.sql
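To confirm the import worked, you can list the Quartz tables (same hive credentials as above; the guard is only there so the snippet degrades gracefully on nodes where no MySQL client or server is reachable):

```shell
QUERY="SHOW TABLES FROM quartz LIKE 'QRTZ%';"
# Run where the mysql client can reach the server; otherwise just report the query
command -v mysql >/dev/null 2>&1 && mysql -uhive -phive@wzk.icu -e "$QUERY" \
  || echo "mysql not reachable here; query to run: $QUERY"
```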

Hadoop and Hive

Create the /spark/spark_conf directory on HDFS and upload the Hive configuration file hive-site.xml to it:

hdfs dfs -mkdir -p /spark/spark_conf
hdfs dfs -put $HIVE_HOME/conf/hive-site.xml /spark/spark_conf/
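A quick verification that the file landed (hedged: the fallback message only fires where no HDFS client is configured):

```shell
# List the target directory; hive-site.xml should appear in the output
LISTING=$(command -v hdfs >/dev/null 2>&1 && hdfs dfs -ls /spark/spark_conf/ 2>/dev/null \
  || echo "hdfs client not available; expected hive-site.xml under /spark/spark_conf/")
echo "$LISTING"
```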

Note: Upload the hive-site.xml file from the node where Griffin is installed to the corresponding HDFS directory

Environment Variables

Confirm the following required environment variables are all configured:

  • JAVA_HOME
  • SPARK_HOME
  • LIVY_HOME
  • HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
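A small sanity-check loop (a sketch) that flags any of the four variables that are unset or do not point at a real directory; anything flagged MISSING must be exported in /etc/profile as above:

```shell
# Check each required variable; OK means it is set and is a directory.
STATUS=""
for v in JAVA_HOME SPARK_HOME LIVY_HOME HADOOP_CONF_DIR; do
  val=$(eval echo "\$$v")
  if [ -n "$val" ] && [ -d "$val" ]; then
    STATUS="$STATUS $v=OK"
  else
    STATUS="$STATUS $v=MISSING"
  fi
done
echo "$STATUS"
```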