Big Data 258 - Griffin Architecture with Livy
Livy
Introduction
Livy is a REST service for Apache Spark designed to simplify Spark job submission and management, especially in big data processing scenarios. Its core role is to expose a Spark cluster through a REST API, letting users submit jobs, execute code snippets, and query job status and results without dealing with Spark's internals directly.
Some key features of Livy include:
- Simplified Spark Job Submission: Users can send Spark jobs to Livy via HTTP requests instead of using the spark-submit command directly
- Multi-language Support: Livy supports submitting jobs using different programming languages, including Python (via PySpark), Scala, and R
- Session Management: Livy can manage Spark sessions, allowing multiple users to share sessions on the same cluster and execute interactive code snippets
- Job Status Management: Livy provides APIs to view job status, logs, and results, facilitating task execution tracking and monitoring
- Integration: Livy is typically integrated with tools like Jupyter Notebook and Zeppelin, allowing users to execute code on Spark clusters within these environments
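Concretely, the job-submission feature boils down to plain HTTP calls. The sketch below builds the JSON payload for Livy's POST /batches endpoint; the jar path, class name, and localhost address are illustrative assumptions, so the actual curl calls are left commented out:

```shell
# Sketch: submitting a Spark batch job through Livy's REST API.
# POST /batches is Livy's batch-submission endpoint; the jar path,
# class name, and localhost URL below are illustrative assumptions.
LIVY_URL="http://localhost:8998"
PAYLOAD='{"file": "hdfs:///jars/spark-examples.jar", "className": "org.apache.spark.examples.SparkPi", "args": ["100"]}'

# Submit the batch (uncomment once the server is running):
# curl -s -X POST -H "Content-Type: application/json" -d "$PAYLOAD" "$LIVY_URL/batches"

# Poll the batch state afterwards (0 is the id returned by the submit call):
# curl -s "$LIVY_URL/batches/0"

echo "$PAYLOAD"
```

The same pattern works for interactive sessions via POST /sessions, which is what notebook integrations such as Zeppelin use under the hood.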
Configuration Plan
Our Spark cluster spans three nodes, but Livy itself does not need to run as a cluster; a single instance is enough. We plan to configure it on the master node:
h121.wzk.icu
Extraction and Configuration
cd /opt/software
unzip livy-0.5.0-incubating-bin.zip
mv livy-0.5.0-incubating-bin/ ../servers/livy-0.5.0
Environment Variables
# Set environment variables
vim /etc/profile
# Remember to refresh after setting
source /etc/profile
Write the following content:
export LIVY_HOME=/opt/servers/livy-0.5.0
export PATH=$PATH:$LIVY_HOME/bin
Modify Configuration
mv $LIVY_HOME/conf/livy.conf.template $LIVY_HOME/conf/livy.conf
vim $LIVY_HOME/conf/livy.conf
Modify as follows:
livy.server.host = 0.0.0.0
livy.spark.master = yarn
livy.spark.deploy-mode = cluster
livy.repl.enable-hive-context = true
Modify configuration file
mv $LIVY_HOME/conf/livy-env.sh.template $LIVY_HOME/conf/livy-env.sh
vim $LIVY_HOME/conf/livy-env.sh
Write the following content:
export SPARK_HOME=/opt/servers/spark-2.4.5
export HADOOP_HOME=/opt/servers/hadoop-2.9.2/
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Start Service
cd /opt/servers/livy-0.5.0
mkdir logs
nohup $LIVY_HOME/bin/livy-server > logs/livy-server.out 2>&1 &
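Once the server is up, it can be sanity-checked over the REST port (8998 is Livy's default, set by livy.server.port; the host name below is this setup's master node):

```shell
# Sanity-check sketch for a freshly started Livy server.
# Host name is an assumption from this setup; 8998 is Livy's default REST port.
LIVY_HOST="h121.wzk.icu"
LIVY_PORT=8998

# A fresh server with no sessions answers something like:
#   {"from":0,"total":0,"sessions":[]}
# curl -s "http://${LIVY_HOST}:${LIVY_PORT}/sessions"

# Or verify the JVM process and the listening port on the node itself:
# jps -l | grep -i livy
# ss -lnt | grep ":${LIVY_PORT}"

echo "http://${LIVY_HOST}:${LIVY_PORT}/sessions"
```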
Griffin
Introduction
Apache Griffin is an open-source data quality solution for big data, originally developed at eBay and later donated to the Apache Software Foundation. It measures data quality in both batch and streaming modes. Griffin's design goal is to simplify and automate data quality checking and monitoring work, so that big data applications can rely on accurate, complete data while improving data traceability and reliability.
As big data technology has matured, real-time stream processing (with systems such as Apache Kafka and Spark Streaming) has come to play an increasingly important role in many application scenarios. Processing these data streams demands not only efficient computation and high throughput but also data traceability and monitoring. Griffin emerged to meet these challenges, aiming to provide an efficient, scalable solution, especially for data quality monitoring and the management of data quality jobs.
Griffin’s goals include:
- Data Quality Control: Automated data quality checks to ensure accuracy and completeness of stream data
- Stream Processing Job Management: Effective scheduling and monitoring of stream processing jobs, timely anomaly detection
- Data Pipeline Transparency: Provide clear visibility into the data flow process for easy tracking and troubleshooting
- Real-time Data Processing: Integrate with streaming systems such as Apache Kafka and Spark Streaming so that quality checks keep pace with the data flow
Main Functions
Griffin’s functions focus on data quality control and stream processing management. Here are some core functions:
- Data Quality Monitoring: Griffin can monitor various quality issues in data streams in real-time, such as data loss, duplicates, format errors, etc., and generate reports or warnings. This is particularly important for real-time analysis and decision-making processes requiring high-quality data.
- Stream Data Governance: It helps manage every aspect of data flow, ensuring data meets quality requirements during processing, reducing errors and biases caused by data issues.
- Job Scheduling and Monitoring: Schedule stream processing jobs, monitor their status and performance in real-time. It can identify possible bottlenecks in the system to ensure efficient execution of stream processing tasks.
- Data Traceability: When data problems occur, Griffin provides traceability to data sources, helping users quickly locate the root cause of problems, ensuring data transparency and traceability.
Architecture Design
Griffin’s architecture typically includes the following components:
- Data Ingestion Layer: Responsible for ingesting data from data sources. Common sources include Apache Kafka streams, Hive tables, and files on HDFS.
- Data Processing Layer: Performs quality checks and stream processing operations on data. This layer may include data cleaning, validation, and other operations to ensure stream data meets expected quality standards.
- Monitoring and Alerting Layer: Provides monitoring and alerting mechanisms for stream processing jobs, promptly detecting problems and notifying administrators.
- Data Storage and Visualization Layer: Stores analysis results and quality reports, displaying them through a visualization interface to help users understand the status of data streams.
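In practice these layers are driven by JSON measure definitions that Griffin's service submits to Spark. The sketch below writes a heavily simplified accuracy measure; the field names only approximate Griffin's measure schema and are assumptions here, so treat the format in the Griffin documentation as authoritative:

```shell
# Sketch: a simplified Griffin "accuracy" measure definition.
# Field names approximate Griffin's measure JSON and are assumptions;
# real measures also declare data sources, sinks, and timestamps.
cat > /tmp/accuracy_measure.json <<'EOF'
{
  "name": "accuracy_demo",
  "process.type": "batch",
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "rule": "source.id = target.id"
      }
    ]
  }
}
EOF
grep '"dq.type"' /tmp/accuracy_measure.json
```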
Extraction and Configuration
Extract the software (here on node h122):
cd /opt/software
unzip griffin-griffin-0.5.0.zip
mv griffin-griffin-0.5.0/ ../servers/griffin-0.5.0/
cd ../servers/griffin-0.5.0
SQL Initialization
Create database quartz in MySQL and initialize it
# SQL file is as follows
/opt/servers/griffin-0.5.0/service/src/main/resources/Init_quartz_mysql_innodb.sql
Note: one small modification is needed: add use quartz; at the top of the file so the tables are created in the quartz database
vim /opt/servers/griffin-0.5.0/service/src/main/resources/Init_quartz_mysql_innodb.sql
# Write
use quartz;
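The edit can also be scripted instead of done in vim. The sketch below prepends the line with sed on a scratch copy (the /tmp path and placeholder content are stand-ins so it runs anywhere; on the server, point it at the real script):

```shell
# Sketch: prepend "use quartz;" without opening vim.
# SQL_FILE and its placeholder content are stand-ins; on the server, copy
# Init_quartz_mysql_innodb.sql here first (or run sed on the file directly).
SQL_FILE=/tmp/Init_quartz_mysql_innodb.sql
printf 'SELECT 1;\n' > "$SQL_FILE"      # placeholder for the real script body
sed -i '1i use quartz;' "$SQL_FILE"     # insert before the first line (GNU sed)
head -n 1 "$SQL_FILE"                   # -> use quartz;
```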
Execute the following to create the database:
# Run inside the MySQL client
create database quartz;
Run the SQL file:
# Run from the shell (outside the MySQL client) to create the tables
cd /opt/servers/griffin-0.5.0/service/src/main/resources/
mysql -uhive -phive@wzk.icu < Init_quartz_mysql_innodb.sql
Hadoop and Hive
Create /spark/spark_conf directory on HDFS and upload the Hive configuration file hive-site.xml to that directory
hdfs dfs -mkdir -p /spark/spark_conf
hdfs dfs -put $HIVE_HOME/conf/hive-site.xml /spark/spark_conf/
Note: Upload the hive-site.xml file from the node where Griffin is installed to the corresponding HDFS directory
Environment Variables
Confirm the following required environment variables are all configured:
- JAVA_HOME
- SPARK_HOME
- LIVY_HOME
- HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
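A quick guard script can confirm these before launching Griffin jobs. The check_env helper below is a name introduced for this sketch, not part of Griffin:

```shell
# Sketch: fail fast when a required environment variable is unset or empty.
# check_env is a helper introduced for this sketch, not part of Griffin.
check_env() {
  missing=0
  for v in "$@"; do
    eval "val=\${$v}"               # indirect lookup, POSIX-compatible
    if [ -z "$val" ]; then
      echo "missing: $v"
      missing=1
    fi
  done
  return "$missing"
}

# Usage on the Griffin node:
# check_env JAVA_HOME SPARK_HOME LIVY_HOME HADOOP_CONF_DIR || exit 1
```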