Big Data 251 - Airflow Installation

Basic Introduction to Airflow

Apache Airflow is an open-source workflow scheduling and management platform for orchestrating complex data processing tasks. Originally developed at Airbnb, it was donated to the Apache Software Foundation in 2016. Airflow's defining feature is that tasks and their dependencies are expressed as code, with built-in scheduling and monitoring, which makes it well suited to complex big data pipelines.

Airflow Features

Code-Centric

Airflow uses Python to define DAGs, providing flexibility and programmability.

Highly Extensible

Users can customize Operators and Hooks to integrate various data sources and tools.

Powerful UI Interface

Provides a visual interface for monitoring task status, viewing logs, retrying failed tasks, and more.

Rich Scheduling Options

Supports both time-based and event-based scheduling.

High Availability

When combined with executors like Celery and Kubernetes, it supports distributed architecture, suitable for handling large-scale tasks.

Use Cases

Data Pipeline Scheduling

Used to manage ETL processes from source to destination. For example, daily data extraction from databases, cleaning, and storage into data warehouses.

Machine Learning Workflow Management

Schedules data preprocessing, model training, and model deployment tasks.

Data Validation

Automatically checks data quality and consistency.

Scheduled Task Automation

Periodic log cleanup, data archiving, or report generation.

Airflow Installation and Deployment

Installation Dependencies

  • CentOS 7.x
  • Python 3.5 or higher
  • MySQL 5.7.x
  • Apache Airflow 1.10.11
  • Virtual machine with internet access (packages are installed online)

Installation Commands

pip install apache-airflow -i https://pypi.tuna.tsinghua.edu.cn/simple

# The following two packages may not be needed; if errors occur later,
# install any missing dependencies as needed
pip install mysqlclient -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install SQLAlchemy -i https://pypi.tuna.tsinghua.edu.cn/simple

Environment Variables

# Set the Airflow home directory (where the configuration file lives)
# Add to /etc/profile, then run: source /etc/profile
# If not set, the default is ~/airflow
export AIRFLOW_HOME=/opt/servers/airflow
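The fallback behavior can be illustrated with a small Python sketch (this mimics how the home directory is resolved conceptually; it is not Airflow's actual code):

```python
import os

def resolve_airflow_home():
    """Return AIRFLOW_HOME if set, otherwise fall back to ~/airflow."""
    return os.environ.get("AIRFLOW_HOME", os.path.expanduser("~/airflow"))

os.environ["AIRFLOW_HOME"] = "/opt/servers/airflow"
print(resolve_airflow_home())  # -> /opt/servers/airflow
```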

Initialize Environment

airflow initdb

At this point, you need to modify the configuration file:

vim /opt/servers/airflow/airflow.cfg

Check sql_alchemy_conn and modify it to:

mysql://hive:hive%40wzk.icu@h122.wzk.icu:3306/airflow_db
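If a database password contains special characters such as @, they must be percent-encoded inside the connection URL, which is why the password appears as hive%40wzk.icu above. Python's standard library can do the encoding (the user, password, and host below are the example values from this guide):

```python
from urllib.parse import quote_plus

user = "hive"
password = "hive@wzk.icu"  # '@' must be percent-encoded inside a URL
host = "h122.wzk.icu"

# Build the sql_alchemy_conn value with the password safely encoded
conn = "mysql://{}:{}@{}:3306/airflow_db".format(user, quote_plus(password), host)
print(conn)  # -> mysql://hive:hive%40wzk.icu@h122.wzk.icu:3306/airflow_db
```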

After modifying, save and re-run the initialization (make sure the airflow_db database has already been created in MySQL). On Airflow 2.x the command is:

airflow db init

On the 1.10.x series listed in the dependencies, the equivalent command is airflow initdb.

Create User

Note: airflow users create is the Airflow 2.x syntax; on 1.10.x the equivalent command is airflow create_user.

airflow users create \
   --username wzkicu \
   --firstname wzk \
   --lastname icu \
   --role Admin \
   --email airflow@wzk.icu

Start Services

# -D runs each process in the background as a daemon
airflow scheduler -D
airflow webserver -D

Access the Service

http://h122.wzk.icu:8080

Log in with the username and password created earlier.

Web Interface Feature Overview

  • Trigger Dag: Manually trigger execution
  • Tree View: While a DAG is running, click to view each Task’s execution status in a tree-based view. Statuses: success, running, failed, skipped, retry, queued, no status
  • Graph View: View each Task’s execution status based on a graph view (Directed Acyclic Graph)
  • Task Duration: Statistics for each Task’s execution time, can select how many recent executions to view
  • Task Tries: Retry count for each Task
  • Gantt View: Gantt chart view of each Task’s execution status
  • Code View: View task execution code
  • Logs: View execution logs, such as failure reasons
  • Refresh: Refresh DAG tasks
  • Delete DAG: Delete the DAG task