Big Data 251 - Airflow Installation
Basic Introduction to Airflow
Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing pipelines. Originally developed by Airbnb, it was donated to the Apache Software Foundation in 2016. Its defining feature is that tasks and their dependencies are expressed as code, with built-in scheduling and monitoring, which makes it well suited to complex big data workloads.
Airflow Features
Code-Centric
Airflow uses Python to define DAGs, providing flexibility and programmability.
Highly Extensible
Users can customize Operators and Hooks to integrate various data sources and tools.
Powerful UI Interface
Provides a visual interface for monitoring task status, viewing logs, retrying failed tasks, and more.
Rich Scheduling Options
Supports both time-based and event-based scheduling.
High Availability
With an executor such as Celery or Kubernetes, Airflow distributes task execution across a cluster, making it suitable for large-scale workloads.
Use Cases
Data Pipeline Scheduling
Used to manage ETL processes from source to destination. For example, daily data extraction from databases, cleaning, and storage into data warehouses.
Machine Learning Workflow Management
Schedules data preprocessing, model training, and model deployment tasks.
Data Validation
Automatically checks data quality and consistency.
Scheduled Task Automation
Scheduled log cleanup, data archiving, or report generation.
Airflow Installation and Deployment
Installation Dependencies
- CentOS 7.x
- Python 3.5 or higher
- MySQL 5.7.x
- Apache Airflow 1.10.11
- A virtual machine with internet access (packages are installed online)
Installation Commands
pip install apache-airflow -i https://pypi.tuna.tsinghua.edu.cn/simple
# The packages below may not be needed; if errors occur later, install the missing dependencies as needed
pip install mysqlclient -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install SQLAlchemy -i https://pypi.tuna.tsinghua.edu.cn/simple
Environment Variables
# Set the Airflow home directory (this is where airflow.cfg is generated)
# Add to /etc/profile and re-source it; if unset, the default is ~/airflow
export AIRFLOW_HOME=/opt/servers/airflow
Initialize Environment
airflow initdb    # Airflow 1.10.x syntax; on Airflow 2.x use: airflow db init
At this point, you need to modify the configuration file:
vim /opt/servers/airflow/airflow.cfg
Locate the sql_alchemy_conn setting and change it to:
mysql://hive:hive%40wzk.icu@h122.wzk.icu:3306/airflow_db
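Note that `%40` is the URL-encoded `@` of the password `hive@wzk.icu`: special characters in credentials must be percent-encoded, or SQLAlchemy will mis-parse the connection URL. The stdlib can do the encoding for you:

```python
from urllib.parse import quote_plus

# The password from this tutorial contains '@', which clashes with the URL syntax
password = "hive@wzk.icu"
encoded = quote_plus(password)  # '@' becomes '%40'

conn = "mysql://hive:{}@h122.wzk.icu:3306/airflow_db".format(encoded)
print(conn)
```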
After saving the change, re-run the initialization (make sure the airflow_db database already exists in MySQL):
airflow db init    # Airflow 2.x syntax; on 1.10.x use: airflow initdb
Create User
# On Airflow 1.10.x the command is "airflow create_user"; you will be prompted to set a password
airflow users create \
--username wzkicu \
--firstname wzk \
--lastname icu \
--role Admin \
--email airflow@wzk.icu
Start Services
airflow scheduler -D    # -D runs the process as a daemon
airflow webserver -D    # the web UI listens on port 8080 by default
Access the Service
http://h122.wzk.icu:8080
Log in with the username and password created earlier.
Web Interface Feature Overview
- Trigger DAG: manually trigger a DAG run
- Tree View: while a DAG is running, click to view each Task’s execution status in a tree layout. Statuses: success, running, failed, skipped, retry, queued, no status
- Graph View: view each Task’s execution status on the DAG’s graph (Directed Acyclic Graph)
- Task Duration: Statistics for each Task’s execution time, can select how many recent executions to view
- Task Tries: Retry count for each Task
- Gantt View: Gantt chart view of each Task’s execution status
- Code View: View task execution code
- Logs: View execution logs, such as failure reasons
- Refresh: Refresh DAG tasks
- Delete DAG: delete the DAG task