Big Data 193 - Apache Tez Practice
Basic Introduction
Apache Tez (the name is Hindi for “speed”) is a data processing framework in the Hadoop ecosystem, designed to speed up batch processing and interactive queries. It is a top-level open-source project of the Apache Software Foundation, was originally developed by Hortonworks, and has become an important data processing component in the Hadoop ecosystem.
Tez’s core design goal is to serve as an alternative execution engine to MapReduce, improving processing efficiency through a more flexible execution model. Compared to traditional MapReduce, Tez has these advantages:
- Uses a Directed Acyclic Graph (DAG) execution model, which allows more complex data processing pipelines
- Supports dynamic task scheduling and resource allocation
- Reduces the disk I/O overhead of intermediate results
- Provides finer-grained control over task execution
In practical applications, Tez is widely used in:
- Hive query acceleration (as execution engine)
- Pig script processing
- Complex ETL flows
- Interactive analysis queries
In terms of technical architecture, Tez includes these key components:
- Tez API: Provides programming interface
- Tez Runtime: Execution engine core
- DAG Scheduler: Manages task execution order
- Resource Manager Interface: Integrates with YARN
Reported performance tests show Tez delivering roughly 2-10x improvement over traditional MapReduce for typical workloads, especially for complex queries and multi-stage jobs. The actual gain depends on the workload and the cluster configuration.
Tez Background
- MapReduce Limitations: Hadoop was originally built around the MapReduce programming model. Although this model is conceptually simple and easy to understand, it is inefficient for complex data processing tasks. MapReduce follows a strict “map-shuffle-reduce” execution flow, and each task stage (map or reduce) must write its intermediate results to disk. This frequent disk I/O causes significant performance overhead.
- Tez Improvements & Advantages: Apache Tez emerged to address these limitations. It introduces a more flexible execution engine that lets developers build complex data processing DAGs (Directed Acyclic Graphs) instead of being limited to fixed map-reduce stages.
Core Explanation
Tez further splits the MapTask and ReduceTask into smaller units. A Tez task consists of three stages, Input, Processor and Output, which can be combined to express any complex Map or Reduce operation.
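The Input, Processor, Output split can be illustrated with a small sketch. It is plain Python, not the Tez API: `run_task`, `map_like`, `reduce_like` and `shuffle_buffer` are hypothetical names chosen for this example, and the in-memory list only stands in for the edge that connects two tasks.

```python
from collections import defaultdict

def run_task(read_input, process, write_output):
    """Run one task: read from the input, apply the processor, hand results to the output."""
    write_output(process(read_input()))

def map_like(lines):
    """Processor that behaves like a map step: emit (word, 1) for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_like(pairs):
    """Processor that behaves like a reduce step: sum the counts per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

shuffle_buffer = []  # hypothetical stand-in for the edge between the two tasks
result = []          # hypothetical stand-in for the final output

# First task: its output is the buffer that the second task uses as input.
run_task(lambda: ["a b a", "b c"], map_like, shuffle_buffer.extend)
# Second task: same three-stage shape, different processor.
run_task(lambda: shuffle_buffer, reduce_like, result.extend)

print(result)  # [('a', 2), ('b', 2), ('c', 1)]
```

Both the map-like and the reduce-like step use the same three-stage shape, and only the processor differs. That is the point of the decomposition.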
Tez is an open-source computing framework built on Hadoop YARN that improves job execution efficiency by optimizing data processing flows. Compared to the traditional MapReduce framework, its core advantage is the ability to turn multiple interdependent jobs into a single DAG (Directed Acyclic Graph) job. This brings the following improvements:
- Data Processing Flow Optimization:
  - Eliminates the redundant HDFS reads and writes between jobs that a chain of traditional MapReduce jobs incurs
  - Transfers intermediate data directly between stages (in memory where possible) rather than through HDFS, reducing disk I/O overhead
  - Schedules tasks with knowledge of the whole DAG, so dependencies can be identified and optimized
- Performance Improvement:
  - For small tasks (simple data transformations or aggregation queries), roughly 2-3x
  - For complex large tasks (ETL flows involving multi-table joins or complex calculations), the gain is larger, up to 7-10x
- Applicable Scenarios:
  - Interactive queries (Hadoop query tools such as Hive and Pig)
  - Complex ETL data processing flows
  - Multi-step data processing in machine learning feature engineering
  - Data analysis tasks requiring low-latency response
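The saving from merging dependent jobs into one DAG can be made concrete with a deliberately simplified model. The sketch assumes that every MapReduce job in a chain persists its output to HDFS, and that a single DAG writes to HDFS only at its final stage. The function names are hypothetical and the numbers count writes only, not run time.

```python
def hdfs_writes_mapreduce_chain(num_stages: int) -> int:
    """Assumption: each stage runs as its own job and persists its output to HDFS."""
    return num_stages

def hdfs_writes_single_dag(num_stages: int) -> int:
    """Assumption: only the last stage of the DAG writes its output to HDFS."""
    return 1 if num_stages > 0 else 0

def hdfs_writes_saved(num_stages: int) -> int:
    """Intermediate HDFS writes that disappear when the chain becomes one DAG."""
    return hdfs_writes_mapreduce_chain(num_stages) - hdfs_writes_single_dag(num_stages)

for stages in (1, 3, 5):
    print(stages, hdfs_writes_saved(stages))
# 1 0
# 3 2
# 5 4
```

Under this model a single-stage job gains nothing, which is consistent with the larger improvements being reported for multi-stage workloads.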
How Tez Works
- DAG Structure: In Tez, a data processing job is represented as a DAG (Directed Acyclic Graph), where each node represents a processing task and each edge represents the direction of data flow. Unlike MapReduce’s fixed map and reduce stages, Tez can define any number of task nodes and data flows, which makes it more flexible and efficient.
- On-demand Computing Model: Tez supports on-demand data loading and avoids unnecessary storage of intermediate results. Data can be passed between tasks in memory, which reduces disk operations and speeds up computation.
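To show how a DAG determines execution order, the following sketch groups vertices into waves that could run in parallel. It uses Python's standard `graphlib` module rather than Tez itself. The vertex names and the `execution_waves` helper are hypothetical, and real Tez scheduling also accounts for resources and data availability.

```python
from graphlib import TopologicalSorter

# Hypothetical query plan: each vertex maps to the set of vertices it depends on.
dag = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
    "sort": {"aggregate"},
}

def execution_waves(graph):
    """Return lists of vertices; every vertex in a wave has all its inputs finished."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # vertices whose predecessors are done
        waves.append(ready)
        sorter.done(*ready)                 # mark the wave finished to unlock successors
    return waves

print(execution_waves(dag))
# [['scan_orders', 'scan_users'], ['join'], ['aggregate'], ['sort']]
```

The two scans have no dependency on each other, so they fall into the same wave. A fixed map-then-reduce layout has no direct way to express a join of two independent inputs followed by further stages.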
Tez Characteristics
- Efficient Resource Management: Tez is deeply integrated with YARN (Yet Another Resource Negotiator) and uses its resource scheduling to allocate and use cluster resources more intelligently. By monitoring workload changes in real time (such as data volume and computational complexity), it dynamically adjusts the allocation of CPU, memory and other resources.
- Reusable Containers: Tez implements a container reuse mechanism. In the YARN architecture, a container is the basic unit of resource allocation (with a fixed CPU and memory quota). Traditional frameworks like MapReduce request a new container for each task, while Tez allows the same container to be reused across multiple tasks (e.g., after the Map phase completes, its container can be used directly for the Reduce phase).
- Latency Optimization: Tez reduces processing latency through two core techniques: 1) in-memory data pipelining, which avoids writing intermediate data to HDFS as frequently as MapReduce does; 2) task topology optimization, which automatically selects the shortest execution path.
- Fault Tolerance: Tez provides multi-level fault tolerance: 1) task-level retry, where failed tasks are retried automatically up to a configurable number of attempts (tez.am.task.max.failed.attempts); 2) partial recomputation from the failure point, so only the affected sub-tasks are re-executed; 3) speculative execution to handle slow nodes.
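The effect of container reuse described above can be sketched as a toy simulation. It assumes tasks run in waves of at most `max_parallel` at a time and that a finished container becomes idle and can take the next task. `count_allocations` and its parameters are hypothetical, and the model ignores container sizes, idle timeouts and locality.

```python
def count_allocations(num_tasks: int, max_parallel: int, reuse: bool) -> int:
    """Count how many containers must be requested to run all tasks."""
    idle = 0        # containers that finished a task and are waiting for work
    allocated = 0   # containers requested from the resource manager
    remaining = num_tasks
    while remaining > 0:
        wave = min(max_parallel, remaining)
        for _ in range(wave):
            if reuse and idle > 0:
                idle -= 1        # hand the task to an already running container
            else:
                allocated += 1   # no idle container available: request a new one
        remaining -= wave
        if reuse:
            idle += wave         # every container of this wave is idle again
    return allocated

print(count_allocations(10, 4, reuse=False))  # 10
print(count_allocations(10, 4, reuse=True))   # 4
```

In this model the number of container requests drops from one per task to one per parallel slot, so the startup cost of a container is paid far less often.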
Installation & Deployment
Download the package apache-tez-0.9.2-bin.tar.gz and extract it:
tar -zxvf apache-tez-0.9.2-bin.tar.gz
cd apache-tez-0.9.2-bin/share
Upload the Tez package to HDFS:
hdfs dfs -mkdir -p /user/tez
hdfs dfs -put tez.tar.gz /user/tez
Create a tez-site.xml file under $HADOOP_HOME/etc/hadoop/ with the following configuration:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Location of the Tez package on HDFS -->
  <property>
    <name>tez.lib.uris</name>
    <value>hdfs://hadoop1:9000/user/tez/tez.tar.gz</value>
  </property>
</configuration>
Save the file and copy it to all cluster nodes.
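Before distributing the file, it can help to confirm that tez.lib.uris parses as expected and to extract the HDFS path to compare against the uploaded package. The following is a minimal sketch using only the Python standard library. `read_properties` and `hdfs_path_of` are hypothetical helper names, and the embedded XML simply repeats the configuration shown above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

TEZ_SITE = """<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>hdfs://hadoop1:9000/user/tez/tez.tar.gz</value>
  </property>
</configuration>
"""

def read_properties(xml_text: str) -> dict[str, str]:
    """Parse a Hadoop-style configuration file into a name -> value mapping."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props

def hdfs_path_of(uri: str) -> str:
    """Return only the path part of a URI, e.g. to pass to `hdfs dfs -ls`."""
    return urlparse(uri).path

props = read_properties(TEZ_SITE)
print(props["tez.lib.uris"])                # hdfs://hadoop1:9000/user/tez/tez.tar.gz
print(hdfs_path_of(props["tez.lib.uris"]))  # /user/tez/tez.tar.gz
```

To check a real file, read its contents and pass them to `read_properties` in place of `TEZ_SITE`. The printed path should match what was uploaded with hdfs dfs -put.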
Environment Variables
Add the following environment variables on the client node:
vim /etc/profile
export TEZ_CONF_DIR=$HADOOP_CONF_DIR
export TEZ_JARS=/opt/apps/tez/*:/opt/apps/tez/lib/*
export HADOOP_CLASSPATH=$TEZ_CONF_DIR:$TEZ_JARS:$HADOOP_CLASSPATH
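A wrong TEZ_JARS path typically surfaces later as a missing-class error, so a quick check of the classpath entries can save time. The sketch below assumes a Linux client (entries separated by `:`) and treats a trailing `/*` as the Java classpath wildcard for all jars in a directory. `missing_classpath_entries` is a hypothetical helper, not part of Hadoop or Tez.

```python
import glob
import os

def missing_classpath_entries(classpath: str) -> list[str]:
    """Return the classpath entries that do not resolve to anything on disk."""
    missing = []
    for entry in classpath.split(":"):  # ':' is the separator on Linux clients
        if not entry:
            continue  # skip empty segments such as a trailing ':'
        if entry.endswith("/*"):
            # Wildcard entry: expect at least one jar directly inside the directory.
            if not glob.glob(os.path.join(entry[:-2], "*.jar")):
                missing.append(entry)
        elif not os.path.exists(entry):
            missing.append(entry)
    return missing

# Check whatever the current shell exported; prints [] if nothing is missing.
print(missing_classpath_entries(os.environ.get("HADOOP_CLASSPATH", "")))
```

Run it in a fresh login shell (or after source /etc/profile) so that the exported variables are the ones actually in effect.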
One-time Configuration
Enable Tez for the current Hive session:
hive
set hive.execution.engine=tez;
Permanent Configuration
To use Tez by default, modify the Hive configuration file:
vim $HIVE_HOME/conf/hive-site.xml
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
Tez Integration with Hive, Pig
- Hive on Tez: Hive is a SQL-based data warehouse tool that originally used MapReduce as its underlying engine. With Tez as the engine, Hive query performance improves significantly, especially for complex queries.
- Pig on Tez: Pig is a data-flow-oriented programming language, typically used for analyzing and processing large-scale data. Tez can also serve as Pig’s underlying engine, greatly improving the execution efficiency of Pig scripts.
Tez Advantages
- High Performance: By reducing disk I/O, optimizing task parallelism and reusing resources, Tez significantly improves data processing performance, especially for complex queries and data flow processing.
- Flexibility: Tez lets users build arbitrarily complex DAGs for their specific data processing needs, removing MapReduce’s fixed-stage limitation.
- Scalability: Tez performs well in large-scale data processing environments and is suited to large, complex batch and interactive workloads in big data clusters.
Usage Scenarios
- Data Warehouse Query Acceleration: Many companies using Hive have switched to Tez to accelerate SQL queries, especially for large datasets and complex operations.
- Batch Processing Optimization: Tez’s DAG model makes it a good fit for complex batch jobs, such as multi-stage data cleaning, transformation and loading (ETL) workflows.
- Real-time or Near Real-time Processing: Tez can serve scenarios that need low latency, such as near real-time data analysis and online reporting.
Tez Limitations
- Learning Curve: Although Tez is more flexible and efficient than MapReduce, it is also more complex and requires developers to understand the DAG model and its configuration.
- Task Complexity: For very simple tasks, the performance improvement from Tez may not be noticeable, so Tez is better suited to complex, multi-stage jobs.
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Job fails immediately on start: can’t find tez.tar.gz / FileNotFoundException | The HDFS path in tez.lib.uris doesn’t exist, or the filename doesn’t match | Run hdfs dfs -ls /user/tez and check tez-site.xml; upload the package under a consistent filename and make sure tez.lib.uris=hdfs://…/user/tez/tez.tar.gz matches the actual path |
| Hive still uses MR after switching to tez | hive.execution.engine did not take effect (session setting missing or config file not loaded) | In Hive, run set hive.execution.engine; to see the current value. For one session, run set hive.execution.engine=tez; for a permanent change, write it to hive-site.xml and restart the relevant services/clients |
| NoClassDefFoundError: org/apache/tez/… | Client classpath doesn’t include the Tez jars, or the TEZ_JARS path is wrong | Run echo $HADOOP_CLASSPATH and check /opt/apps/tez/lib; correct the TEZ_JARS path so it matches the Tez extraction directory, then re-login or source /etc/profile |
| Only some nodes can run jobs; others report missing classes/config | tez-site.xml not distributed to all nodes, or versions differ | Compare $HADOOP_HOME/etc/hadoop/tez-site.xml on each node; distribute the same tez-site.xml to all nodes (including edge nodes/HS2 nodes) and verify consistency |
| Tez AM fails to start; container retries repeatedly | Insufficient YARN resources, queue limits, ACLs, or mismatched AM memory settings | Check the ApplicationMaster failure reason in the YARN RM UI/logs; adjust queue resources, reduce concurrency or increase AM/task memory, and check the queue ACL |
| Hive reports a TezTask execution error (return code 2 etc.) | Upstream dependencies not ready (Tez lib, classpath, permissions), or the SQL triggers a large shuffle | Check the HS2/Tez AM logs; first run a minimal query to verify the end-to-end path, then increase complexity gradually; if needed, adjust Tez/Hive memory and parallelism parameters |
| HDFS permission error: can’t read /user/tez | Permissions/owner of the uploaded directory don’t allow the running user to read it | Run hdfs dfs -ls -h /user/tez to check permissions and owner; grant the executing user/group read permission, or put the Tez package in a publicly readable directory and standardize the permission config |
| Written but not effective (especially profile) | Env vars not loaded, line break caused export exception | env |