Big Data 193 - Apache Tez Practice

Basic Introduction

Tez (Hindi for “speed”) is a data processing framework that runs in the Hadoop ecosystem and is designed to speed up batch processing and interactive queries. It is a top-level open-source project of the Apache Software Foundation (Apache Tez), was originally developed by Hortonworks, and has become an important processing component of that ecosystem.

Tez’s core design goal is to serve as an alternative execution engine to MapReduce, significantly improving processing efficiency through a more flexible execution model. Compared to traditional MapReduce, Tez has these significant advantages:

Uses a Directed Acyclic Graph (DAG) execution model, allowing more complex data processing pipelines
Supports dynamic task scheduling and resource allocation
Reduces disk I/O for intermediate results
Provides finer-grained control over task execution

In practical applications, Tez is widely used in:

Hive query acceleration (as execution engine)
Pig script processing
Complex ETL flows
Interactive analysis queries

In terms of technical architecture, Tez includes these key components:

Tez API: Provides programming interface
Tez Runtime: Execution engine core
DAG Scheduler: Manages task execution order
Resource Manager Interface: Integrates with YARN

Tez is commonly reported to run typical workloads roughly 2-10x faster than traditional MapReduce, with the largest gains on complex queries and multi-stage jobs. The actual improvement depends on the workload and the cluster.

Tez Background

MapReduce Limitations: Hadoop was originally built around the MapReduce programming model. Although the model is conceptually simple and easy to understand, it is inefficient for complex data processing tasks. MapReduce enforces a strict “map-shuffle-reduce” flow: map output is written to local disk, and the output of each job is written to HDFS before the next job can read it. This frequent disk I/O causes significant performance overhead.
Tez Improvements & Advantages: Apache Tez was created to address these limitations. It introduces a more flexible execution engine that lets developers build complex data processing DAGs (Directed Acyclic Graphs) instead of being limited to fixed map and reduce stages.

Core Explanation

Tez further splits MapTask and ReduceTask into smaller units. A Tez task consists of Input, Processor, and Output stages, which together can express any Map or Reduce operation, however complex.

Tez is an open-source computing framework built on Hadoop YARN that improves job execution efficiency by optimizing data processing flows. Compared with the traditional MapReduce framework, its core advantage is the ability to combine multiple interdependent jobs into a single DAG (Directed Acyclic Graph) job. This brings the following improvements:

Data Processing Flow Optimization:
- Eliminates the redundant HDFS reads and writes between consecutive jobs in traditional MapReduce
- Passes intermediate data directly between tasks, in memory where possible, reducing disk I/O overhead
- Schedules tasks with knowledge of the whole DAG, so dependencies can be identified and optimized
Performance Improvement:
- For small tasks (simple data transformations or aggregation queries), performance improves by roughly 2-3x
- For complex large tasks (ETL flows involving multi-table joins or complex calculations), the improvement is larger, up to roughly 7-10x
Applicable Scenarios:
- Interactive queries (SQL-on-Hadoop scenarios such as Hive) and Pig scripts
- Complex ETL data processing flows
- Multi-step data processing in machine learning feature engineering
- Data analysis tasks requiring low-latency responses
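
The difference between chaining separate jobs and running one DAG can be illustrated with a loose shell analogy. This is only a sketch, not Tez code: the stage functions (`tokenize`, `count_words`, `top_word`) are hypothetical, temporary files stand in for the HDFS writes between jobs, and pipes stand in for direct data movement between DAG stages.

```shell
# Hypothetical processing stages; each reads stdin and writes stdout.
tokenize()    { tr -s '[:space:]' '\n'; }
count_words() { sort | uniq -c | awk '{print $2, $1}'; }
top_word()    { sort -k2,2nr -k1,1 | head -n 1; }

# Chained-jobs style: every stage persists its full result to a file
# before the next stage starts (analogous to HDFS writes between MR jobs).
run_as_separate_jobs() {
  local dir
  dir="$(mktemp -d)"
  tokenize > "$dir/stage1"
  count_words < "$dir/stage1" > "$dir/stage2"
  top_word < "$dir/stage2"
  rm -rf "$dir"
}

# Single-DAG style: stages are wired together directly, so no
# intermediate result is written out between them.
run_as_single_dag() {
  tokenize | count_words | top_word
}

# Usage:
#   printf 'a b a\nc a b\n' | run_as_single_dag
```

Both functions compute the same result; the second simply skips the intermediate files, which is the kind of saving the list above describes.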

How Tez Works

DAG Structure: In Tez, a data processing job is represented as a DAG (Directed Acyclic Graph) in which each node is a processing task and each edge is the direction of data flow. Unlike MapReduce’s fixed map and reduce stages, Tez can define any number of task nodes and data flows, which makes it more flexible and efficient.
On-demand Computing Model: Tez loads data on demand and avoids storing intermediate results unnecessarily. Data can be transferred directly in memory where possible, which reduces disk operations and speeds up computation.
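
To make the DAG idea concrete, the standard `tsort` utility can order a small graph so that every node runs only after its upstream inputs. This is an illustration of DAG ordering in general, not of the Tez API; the vertex names are invented for the example.

```shell
# Hypothetical DAG: two scans feed a join, which feeds an aggregate.
# Each line is one edge, written as "upstream downstream".
dag_edges() {
  cat <<'EOF'
scan_orders join
scan_users join
join aggregate
aggregate write_output
EOF
}

# tsort prints one valid execution order: every vertex appears
# after all of the vertices it depends on.
execution_order() {
  dag_edges | tsort
}

execution_order
```

The two scan vertices have no dependency on each other, so a scheduler is free to run them in parallel, which a fixed map-then-reduce layout cannot express directly.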

Tez Characteristics

Efficient Resource Management: Tez is deeply integrated with YARN (Yet Another Resource Negotiator) and relies on it to allocate cluster resources. By monitoring workload changes at runtime (such as data volume and computational complexity), it adjusts how CPU, memory, and other resources are allocated.
Reusable Containers: Tez implements a container reuse mechanism. In the YARN architecture, a container is the basic unit of resource allocation (with a fixed CPU and memory quota). Traditional frameworks such as MapReduce request a new container for each task, while Tez lets the same container run multiple tasks in turn (e.g., after the Map phase completes, its container can be used directly for the Reduce phase), which saves container startup time.
Latency Optimization: Tez reduces processing latency through two core techniques: 1) in-memory data pipelining, which avoids writing intermediate data to HDFS as often as MapReduce does; 2) task topology optimization, which automatically selects the shortest execution path.
Fault Tolerance: Tez provides multi-level fault tolerance: 1) task-level retry (failed tasks are retried automatically, up to a configurable number of attempts set by tez.am.task.max.failed.attempts); 2) checkpoint-based partial recomputation, where only the sub-tasks after the failure point are re-executed; 3) speculative execution to handle slow nodes.
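
Task-level retry can be sketched as a bounded retry loop. This is a generic illustration rather than the way Tez implements it; `run_with_retry` and the attempt limit passed to it are hypothetical.

```shell
# Run a command up to max_attempts times and stop at the first success.
# Usage: run_with_retry MAX_ATTEMPTS COMMAND [ARGS...]
run_with_retry() {
  local max_attempts="$1"
  shift
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      echo "succeeded on attempt $attempt"
      return 0
    fi
    echo "attempt $attempt failed" >&2
    attempt=$((attempt + 1))
  done
  # Every attempt failed: report failure to the caller.
  return 1
}

# Usage:
#   run_with_retry 3 some_task_command
```

Only the failed unit of work is repeated here, which mirrors the point that a failed task does not force the whole job to restart.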

Installation & Deployment

Download the package apache-tez-0.9.2-bin.tar.gz, then extract it:

tar -zxvf apache-tez-0.9.2-bin.tar.gz
cd apache-tez-0.9.2-bin/share

Upload the Tez package (tez.tar.gz in the share directory) to HDFS:

hdfs dfs -mkdir -p /user/tez
hdfs dfs -put tez.tar.gz /user/tez
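
Before moving on, it can help to confirm that the archive really landed at the expected path. The sketch below uses `hdfs dfs -test -e`, which returns a zero exit status when the path exists. The `check_tez_archive` function and the `HDFS_BIN` variable are assumptions of this example, added so the client command can be swapped out.

```shell
# Path used by the upload step above.
TEZ_ARCHIVE="/user/tez/tez.tar.gz"
# Hypothetical override hook; defaults to the regular hdfs client.
HDFS_BIN="${HDFS_BIN:-hdfs}"

# Report whether the given HDFS path exists.
check_tez_archive() {
  if "$HDFS_BIN" dfs -test -e "$1"; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Usage on a node with a configured Hadoop client:
#   check_tez_archive "$TEZ_ARCHIVE"
```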

Create a tez-site.xml file under $HADOOP_HOME/etc/hadoop/ with the following configuration:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Specify tez package file on hdfs -->
  <property>
    <name>tez.lib.uris</name>
    <value>hdfs://hadoop1:9000/user/tez/tez.tar.gz</value>
  </property>
</configuration>

Save the file and copy it to all cluster nodes.
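
A quick way to check what a node actually has configured is to read the value back out of the file. The helper below is a minimal sketch: `get_property` is a hypothetical name, and the parsing assumes that `<name>` and `<value>` sit on separate lines, as in the file above. It is not a general XML parser.

```shell
# Print the <value> that follows <name>KEY</name> in a Hadoop-style XML file.
# Usage: get_property FILE KEY
get_property() {
  local file="$1" key="$2"
  awk -v key="$key" '
    # Remember that the requested property name was seen.
    index($0, "<name>" key "</name>") { found = 1; next }
    # The next <value> line belongs to that property.
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$file"
}

# Usage:
#   get_property "$HADOOP_HOME/etc/hadoop/tez-site.xml" tez.lib.uris
```

Running the same command on every node and comparing the output is one simple way to spot a node that was missed during distribution.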

Environment Variables

On the client node, add the following to /etc/profile (the paths assume Tez was extracted to /opt/apps/tez):

vim /etc/profile

export TEZ_CONF_DIR=$HADOOP_CONF_DIR
export TEZ_JARS=/opt/apps/tez/*:/opt/apps/tez/lib/*
export HADOOP_CLASSPATH=$TEZ_CONF_DIR:$TEZ_JARS:$HADOOP_CLASSPATH
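
After reloading the profile, a small check can confirm that the Tez entries made it into the classpath. This is a sketch under the assumption that the two entries match the TEZ_JARS value above; `classpath_has` and `missing_tez_entries` are hypothetical helper names.

```shell
# Return 0 if the colon-separated list in $1 contains the entry $2 exactly.
classpath_has() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print every expected Tez entry that is absent from the given classpath.
# The entries are quoted so the shell does not expand the wildcards.
missing_tez_entries() {
  local classpath="$1" entry
  for entry in '/opt/apps/tez/*' '/opt/apps/tez/lib/*'; do
    classpath_has "$classpath" "$entry" || echo "missing: $entry"
  done
}

# Usage after `source /etc/profile`:
#   missing_tez_entries "$HADOOP_CLASSPATH"
```

No output means both entries are present.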

One-time Configuration

To use Tez for the current session only, start the Hive CLI and set the engine:

hive

set hive.execution.engine=tez;

Permanent Configuration

To make Tez the default engine, edit the Hive configuration file:

vim $HIVE_HOME/conf/hive-site.xml

<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
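
To confirm which engine a session actually uses, the current value can be queried with `set hive.execution.engine;`, which prints a `key=value` line. The helper below is a small sketch for pulling the value out of that output; `parse_engine` is a hypothetical name, and the exact log noise around the line may vary between Hive versions.

```shell
# Read Hive output on stdin and print only the configured engine name.
parse_engine() {
  sed -n 's/^hive\.execution\.engine=//p'
}

# Usage (requires a working Hive client):
#   hive -e 'set hive.execution.engine;' 2>/dev/null | parse_engine
```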

Tez Integration with Hive, Pig

Hive on Tez: Hive is a SQL-based data warehouse tool that originally used MapReduce as its underlying engine. Since Tez support was introduced, Hive on Tez has significantly improved query performance, especially for complex queries.
Pig on Tez: Pig is a data-flow programming language, typically used to analyze and process large-scale data. Tez can also serve as Pig’s underlying engine, greatly improving the execution efficiency of Pig scripts.

Tez Advantages

High Performance: By reducing disk IO, optimizing task parallelization and reusing resources, Tez significantly improves data processing performance, especially in complex queries and data flow processing.
Flexibility: Tez allows users to build arbitrarily complex DAGs based on specific data processing needs, breaking MapReduce’s fixed stage limitations.
Scalability: Tez performs well in large-scale data processing environments and suits large, complex batch and interactive workloads on big data clusters.

Usage Scenarios

Data Warehouse Query Acceleration: Many companies using Hive have turned to Tez to accelerate SQL queries, especially scenarios involving large datasets and complex operations.
Batch Processing Optimization: Tez’s DAG model makes it ideal for executing complex batch processing tasks, like multi-stage data cleaning, transformation and loading (ETL) workflows.
Low-latency or Near Real-time Processing: Tez can be used where low-latency responses are required, such as interactive data analysis and online reporting, although it remains a batch-oriented engine rather than a streaming system.

Tez Limitations

Learning Curve: Although Tez is more flexible and efficient than MapReduce, it is also more complex and requires developers to understand the DAG model and its configuration.
Task Complexity: For very simple tasks, the performance improvement from Tez may not be noticeable, so Tez is better suited to complex, multi-stage workloads.

Error Quick Reference

Symptom	Root Cause	Diagnosis	Fix
Job fails immediately on start: cannot find tez.tar.gz / FileNotFoundException	The HDFS path in tez.lib.uris does not exist, or the filename is inconsistent	Run hdfs dfs -ls /user/tez and verify tez-site.xml	Upload the package and use one consistent filename; ensure tez.lib.uris=hdfs://…/user/tez/tez.tar.gz matches the actual path
Hive still runs on MR after switching to Tez	hive.execution.engine is not in effect (session setting or config file not loaded)	In Hive, run set hive.execution.engine;	For one session, run set hive.execution.engine=tez; to make it permanent, write it to hive-site.xml and restart the relevant services/clients
NoClassDefFoundError: org/apache/tez/…	The client classpath does not include the Tez jars, or the TEZ_JARS path is wrong	Run echo $HADOOP_CLASSPATH and check /opt/apps/tez/lib	Correct the TEZ_JARS path; align the Tez extraction directory with the environment variables; log in again or run source /etc/profile
Only some nodes can run jobs; others report missing classes/config	tez-site.xml was not distributed to all nodes, or the versions are inconsistent	Compare $HADOOP_HOME/etc/hadoop/tez-site.xml on each node	Distribute the same tez-site.xml to all nodes (including edge nodes and HS2 nodes) and verify consistency
Tez AM fails to start; container retries repeatedly	Insufficient YARN resources, queue limits, ACLs, or mismatched AM memory settings	Check the YARN RM UI/logs for the ApplicationMaster failure reason	Adjust queue resources/concurrency; reduce concurrency or increase AM/task memory; check queue ACLs
Hive reports a TezTask execution error (return code 2, etc.)	Upstream dependencies are not ready (Tez lib, classpath, permissions), or the SQL triggers a large shuffle	Verify with a minimal SQL statement first; check the HS2/Tez AM logs	Run a simple query first to verify the end-to-end path, then increase complexity gradually; if needed, adjust Tez/Hive memory and parallelism parameters
HDFS permission error: cannot read /user/tez	The directory permissions/owner do not allow the running user to read it	Run hdfs dfs -ls -h /user/tez and check permissions and owner	Grant read permission to the executing user/group, or place the Tez package in a publicly readable directory and standardize the permission configuration
Settings written but not taking effect (especially /etc/profile)	Environment variables not loaded, or a line break corrupted an export statement	env