I. MapReduce → Spark/Tez

Reasons for Phase-out

  • Intermediate results must be persisted to HDFS between stages, incurring heavy disk I/O
  • Coarse-grained task scheduling, with per-task startup latency of several seconds
  • Cannot support low-latency interactive queries
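
The staged model above can be sketched in plain Python (no Hadoop involved): map output is fully materialized, regrouped in a shuffle, and only then reduced — in real MapReduce each of these hand-offs goes through disk.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) pairs; in Hadoop this output is spilled
    # to local disk before the shuffle can begin.
    return [(w, 1) for line in lines for w in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key; in Hadoop this means sorting and
    # transferring map output across the network, again via disk.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's grouped values.
    return {k: sum(vs) for k, vs in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big compute"])))
print(counts)  # {'big': 2, 'data': 1, 'compute': 1}
```

Every stage must finish and materialize its output before the next starts, which is exactly the I/O overhead Spark avoids by keeping intermediate data in memory.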

Alternative: Spark

  • In-memory computing
  • DAG scheduling
  • Lazy evaluation
  • Lineage-based fault tolerance
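
Lazy evaluation and lineage-based fault tolerance can be illustrated with a toy RDD in plain Python (the class and names here are illustrative, not Spark's actual API): transformations only record a lineage pointer, and an action replays the chain — the same mechanism by which a lost partition would be recomputed.

```python
class ToyRDD:
    """Toy sketch of an RDD: each node remembers its parent and the
    transformation that produced it (its lineage)."""

    def __init__(self, data=None, parent=None, fn=None):
        self._data, self._parent, self._fn = data, parent, fn

    def map(self, fn):
        # Lazy transformation: record it, compute nothing yet.
        return ToyRDD(parent=self, fn=lambda xs: [fn(x) for x in xs])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda xs: [x for x in xs if pred(x)])

    def collect(self):
        # Action: walk the lineage back to the source data and replay
        # the recorded transformations entirely in memory.
        if self._parent is None:
            return self._data
        return self._fn(self._parent.collect())

rdd = ToyRDD(data=[1, 2, 3, 4])
result = rdd.map(lambda x: x * x).filter(lambda x: x > 4).collect()
print(result)  # [9, 16]
```

Because nothing runs until `collect()`, the scheduler can see the whole transformation DAG at once and plan it, instead of executing one rigid map/reduce pair at a time.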

Performance Improvement

  • 100TB sort benchmark (2014 Daytona GraySort): Spark finished in 23 minutes vs. MapReduce's previous record of 72, using roughly a tenth of the machines
  • Iterative algorithms such as PageRank: up to ~100x speedup when the working set fits in memory

II. Storm → Flink

Reasons for Phase-out

  • Only supports “at-least-once” message processing semantics, so duplicate records cannot be ruled out
  • Lacks event-time windows

Alternative: Flink

  • Event-time window processing (with watermarks for out-of-order data)
  • Exactly-once state semantics via distributed snapshots (a Chandy-Lamport variant)
  • Unified stream-batch architecture
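
The core idea of event-time windowing can be sketched in plain Python (a simplified illustration, not Flink's API): events are bucketed by the timestamp they carry, not by when they arrive, so late or out-of-order records still land in the right window.

```python
from collections import defaultdict

def tumbling_windows(events, size_ms):
    """Assign (timestamp_ms, value) events to fixed-size tumbling
    windows keyed by window start time -- event time, not arrival time."""
    windows = defaultdict(list)
    for ts, value in events:
        start = ts - ts % size_ms  # window this event's timestamp falls into
        windows[start].append(value)
    return dict(windows)

# The 1500ms event arrives *after* the 2500ms one, yet it still
# lands in the correct [1000, 2000) window.
events = [(500, "a"), (2500, "c"), (1500, "b")]
print(tumbling_windows(events, 1000))
```

A real engine additionally needs watermarks to decide when a window is complete and can be emitted; this sketch only shows the assignment step.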

III. Apache Pig and Hive

Pig Limitations

  • Poor script readability
  • Complex debugging
  • Steep learning curve

Hive Limitations

  • Minute-level query latency (often 5-10 minutes) when backed by MapReduce
  • High MapReduce disk I/O overhead
  • Not suitable for interactive analysis

Current Status

  • Pig: Essentially exited production environments
  • Hive: Survives mainly as a metadata management layer (Hive Metastore)

IV. Traditional Data Warehouse → Lakehouse Architecture

Traditional Data Warehouse Problems

  • Vertical (scale-up) scaling leads to steep, nonlinear cost growth
  • Only handles structured data

Data Lake Problems

  • “Data swamp” phenomenon
  • Lacks ACID transaction support

Lakehouse Solution

  • Delta Lake / Apache Iceberg
  • Unified metadata management
  • Query engines: Photon, Spark SQL
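
How table formats like Delta Lake bring ACID to a data lake can be sketched in plain Python (a greatly simplified illustration of the transaction-log idea, not Delta's actual protocol or API): each commit is a numbered JSON file published atomically, and replaying the log in order yields the table's current set of data files.

```python
import json
import os
import tempfile

def commit(log_dir, actions, version):
    # Write the commit to a temp file, then atomically rename it into
    # place: readers can never observe a half-written commit.
    path = os.path.join(log_dir, f"{version:020d}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(actions, f)
    os.rename(tmp, path)

def table_state(log_dir):
    # Replay commits in version order; add/remove actions track which
    # data files currently make up the table.
    files = set()
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name)) as f:
            for action in json.load(f):
                if action["op"] == "add":
                    files.add(action["file"])
                elif action["op"] == "remove":
                    files.discard(action["file"])
    return files

log_dir = tempfile.mkdtemp()
commit(log_dir, [{"op": "add", "file": "part-0.parquet"}], 0)
commit(log_dir, [{"op": "remove", "file": "part-0.parquet"},
                 {"op": "add", "file": "part-1.parquet"}], 1)
print(table_state(log_dir))  # {'part-1.parquet'}
```

Because the log, not the raw files, defines the table, writers can replace files transactionally and readers always see a consistent snapshot — the property plain data lakes lack.
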

V. Summary

  Old Technology   Replaced By        Reason
  MapReduce        Spark/Tez          Disk I/O overhead, high latency
  Storm            Flink              Lacks exactly-once semantics
  Pig              Spark SQL          Poor readability, steep learning curve
  Hive (MR)       Spark SQL/Presto   High query latency
  Traditional DW   Lakehouse          Limited scalability, restricted data types

Current Industry Status

  • The large majority of new big data platforms choose Spark as the batch engine
  • Flink becomes the mainstream for real-time computing
  • Hive transitions to metadata management layer