I. MapReduce → Spark/Tez
Reasons for Phase-out
- Intermediate results are persisted to disk (map output to local disk, results between chained jobs to HDFS), causing high I/O overhead
- Coarse-grained task scheduling; task startup takes several seconds
- Cannot support low-latency interactive queries
Alternative: Spark
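- In-memory computing: intermediate data can be cached in memory instead of being written to disk between stages
- DAG scheduling: a whole job is planned as a graph of stages rather than isolated map/reduce passes
- Lazy evaluation: transformations are only recorded; work runs when an action requests a result
- Lineage-based fault tolerance: lost partitions are recomputed from the recorded transformations instead of from replicas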
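The Spark properties listed above can be illustrated with a toy model. The sketch below is plain Python, not Spark: `LazyDataset`, `persist`, `lose_cache`, and `traced_square` are hypothetical names invented for this example. It shows transformations being recorded lazily as a chain (a degenerate DAG), an in-memory cache avoiding recomputation, and a lost cache being rebuilt from lineage.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD: stores how to compute data, not the data."""

    def __init__(self, source: Optional[List[int]] = None,
                 parent: Optional["LazyDataset"] = None,
                 op: Optional[Callable[[Iterable], Iterable]] = None,
                 name: str = "source"):
        self.source = source    # set only on the root dataset
        self.parent = parent    # lineage pointer to the upstream dataset
        self.op = op            # transformation applied to the parent's rows
        self.name = name
        self.cache: Optional[list] = None  # in-memory copy, filled by persist()

    # Transformations are lazy: they only add a node to the chain.
    def map(self, fn):
        return LazyDataset(parent=self, name="map",
                           op=lambda rows: (fn(r) for r in rows))

    def filter(self, pred):
        return LazyDataset(parent=self, name="filter",
                           op=lambda rows: (r for r in rows if pred(r)))

    def lineage(self) -> List[str]:
        """Return the recorded transformation chain, root first."""
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def _compute(self):
        if self.cache is not None:          # in-memory hit: skip upstream work
            return iter(self.cache)
        if self.parent is None:
            return iter(self.source)
        return self.op(self.parent._compute())

    def persist(self):
        """Materialize this dataset in memory."""
        self.cache = list(self._compute())
        return self

    def lose_cache(self):
        """Simulate losing the cached partition (e.g. a failed executor)."""
        self.cache = None

    def collect(self):
        """Action: triggers the actual computation."""
        return list(self._compute())


calls = []  # records every invocation of the map function


def traced_square(x):
    calls.append(x)
    return x * x


squares = LazyDataset(source=[1, 2, 3, 4]).map(traced_square)
evens = squares.filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)   # still 0: nothing has executed yet

first = evens.collect()            # runs the map over all four rows
squares.persist()                  # caches the mapped rows in memory
second = evens.collect()           # served from the cache, no new map calls
squares.lose_cache()               # drop the cache
recovered = evens.collect()        # recomputed from lineage
```

In this toy model the map function runs 12 times in total: 4 for the first action, 4 for `persist`, none for the cached action, and 4 again after the cache is lost. Real Spark applies the same idea per partition and across a cluster.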
Performance Improvement
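- 100 TB log analysis task: Spark reported as up to 100x faster than MapReduce; actual gains depend on how much of the working set fits in memory
- PageRank and other iterative algorithms: reported speedups of up to 1000x, since each iteration reuses cached data instead of rereading it from HDFS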
II. Apache Storm → Apache Flink
Reasons for Phase-out
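- Core API guarantees only “at-least-once” message processing (exactly-once requires the Trident micro-batch layer)
- Lacks native event-time windows in early versions
- Cannot guarantee duplicate-free results: replayed tuples may be processed more than once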
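To make the duplicate problem concrete, here is a minimal sketch in plain Python (not Storm code). It assumes a source that resends a message when its acknowledgement is lost; `deliver_at_least_once`, `fail_ack_ids`, and both counting functions are hypothetical names used only for illustration. A naive counter overcounts on replay, while a sink that deduplicates by message id does not.

```python
from collections import Counter


def deliver_at_least_once(messages, fail_ack_ids):
    """Yield each message, replaying it once if its ack was 'lost'.

    fail_ack_ids is an assumed set of message ids whose first ack never
    reaches the source, which forces a resend.
    """
    for msg_id, word in messages:
        yield msg_id, word
        if msg_id in fail_ack_ids:
            yield msg_id, word  # replay: downstream sees the record twice


def count_naive(stream):
    """Count words without any protection against replays."""
    counts = Counter()
    for _msg_id, word in stream:
        counts[word] += 1
    return counts


def count_deduplicated(stream):
    """Count words, skipping message ids that were already applied."""
    counts, seen = Counter(), set()
    for msg_id, word in stream:
        if msg_id in seen:
            continue
        seen.add(msg_id)
        counts[word] += 1
    return counts


messages = [(1, "error"), (2, "ok"), (3, "error")]
naive = count_naive(deliver_at_least_once(messages, fail_ack_ids={3}))
deduped = count_deduplicated(deliver_at_least_once(messages, fail_ack_ids={3}))
```

With message 3 replayed, `naive` counts "error" three times while `deduped` counts it twice. Tracking seen ids in the sink is one common workaround under at-least-once delivery; it pushes the deduplication burden onto application code.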
Alternative: Flink
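- Event-time window processing with watermarks to handle out-of-order data
- Exactly-once state semantics via distributed checkpoints (asynchronous barrier snapshots, a variant of the Chandy-Lamport algorithm)
- Unified stream-batch architecture: batch jobs are treated as bounded streams on the same runtime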
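Event-time windowing can be sketched in a few lines of plain Python. This is a simplified single-threaded model, not Flink's API: the function name and the parameters `window_size` and `max_out_of_orderness` are assumptions chosen for the example. Events are assigned to tumbling windows by their own timestamps, and a window is emitted once the watermark (largest event time seen minus the allowed out-of-orderness) passes the window's end.

```python
def tumbling_window_counts(events, window_size, max_out_of_orderness):
    """Count events per tumbling event-time window.

    events: iterable of (event_time, payload) in ARRIVAL order, which may
    differ from event-time order. Returns (emitted, late) where emitted is
    a list of (window_start, count) and late lists dropped event times.
    """
    open_windows = {}            # window start -> running count
    emitted = []                 # windows in the order they were closed
    late = []                    # events that arrived after their window closed
    max_event_time = None
    watermark = float("-inf")

    for event_time, _payload in events:
        start = event_time - event_time % window_size
        if start + window_size <= watermark:
            late.append(event_time)      # window already emitted: drop
            continue
        open_windows[start] = open_windows.get(start, 0) + 1

        # Advance the watermark from the largest event time seen so far.
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - max_out_of_orderness

        # Close every window whose end is at or before the watermark.
        for ws in sorted(w for w in open_windows if w + window_size <= watermark):
            emitted.append((ws, open_windows.pop(ws)))

    # End of input: flush the windows that are still open.
    for ws in sorted(open_windows):
        emitted.append((ws, open_windows[ws]))
    return emitted, late


# Arrival order differs from event-time order (8 arrives after 12, 3 after 15).
arrivals = [(1, "a"), (4, "b"), (12, "c"), (8, "d"), (15, "e"), (3, "f"), (27, "g")]
emitted, late = tumbling_window_counts(arrivals, window_size=10,
                                       max_out_of_orderness=3)
```

Here the event at time 8 arrives out of order but is still counted in window [0, 10), because the watermark is only 9 at that point. The event at time 3 arrives after that window has been emitted and is reported as late. A processing-time system would have assigned both by arrival time instead.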
III. Apache Pig and Hive → Spark SQL/Presto
Pig Limitations
- Poor script readability
- Difficult debugging
- Steep learning curve (Pig Latin is a separate language to learn)
Hive Limitations
- Query latency measured in minutes (5-10 minutes) when running on MapReduce
- Inherits MapReduce's high disk I/O overhead
- Not suitable for interactive analysis
Current Status
- Pig: largely retired from production environments
- Hive: now serves mainly as a metadata management layer (Hive Metastore)
IV. Traditional Data Warehouse → Lakehouse Architecture
Traditional Data Warehouse Problems
- Vertical (scale-up) architecture: cost grows steeply and non-linearly as capacity increases
- Only handles structured data
Data Lake Problems
- “Data swamp” phenomenon: ungoverned, undocumented data becomes hard to find and trust
- Lacks ACID transaction support
Lakehouse Solution
- Table formats: Delta Lake / Apache Iceberg (add ACID transactions on top of data lake storage)
- Unified metadata management
- Query engines: Photon, Spark SQL
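The way a table format adds ACID commits on top of immutable files can be sketched with a toy transaction log. This is an illustrative in-memory model only, not the actual Delta Lake or Iceberg protocol: `TableLog`, `CommitConflict`, and the file names are hypothetical. Each commit is an ordered list of add/remove actions, a writer must commit against the version it read (optimistic concurrency), and any past version can be reconstructed by replaying the log.

```python
class CommitConflict(Exception):
    """Raised when a writer commits against a stale table version."""


class TableLog:
    """Toy append-only transaction log over immutable data files."""

    def __init__(self):
        self.commits = []  # commits[i] holds the actions of version i

    def latest_version(self):
        return len(self.commits) - 1   # -1 means the table is empty

    def commit(self, read_version, actions):
        """Atomically append a commit if nobody else committed in between."""
        if read_version != self.latest_version():
            raise CommitConflict(
                f"read version {read_version}, "
                f"table is at {self.latest_version()}")
        self.commits.append(list(actions))
        return self.latest_version()

    def snapshot(self, version=None):
        """Replay the log up to `version` to get the set of live files."""
        if version is None:
            version = self.latest_version()
        files = set()
        for actions in self.commits[:version + 1]:
            for kind, path in actions:
                if kind == "add":
                    files.add(path)
                elif kind == "remove":
                    files.discard(path)
        return files


log = TableLog()
v0 = log.commit(-1, [("add", "part-0.parquet"), ("add", "part-1.parquet")])

# Two writers both read version 0. The first one rewrites part-0.
v1 = log.commit(0, [("remove", "part-0.parquet"), ("add", "part-2.parquet")])

# The second writer still holds version 0, so its commit is rejected.
conflict_detected = False
try:
    log.commit(0, [("add", "part-3.parquet")])
except CommitConflict:
    conflict_detected = True

current_files = log.snapshot()      # state after version 1
original_files = log.snapshot(0)    # "time travel" back to version 0
```

Readers only ever see files referenced by a committed version, so a half-written job is invisible. In this sketch the rejected writer would have to reread the table and retry; real table formats apply finer-grained conflict rules than this all-or-nothing check.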
Technology Evolution Trends
| Old Technology | Replaced By | Reason |
|---|---|---|
| MapReduce | Spark/Tez | Disk I/O overhead, high latency |
| Storm | Flink | No exactly-once guarantee in the core API, weak event-time support |
| Pig | Spark SQL | Poor readability, steep learning curve |
| Hive (MR) | Spark SQL/Presto | High query latency |
| Traditional DW | Lakehouse | Limited scalability, restricted data types |
Current Industry Status
- Spark is reportedly chosen by 90%+ of new big data platforms
- Flink has become the mainstream choice for real-time computing
- Hive has transitioned to a metadata management layer