1. International Big Data Development History

Beginning 1997

| Year | Milestone |
| --- | --- |
| 1997 | NASA researchers first used the term "big data" |
| 2001 | Gartner proposed the "3V" model (Volume, Variety, Velocity) |
| 2003 | Google published the GFS paper (distributed file system) |
| 2004 | Google published the MapReduce paper |
| 2005 | Hadoop framework born (Doug Cutting) |
| 2006 | Google published the Bigtable paper |
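
The MapReduce model from the 2004 paper can be illustrated with a toy word count in plain Python — a sketch of the programming model only, not Google's distributed implementation:

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every document
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: group values by key; Reducer: sum the counts per word
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big ideas", "data flows"]
counts = reduce_phase(map_phase(docs))
# counts == {'big': 2, 'data': 2, 'ideas': 1, 'flows': 1}
```

In a real cluster the map and reduce phases run on many machines in parallel, with the framework handling the shuffle, fault tolerance, and data locality.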

Turning Point 2008

  • Hadoop officially became a top-level Apache project
  • Facebook data processing reached 15PB/month
  • Ecosystem formed:
    • Storage: HBase, Cassandra
    • Processing: Hive, Pig, Spark
    • Collection: Flume, Sqoop
    • Coordination and scheduling: ZooKeeper, Oozie
    • Machine learning: Mahout

Mainstream After 2011

| Year | Milestone |
| --- | --- |
| 2011 | Apache Kafka open-sourced |
| 2012 | Apache Spark launched (in-memory computing, up to 100x speedup) |
| 2014 | Spark became a top-level Apache project |

Diversification 2015

  • Computing framework diversification:
    • Batch processing: Hadoop MapReduce, Spark
    • Interactive analysis: Presto, Impala
    • Real-time stream processing: Spark Streaming, Flink, Storm
  • Market size: $10.3 billion in 2013 → $193.1 billion in 2019

2. Domestic Big Data Industry Development

Chinese Enterprise Open Source Contributions

  • Apache Kylin: created by the eBay China team; in 2015 it became the first top-level Apache project led by a Chinese team
  • Apache Flink: Alibaba has been deeply involved since 2016, contributing over 50% of the code

Localized Platforms

  • Alibaba MaxCompute: processes EB-scale data daily; handled over 100 PB during Double 11
  • Huawei FusionInsight: Supports PB-level data management, thousand-node clusters

Future Outlook

  • “East Data West Computing” project promotion
  • Data factor market cultivation
  • From “following” to “running alongside”

Technology Evolution Summary

| Stage | Period | Core Technology |
| --- | --- | --- |
| Concept formation | 1997-2005 | 3V model, GFS, MapReduce |
| Open-source ecosystem | 2005-2012 | Hadoop ecosystem |
| In-memory computing | 2012-2014 | Spark |
| Diversification | 2015-present | Batch-stream integration, cloud-native, real-time data warehouses |