Overall Architecture Design
Technical Solution Selection
Technical solution selection includes:
- Framework selection
- Software selection
- Server selection
- Cluster sizing estimation
Framework Selection Comparison
Apache Community Version
- Completely open source and free
- Active community
- Detailed documentation
Third-party Distributions
CDH/HDP/FusionInsight:
- Clear version management
- Enhanced compatibility/stability
- Simple operation and maintenance
Software Selection List
Data Collection
- DataX
- Flume
- Sqoop
- Logstash
- Kafka
Data Storage
- HDFS
- HBase
Data Computation
- Hive
- MapReduce
- Tez
- Spark
- Flink
Scheduling System
- Airflow
- Azkaban
- Oozie
Metadata Management
- Atlas
Data Quality Management
- Griffin
Ad-hoc Query
- Impala
- Kylin
- ClickHouse
- Presto
- Druid
Server Selection
- Physical machines vs cloud hosts
Cluster Sizing Estimation Example
Based on the following assumptions:
- 5 million DAU
- 100 logs per person per day
- 1KB per log
- 3 replicas
- 6 months storage
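Daily raw volume is 5 million × 100 × 1 KB ≈ 500 GB. Over 6 months (about 180 days) with 3 replicas, that is roughly 500 GB × 180 × 3 ≈ 270 TB. Depending on the usable disk capacity per node, this works out to approximately 25 nodes.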
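The arithmetic behind this estimate can be sketched as a small function. This is a minimal sketch, not the source's exact method: the source does not state the per-node disk capacity, so `usable_tb_per_node=11` is a hypothetical value chosen only because it reproduces the figure of about 25 nodes. Treating 6 months as 180 days and using decimal units (1 GB = 1,000,000 KB) are likewise assumptions. Compression, warehouse layering, and headroom for growth are ignored.

```python
import math


def estimate_nodes(dau, logs_per_user, log_size_kb, replicas,
                   retention_days, usable_tb_per_node):
    """Return (daily_gb, total_tb, nodes) for a storage-driven cluster estimate."""
    # Raw log volume per day, converted from KB to GB (decimal units).
    daily_gb = dau * logs_per_user * log_size_kb / 1_000_000
    # Total stored volume over the retention window, including replicas, in TB.
    total_tb = daily_gb * retention_days * replicas / 1_000
    # Round up: a fraction of a node still requires a whole machine.
    nodes = math.ceil(total_tb / usable_tb_per_node)
    return daily_gb, total_tb, nodes


daily_gb, total_tb, nodes = estimate_nodes(
    dau=5_000_000,
    logs_per_user=100,
    log_size_kb=1,
    replicas=3,
    retention_days=180,       # assumption: 6 months treated as 180 days
    usable_tb_per_node=11,    # hypothetical: not given in the source
)
print(f"{daily_gb:.0f} GB/day, {total_tb:.0f} TB total, {nodes} nodes")
```

Changing `usable_tb_per_node` or adding a free-space reserve shifts the node count substantially, which is why the estimate should be rerun whenever the growth assumptions change.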
Data Warehouse Naming Standards
Database Naming Rules
ods / dwd / dws / dim / temp / ads
ODS Layer Naming Format
ods_{business_line|project}_{data_source_type}_{business}
DWD Layer Naming Format
dwd_{business_line|project}_{theme_domain}_{sub_business}
DWS Layer Naming Format
dws_{business_line|project}_{theme_domain}_{summary_granularity}_{summary_time_period}
ADS Layer Naming Format
ads_{business_line|project}_{statistics_business}_{report_form|hot_sorting_topN}
DIM Layer Naming Format
dim_{business_line|project|pub_common}_{dimension}
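The naming formats above can be enforced with a small helper that assembles table names from their parts. This is an illustrative sketch only: the function `build_table_name`, the `PART_NAMES` mapping, and the example values (`mall`, `mysql`, `order`, and so on) are hypothetical, and restricting each part to lowercase letters, digits, and underscores is an assumption rather than a rule stated in the source.

```python
import re

# Parts expected after the layer prefix, following the formats listed above.
PART_NAMES = {
    "ods": ("business_line", "data_source_type", "business"),
    "dwd": ("business_line", "theme_domain", "sub_business"),
    "dws": ("business_line", "theme_domain",
            "summary_granularity", "summary_time_period"),
    "ads": ("business_line", "statistics_business", "report_form"),
    "dim": ("business_line", "dimension"),
}

# Assumption: each part is lowercase alphanumeric, optionally with underscores.
_PART = re.compile(r"[a-z0-9]+(_[a-z0-9]+)*")


def build_table_name(layer, *parts):
    """Join a layer prefix and its parts into a table name, validating each part."""
    expected = PART_NAMES.get(layer)
    if expected is None:
        raise ValueError(f"unknown layer: {layer!r}")
    if len(parts) != len(expected):
        raise ValueError(
            f"{layer} expects {len(expected)} parts {expected}, got {len(parts)}"
        )
    for part in parts:
        if not _PART.fullmatch(part):
            raise ValueError(f"invalid name part: {part!r}")
    return "_".join((layer, *parts))


print(build_table_name("ods", "mall", "mysql", "order"))
print(build_table_name("dws", "mall", "trade", "user", "1d"))
print(build_table_name("dim", "pub_common", "date"))
```

Generating names through one function, rather than typing them by hand, keeps the layer prefix and part order consistent across teams and makes violations fail early.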
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Cluster disk alerts soon after going live | Capacity estimation issue | Re-evaluate data growth expectations, increase storage capacity |
| Component compatibility issues | Version management issue | Sort out component dependencies, unify versions |
| Installation and deployment take too long | Lack of integrated deployment/operations tools | Adopt a distribution with a management platform, such as CDH or HDP |
| Slow troubleshooting, high risk when changing configuration | Lack of centralized management tools | Use cluster management tools for configuration management |
| Data warehouse layering is chaotic, table names unreadable | No unified naming standards | Develop and enforce naming standards |
| Permission or resource conflicts when creating Hive databases/tables | Insufficient environment isolation | Isolate environments with separate namespaces and queues |