Overall Architecture Design

Technical Solution Selection

Technical solution selection includes:

  • Framework selection
  • Software selection
  • Server selection
  • Cluster sizing estimation

Framework Selection Comparison

Apache Community Version

  • Completely open source and free
  • Active community
  • Detailed documentation

Third-party Distributions

CDH/HDP/FusionInsight:

  • Clear version management
  • Enhanced compatibility/stability
  • Simple operation and maintenance

Software Selection List

Data Collection

  • DataX
  • Flume
  • Sqoop
  • Logstash
  • Kafka

Data Storage

  • HDFS
  • HBase

Data Computation

  • Hive
  • MapReduce
  • Tez
  • Spark
  • Flink

Scheduling System

  • Airflow
  • Azkaban
  • Oozie

Metadata Management

  • Atlas

Data Quality Management

  • Griffin

Ad-hoc Query

  • Impala
  • Kylin
  • ClickHouse
  • Presto
  • Druid

Server Selection

  • Physical machines vs cloud hosts

Cluster Sizing Estimation Example

Based on the following assumptions:

  • 5 million DAU
  • 100 logs per person per day
  • 1KB per log
  • 3 replicas
  • 6 months of storage

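These assumptions give roughly 500 GB of raw logs per day, about 90 TB over six months, and about 270 TB once stored with 3 replicas. After allowing for free-space headroom and the disk capacity of each node, the estimate comes to approximately 25 nodes.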
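The arithmetic behind an estimate like this can be sketched in a few lines of Python. This is only a rough sketch, not the exact method used for the figure above: the 30% free-space headroom, the 16 TB of usable disk per node, the use of 180 days for six months, and decimal units (1 KB = 1000 bytes) are all assumptions that do not appear in the list of inputs. It also ignores compression and the extra copies that warehouse layering creates.

```python
import math


def estimate_nodes(dau, logs_per_user_per_day, bytes_per_log,
                   replicas, retention_days,
                   headroom_fraction, usable_bytes_per_node):
    """Estimate how many storage nodes are needed for raw log retention."""
    # Raw log volume generated per day, in bytes.
    raw_per_day = dau * logs_per_user_per_day * bytes_per_log
    # Volume kept over the whole retention window.
    raw_total = raw_per_day * retention_days
    # HDFS-style replication multiplies the stored volume.
    replicated = raw_total * replicas
    # Keep a fraction of every disk free (assumed headroom).
    required = replicated / (1 - headroom_fraction)
    # Round up: a partial node still means one more machine.
    return math.ceil(required / usable_bytes_per_node)


nodes = estimate_nodes(
    dau=5_000_000,
    logs_per_user_per_day=100,
    bytes_per_log=1_000,               # 1 KB, decimal units (assumption)
    replicas=3,
    retention_days=180,                # "6 months" taken as 180 days (assumption)
    headroom_fraction=0.30,            # assumed free-space buffer
    usable_bytes_per_node=16 * 10**12, # assumed 16 TB usable per node
)
print(nodes)
```

With these assumed values the function returns 25. A different headroom or per-node disk size changes the result, so the inputs should be replaced with figures from the actual hardware plan.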

Data Warehouse Naming Standards

Database Naming Rules

ods / dwd / dws / dim / temp / ads

ODS Layer Naming Format

ods_{business_line|project}_{data_source_type}_{business}

DWD Layer Naming Format

dwd_{business_line|project}_{theme_domain}_{sub_business}

DWS Layer Naming Format

dws_{business_line|project}_{theme_domain}_{summary_granularity}_{summary_time_period}

ADS Layer Naming Format

ads_{business_line|project}_{statistics_business}_{report_form|hot_sorting_topN}

DIM Layer Naming Format

dim_{business_line|project|pub_common}_{dimension}
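One way to keep table names consistent with the formats above is to generate them through a small helper instead of typing them by hand. The following is a minimal sketch: the function name build_table_name, the field labels, and the rule that every field must be lowercase snake_case are assumptions made for illustration, and the example values (shop, mysql, order_info, and so on) are hypothetical. The temp layer is left out because no format is defined for it.

```python
import re

# Fields expected after the layer prefix, following the naming formats above.
LAYER_FIELDS = {
    "ods": ("business_line_or_project", "data_source_type", "business"),
    "dwd": ("business_line_or_project", "theme_domain", "sub_business"),
    "dws": ("business_line_or_project", "theme_domain",
            "summary_granularity", "summary_time_period"),
    "ads": ("business_line_or_project", "statistics_business",
            "report_form_or_topn"),
    "dim": ("business_line_or_project_or_pub_common", "dimension"),
}

# Assumption: each field is lowercase letters/digits, optionally joined by "_".
_FIELD_RE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*$")


def build_table_name(layer, *fields):
    """Join a layer prefix and its fields into a table name."""
    if layer not in LAYER_FIELDS:
        raise ValueError(f"unknown layer: {layer!r}")
    expected = LAYER_FIELDS[layer]
    if len(fields) != len(expected):
        raise ValueError(
            f"{layer} expects {len(expected)} fields {expected}, "
            f"got {len(fields)}"
        )
    for name, value in zip(expected, fields):
        if not _FIELD_RE.match(value):
            raise ValueError(f"{name}={value!r} is not lowercase snake_case")
    return "_".join((layer, *fields))


# Hypothetical example values, for illustration only.
print(build_table_name("ods", "shop", "mysql", "order_info"))
print(build_table_name("dws", "shop", "trade", "user", "1d"))
print(build_table_name("dim", "pub_common", "date"))
```

The three calls print ods_shop_mysql_order_info, dws_shop_trade_user_1d, and dim_pub_common_date. A call with the wrong number of fields or with uppercase characters raises ValueError, which makes the helper usable as a check in a review or CI step.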

Error Quick Reference

Symptom | Root Cause | Fix
Cluster disk alerts soon after going live | Capacity was underestimated | Re-evaluate expected data growth and add storage capacity
Component compatibility issues | Poor version management | Map out component dependencies and standardize versions
Installation and deployment take too long | No integrated deployment/operations tooling | Adopt a management platform such as CDH or HDP
Slow troubleshooting, high risk when changing configuration | No centralized management tooling | Manage configuration through a cluster management tool
Chaotic data warehouse layering, unreadable table names | No unified naming standards | Define and enforce naming standards
Permission or resource conflicts when creating Hive databases/tables | Insufficient environment isolation | Isolate environments with separate namespaces and resource queues