Core Concepts

Understanding Kafka Data Streams

Kafka is a distributed streaming platform built for high-throughput message passing. In ETL pipelines, Kafka typically serves as the message queue or the streaming data source.

Dimension Table Design

Dimension tables hold descriptive attributes of business entities, such as product names, customer details, and time dimensions. Each dimension table has a unique primary key that identifies its records.

Core Code Implementation

TableObject Sample Class

case class TableObject(database: String, tableName: String, typeInfo: String, dataInfo: String) extends Serializable

AreaInfo Dimension Data

case class AreaInfo(id: String, name: String, pid: String, sname: String, level: String, citycode: String, yzcode: String, mername: String, Lng: String, Lat: String, pinyin: String)

DataInfo Business Data

case class DataInfo(modifiedTime: String, orderNo: String, isPay: String, orderId: String, tradeSrc: String, payTime: String, productMoney: String, totalMoney: String, dataFlag: String, userId: String, areaId: String, createTime: String, payMethod: String, isRefund: String, tradeType: String, status: String)

HBase Connection and Writing

ConnHBase

Creates the HBase connection and configures the ZooKeeper quorum address.
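A minimal sketch of such a connection helper, using the standard HBase client API; the ZooKeeper hosts and port below are placeholder assumptions, not values from the source:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

// Helper that builds an HBase Connection from a ZooKeeper quorum.
// Host names and port are placeholders for illustration.
class ConnHBase {
  def connToHbase: Connection = {
    val conf: Configuration = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "node1,node2,node3")      // placeholder hosts
    conf.set("hbase.zookeeper.property.clientPort", "2181")      // default ZooKeeper port
    ConnectionFactory.createConnection(conf)
  }
}
```

Creating the `Connection` once and reusing it is the intended pattern, since HBase connections are heavyweight and thread-safe.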

SinkHBase

Implements Flink's RichSinkFunction to write data to HBase, exposing methods such as insertArea, insertTradeOrders, and deleteArea.
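As a sketch (not the source's exact implementation), a RichSinkFunction for the AreaInfo dimension might look as follows. The HBase table name "dim_area", the column family "f1", and the reuse of the ConnHBase helper above are assumptions:

```scala
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Put, Table}
import org.apache.hadoop.hbase.util.Bytes

// Writes AreaInfo records into an HBase dimension table.
// "dim_area" and column family "f1" are illustrative names.
class SinkAreaToHBase extends RichSinkFunction[AreaInfo] {
  private var conn: Connection = _
  private var table: Table = _

  override def open(parameters: Configuration): Unit = {
    conn = new ConnHBase().connToHbase                    // open connection once per task
    table = conn.getTable(TableName.valueOf("dim_area"))
  }

  override def invoke(value: AreaInfo, context: SinkFunction.Context): Unit = {
    val put = new Put(Bytes.toBytes(value.id))            // row key = dimension primary key
    put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("name"), Bytes.toBytes(value.name))
    put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("pid"), Bytes.toBytes(value.pid))
    table.put(put)
  }

  override def close(): Unit = {
    if (table != null) table.close()
    if (conn != null) conn.close()
  }
}
```

Opening the connection in `open()` and releasing it in `close()` follows the RichSinkFunction lifecycle, so each parallel sink instance holds exactly one connection.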

Kafka Data Source

SourceKafka

Creates a FlinkKafkaConsumer and configures bootstrap.servers, the deserializer, the consumer group, and other properties.
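A minimal sketch of such a source factory; broker addresses and the consumer group id are placeholder assumptions:

```scala
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

// Builds a Kafka consumer for the given topic.
// Broker list and group id are placeholders for illustration.
class SourceKafka {
  def getKafkaSource(topicName: String): FlinkKafkaConsumer[String] = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", "node1:9092,node2:9092")  // placeholder brokers
    props.setProperty("group.id", "etl-consumer-group")              // placeholder group id
    props.setProperty("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.setProperty("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    new FlinkKafkaConsumer[String](topicName, new SimpleStringSchema(), props)
  }
}
```

The returned consumer is passed to `StreamExecutionEnvironment.addSource(...)` to obtain a `DataStream[String]` of raw JSON messages.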

Main Program Flow

KafkaToHBase

  1. Create execution environment
  2. Get Kafka data source
  3. Parse JSON data
  4. Distribute processing based on database and table name
  5. Write to corresponding HBase tables
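Step 4 above, distributing records by database and table name, can be sketched as a pure routing function over TableObject (repeated here from the sample class for self-containment). All database and table names in this sketch are illustrative assumptions:

```scala
// Change record wrapper, as defined in the TableObject sample class above.
case class TableObject(database: String, tableName: String, typeInfo: String, dataInfo: String)
  extends Serializable

// Decide which HBase table a change record targets; names are illustrative.
def route(obj: TableObject): Option[String] = (obj.database, obj.tableName) match {
  case ("shop", "area")         => Some("dim_area")      // dimension table
  case ("shop", "trade_orders") => Some("trade_orders")  // fact table
  case _                        => None                  // ignore unrelated tables
}
```

In the real job this routing would feed each matched record into the corresponding HBase sink; unmatched records are dropped.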

Dimension Table Update Methods

  • Full update: overwrite the entire dimension table each time new data is fetched
  • Incremental update: update only changed records, comparing a timestamp or version number to ensure data consistency
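The incremental strategy can be sketched as a version-aware upsert: an incoming record replaces the stored one only if it is newer. The record shape and field names here are simplified illustrations, not the source's DataInfo schema:

```scala
// Simplified dimension record; "version" stands in for a timestamp or version number.
case class DimRecord(id: String, name: String, version: Long)

// Insert the record if its key is new, overwrite only if it is newer,
// and ignore stale updates with an older version.
def upsert(table: Map[String, DimRecord], incoming: DimRecord): Map[String, DimRecord] =
  table.get(incoming.id) match {
    case Some(existing) if existing.version >= incoming.version => table  // stale, keep existing
    case _ => table + (incoming.id -> incoming)                           // insert or replace
  }
```

Rejecting records with an older version makes the update idempotent under replays and out-of-order delivery, which is exactly the consistency guarantee the incremental strategy needs.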