Core Concepts

Understanding Kafka Data Streams

Kafka is a distributed streaming platform built for high-throughput message passing. In ETL pipelines, Kafka typically serves as the message queue or the streaming data source.

Dimension Table Design

Dimension tables hold descriptive attributes of business entities, such as product names, customer details, and time dimensions. Each dimension table has a unique primary key that identifies its records.

Core Code Implementation

TableObject Sample Class

case class TableObject(database: String, tableName: String, typeInfo: String, dataInfo: String) extends Serializable

AreaInfo Dimension Data

case class AreaInfo(id: String, name: String, pid: String, sname: String, level: String, citycode: String, yzcode: String, mername: String, Lng: String, Lat: String, pinyin: String)

DataInfo Business Data

case class DataInfo(modifiedTime: String, orderNo: String, isPay: String, orderId: String, tradeSrc: String, payTime: String, productMoney: String, totalMoney: String, dataFlag: String, userId: String, areaId: String, createTime: String, payMethod: String, isRefund: String, tradeType: String, status: String)

HBase Connection and Writing

ConnHBase

Creates the HBase connection and configures the ZooKeeper quorum address.
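A minimal sketch of such a connection helper, using the standard HBase client API; the ZooKeeper hosts and port below are placeholder assumptions, not values from the source:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

// Helper that builds an HBase Connection from a ZooKeeper quorum.
// Host names and port are placeholders for illustration.
class ConnHBase {
  def connToHbase: Connection = {
    val conf: Configuration = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "node1,node2,node3")      // placeholder hosts
    conf.set("hbase.zookeeper.property.clientPort", "2181")      // default ZooKeeper port
    ConnectionFactory.createConnection(conf)
  }
}
```

Creating the `Connection` once and reusing it is the intended pattern, since HBase connections are heavyweight and thread-safe.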

SinkHBase

Implements Flink's RichSinkFunction to write data to HBase, exposing methods such as insertArea, insertTradeOrders, and deleteArea.
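As a sketch (not the source's exact implementation), a RichSinkFunction for the AreaInfo dimension might look as follows. The HBase table name "dim_area", the column family "f1", and the reuse of the ConnHBase helper above are assumptions:

```scala
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Put, Table}
import org.apache.hadoop.hbase.util.Bytes

// Writes AreaInfo records into an HBase dimension table.
// "dim_area" and column family "f1" are illustrative names.
class SinkAreaToHBase extends RichSinkFunction[AreaInfo] {
  private var conn: Connection = _
  private var table: Table = _

  override def open(parameters: Configuration): Unit = {
    conn = new ConnHBase().connToHbase                    // open connection once per task
    table = conn.getTable(TableName.valueOf("dim_area"))
  }

  override def invoke(value: AreaInfo, context: SinkFunction.Context): Unit = {
    val put = new Put(Bytes.toBytes(value.id))            // row key = dimension primary key
    put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("name"), Bytes.toBytes(value.name))
    put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("pid"), Bytes.toBytes(value.pid))
    table.put(put)
  }

  override def close(): Unit = {
    if (table != null) table.close()
    if (conn != null) conn.close()
  }
}
```

Opening the connection in `open()` and releasing it in `close()` follows the RichSinkFunction lifecycle, so each parallel sink instance holds exactly one connection.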

Kafka Data Source

SourceKafka

Creates a FlinkKafkaConsumer and configures bootstrap.servers, the deserializer, the consumer group, and other properties.
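A minimal sketch of such a source factory; broker addresses and the consumer group id are placeholder assumptions:

```scala
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

// Builds a Kafka consumer for the given topic.
// Broker list and group id are placeholders for illustration.
class SourceKafka {
  def getKafkaSource(topicName: String): FlinkKafkaConsumer[String] = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", "node1:9092,node2:9092")  // placeholder brokers
    props.setProperty("group.id", "etl-consumer-group")              // placeholder group id
    props.setProperty("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.setProperty("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    new FlinkKafkaConsumer[String](topicName, new SimpleStringSchema(), props)
  }
}
```

The returned consumer is passed to `StreamExecutionEnvironment.addSource(...)` to obtain a `DataStream[String]` of raw JSON messages.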

Main Program Flow

KafkaToHBase

  1. Create execution environment
  2. Get Kafka data source
  3. Parse JSON data
  4. Distribute processing based on database and table name
  5. Write to corresponding HBase tables
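Step 4 above, distributing records by database and table name, can be sketched as a pure routing function over TableObject (repeated here from the sample class for self-containment). All database and table names in this sketch are illustrative assumptions:

```scala
// Change record wrapper, as defined in the TableObject sample class above.
case class TableObject(database: String, tableName: String, typeInfo: String, dataInfo: String)
  extends Serializable

// Decide which HBase table a change record targets; names are illustrative.
def route(obj: TableObject): Option[String] = (obj.database, obj.tableName) match {
  case ("shop", "area")         => Some("dim_area")      // dimension table
  case ("shop", "trade_orders") => Some("trade_orders")  // fact table
  case _                        => None                  // ignore unrelated tables
}
```

In the real job this routing would feed each matched record into the corresponding HBase sink; unmatched records are dropped.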

Dimension Table Update Methods

  • Full update: overwrite the entire dimension table each time new data is fetched
  • Incremental update: update only changed records, comparing a timestamp or version number to ensure data consistency
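The incremental strategy can be sketched as a version-aware upsert: an incoming record replaces the stored one only if it is newer. The record shape and field names here are simplified illustrations, not the source's DataInfo schema:

```scala
// Simplified dimension record; "version" stands in for a timestamp or version number.
case class DimRecord(id: String, name: String, version: Long)

// Insert the record if its key is new, overwrite only if it is newer,
// and ignore stale updates with an older version.
def upsert(table: Map[String, DimRecord], incoming: DimRecord): Map[String, DimRecord] =
  table.get(incoming.id) match {
    case Some(existing) if existing.version >= incoming.version => table  // stale, keep existing
    case _ => table + (incoming.id -> incoming)                           // insert or replace
  }
```

Rejecting records with an older version makes the update idempotent under replays and out-of-order delivery, which is exactly the consistency guarantee the incremental strategy needs.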