Big Data 260 - Realtime Warehouse

Project Background

With the growth of the internet, data timeliness has become increasingly important to fine-grained enterprise operations. Extracting valuable information in real time from the massive volume of data generated every day helps enterprises adjust their decisions and operational strategies. In addition, as 5G technology matures and spreads, enterprises that depend on timely internet and IoT data need real-time data systems to stay competitive.

Scenarios in which data timeliness matters to enterprise operations include:

  • Real-time recommendations
  • Precision marketing
  • Advertising effectiveness
  • Real-time logistics

Data Warehouse Concepts

Offline Data Warehouse Architecture

Real-time Data Warehouse Architecture

Collection Layer

  • Sources: Binlog (business logs), IoT (Internet of Things) data, and backend service logs (system logs)
  • After processing by the log collection team and the DB collection team, the data is collected into Kafka, where it feeds both real-time and offline computing.

Storage Layer

  • Kafka: Real-time incremental data
  • HDFS: State data storage and full data storage (persistence layer)
  • HBase: Dimension data storage
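
One way to picture how these stores cooperate is a dimension lookup: incremental events (as Kafka would deliver them) are enriched with attributes from a dimension table (as HBase would serve them). The sketch below is only an illustration under assumptions: it uses in-memory stand-ins instead of real Kafka or HBase clients, and the field names (shop_id, shop_name, region) are hypothetical.

```python
# In-memory stand-ins; a real deployment would read events from Kafka
# and look up dimension rows in HBase.
SHOP_DIMENSION = {  # hypothetical dimension rows keyed by shop_id
    "s1": {"shop_name": "Shop A", "region": "North"},
    "s2": {"shop_name": "Shop B", "region": "South"},
}


def enrich(events, dimension):
    """Attach dimension attributes to each incremental event.

    Events whose key has no dimension row are passed through unchanged.
    """
    for event in events:
        dim = dimension.get(event["shop_id"], {})
        yield {**event, **dim}


if __name__ == "__main__":
    incoming = [
        {"order_id": 1, "shop_id": "s1"},
        {"order_id": 2, "shop_id": "unknown"},
    ]
    for row in enrich(incoming, SHOP_DIMENSION):
        print(row)
```

Passing unmatched events through unchanged is a design choice made for this sketch; a production job might instead route them to a retry or dead-letter path.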

Engine Layer

Real-time processing framework

Platform Layer

Manage cluster resources from three perspectives: data, tasks, and resources

Application Layer

Application scenarios of the underlying architecture

  • Traffic data generation: Data generated from different channel tracking and different page tracking
  • Collection: Split the data into business channels according to business dimensions
  • Application: Serve downstream businesses in streaming mode, for example for traffic analysis

Real-time Effectiveness Verification

  • CPV (Cost Per View): Used for rich media ads and billed per display, i.e., the advertiser is charged each time a page carrying the advertisement is opened.
  • CPC (Cost Per Click) and CTR (Click-Through Rate): In today's advertising industry, CPC is hard to use as an effectiveness metric; it serves more as a billing unit. CTR is sometimes still used as an effectiveness metric, mostly to compare advertising strategies, optimization strategies, and creative quality.
  • Reach Rate: After an advertisement generates a click, subsequent metrics include reach. The click-to-reach rate is an important metric; a high reach rate reflects advertisement effectiveness.
  • Conversion Rate: The share of reached users who go on to convert. Measured from reach to conversion, it is a metric for evaluating advertisement effectiveness.
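
The click, reach, and conversion metrics above form a funnel, and each rate is the ratio of one stage to the previous one. The following is a minimal sketch of that arithmetic, assuming per-advertisement counters are already available; the AdCounts structure and its field names are hypothetical and not taken from the project.

```python
from dataclasses import dataclass


@dataclass
class AdCounts:
    """Hypothetical per-ad counters accumulated from the event stream."""
    impressions: int
    clicks: int
    reaches: int
    conversions: int


def safe_ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def funnel_metrics(c: AdCounts) -> dict:
    """Compute each funnel stage as a ratio of the preceding stage."""
    return {
        "ctr": safe_ratio(c.clicks, c.impressions),            # impression -> click
        "reach_rate": safe_ratio(c.reaches, c.clicks),         # click -> reach
        "conversion_rate": safe_ratio(c.conversions, c.reaches),  # reach -> conversion
    }


if __name__ == "__main__":
    print(funnel_metrics(AdCounts(impressions=1000, clicks=50, reaches=40, conversions=10)))
```

Returning 0.0 for an empty denominator is a convenience chosen for this sketch; a reporting system might prefer to mark such rates as undefined.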

Requirements Analysis

  • Log data: Startup logs, click logs, advertising logs
  • Business data: Analysis of core transaction data such as user orders, order submissions, payments, and refunds
  • Real-time statistics on advertising traffic: Generate dynamic blacklists
  • Malicious order brushing: Issue real-time warnings when malicious order brushing is detected; filter behavior based on dynamic blacklists; every 5 minutes, calculate the click volume of each advertisement over the past hour; each day, calculate the popular advertisements by province and the click trend of each advertisement over the past hour
  • Click source: Analyze where users come from across different dimensions
  • Channel quality: Analyze users on several aspects: visit duration, whether they made a purchase, first purchase amount, favorites, and number of pages visited (PV)
  • Risk control: Real-time warnings when detecting transaction anomalies
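
The requirement to report, every 5 minutes, the click volume of each advertisement over the past hour is a sliding-window count, and the dynamic blacklist acts as a filter in front of it. The sketch below illustrates that logic in plain Python under stated assumptions: events are (timestamp_seconds, user_id, ad_id) tuples, and a user is blacklisted after exceeding an assumed threshold of clicks on a single advertisement within the window. The threshold value and all function names are hypothetical. A stream engine such as Flink would normally handle this with its own windowing.

```python
from collections import defaultdict

WINDOW_SECONDS = 60 * 60  # look back one hour
SLIDE_SECONDS = 5 * 60    # produce a result every five minutes
CLICK_THRESHOLD = 100     # assumed per-user, per-ad limit within one window


def sliding_click_counts(clicks, window_end, blacklist=frozenset()):
    """Count clicks per ad in (window_end - WINDOW_SECONDS, window_end].

    clicks: iterable of (timestamp_seconds, user_id, ad_id).
    Clicks from blacklisted users are ignored.
    """
    window_start = window_end - WINDOW_SECONDS
    counts = defaultdict(int)
    for ts, user_id, ad_id in clicks:
        if user_id in blacklist:
            continue
        if window_start < ts <= window_end:
            counts[ad_id] += 1
    return dict(counts)


def update_blacklist(clicks, window_end, blacklist, threshold=CLICK_THRESHOLD):
    """Return the blacklist extended with users exceeding the threshold."""
    window_start = window_end - WINDOW_SECONDS
    per_user_ad = defaultdict(int)
    for ts, user_id, ad_id in clicks:
        if window_start < ts <= window_end:
            per_user_ad[(user_id, ad_id)] += 1
    offenders = {user for (user, _), n in per_user_ad.items() if n > threshold}
    return set(blacklist) | offenders


def window_ends(first_end, last_end):
    """Window end timestamps spaced SLIDE_SECONDS apart, inclusive."""
    return range(first_end, last_end + 1, SLIDE_SECONDS)


if __name__ == "__main__":
    events = [(100, "u1", "ad1"), (200, "u2", "ad1"), (3700, "u1", "ad2")]
    for end in window_ends(3600, 4200):
        print(end, sliding_click_counts(events, end))
```

Rescanning all events for every window keeps the sketch short; a real job would maintain incremental state rather than recompute each window from scratch.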

Technology Selection

Technology Selection Approach

Framework selection: the Apache community version or a third-party distribution (CDH, HDP, Fusion Insight).

Advantages of Apache community version:

  • Completely open source and free
  • Active community
  • Detailed documentation and materials

Disadvantages of Apache community version:

  • Complex version management
  • Complex cluster installation
  • Complex cluster operations and maintenance
  • Complex ecosystem

Third-party distributions (CDH, HDP, Fusion Insight): Hadoop is released under the Apache open source license, so anyone may modify and use it free of charge, which is what allows vendors to build their own distributions.

Advantages of these products:

  • Main features are consistent with the community
  • Clear version management
  • Enhanced compatibility, security, and stability compared to Apache Hadoop
  • Fast version updates
  • Based on stable Apache Hadoop versions, with the latest bug fixes applied
  • Provides deployment, installation, and configuration tools, greatly improving cluster deployment efficiency

CDH: The most mature distribution, with the most deployment cases. It provides powerful deployment, management, and monitoring tools, is the most widely used distribution in China, and has strong community support.

HDP: 100% open source and open to secondary development (customization), but less stable than CDH and less widely used in China.

Fusion Insight: Developed by Huawei on top of Hadoop 2.7, following the principles of layering, decoupling, and openness. Thanks to its high reliability, it has many deployments in government, telecom operator, and financial systems across the country.

Software Selection Approach

  • Data collection: Flume, Canal
  • Data storage: MySQL, Kafka, HBase, Redis
  • Data computing: Flink
  • OLAP: ClickHouse, Druid

Logical Architecture

Business Database Table Structure

Business database:

  • Trade orders table (trade_orders)
  • Order product table (order_product)
  • Product information table (product_info)
  • Product category table (product_category)
  • Merchant store table (shops)
  • Merchant regional organization table (shop_admin_org)
  • Payment method table (payments)