Basic Environment Setup: Hadoop Cluster
This article is migrated from Juejin. Original link: Big Data 01 - Basic Environment Setup
Full-stack Big Data engineering with Hadoop, Hive, Kafka, Spark, and Flink.
Detailed explanation of the Hadoop three-node cluster XML configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
Complete guide for Hadoop three-node cluster SSH passwordless login: generate RSA keys, distribute public keys, write rsync cluster distribution script.
Complete startup process for a Hadoop three-node cluster: format the NameNode, start HDFS and YARN, and verify cluster status via the Web UI, including the start-dfs.sh and start-yarn.sh scripts.
Complete WordCount execution on Hadoop cluster: upload files to HDFS, submit MapReduce job, view running status through YARN UI, verify true distributed computing.
Configure Hadoop JobHistoryServer to record MapReduce job execution history, enable YARN log aggregation, view job details and logs via Web UI.
Deep dive into HDFS architecture: NameNode, DataNode, and Client roles, the Block storage mechanism, the file read/write process (Pipeline write and nearest-replica read), and HDFS basic commands.
Complete HDFS CLI practice: hadoop fs common commands including directory operations, file upload/download, permission management, with three-node cluster live demo.
This is article 9 in the Big Data series. Learn to operate HDFS through Java code, master Hadoop's Java Client API.
Implement Hadoop MapReduce WordCount from scratch: a detailed explanation of Hadoop's serialization mechanism, plus writing the three components: Mapper, Reducer, and Driver.
This is article 11 in the Big Data series. Introduces four classic strategies for implementing multi-table JOIN in MapReduce framework and their Java implementations.
Introduction to Hive data warehouse core concepts, architecture components, and pros/cons, with detailed steps to install and configure Hive 2.3.9 on a three-node Hadoop cluster.
Systematic explanation of Hive DDL (database/table creation, internal and external tables) and DML (data loading, insertion, query) operations.
Deep dive into Hive's multiple data import methods (LOAD/INSERT/External Table/Sqoop), data export methods, and practical usage of HQL query operations like aggregation...
Detailed explanation of Hive Metastore's embedded, local, and remote deployment modes, and complete steps to configure a high-availability remote Metastore on a three-node cluster.
Introduction to HiveServer2 architecture and role, configure Hadoop proxy user and WebHDFS, implement cross-node JDBC remote access to Hive via Beeline client.
Introduction to Apache Flume positioning, core components (Source, Channel, Sink), event model and common data flow topologies, and installation configuration methods.
Through Flume's simplest Hello World case, use a netcat source to monitor a port, a memory channel for buffering, and a logger sink for console output, demonstrating the complete Source→Channel→Sink flow.
Use a Flume exec source to tail Hive log files in real time, buffer via a memory channel, and configure an HDFS sink to write with time-based partitioning, implementing automated log data collection into HDFS.
This is article 20 in the Big Data series. Demonstrates Flume replication mode with dual Sink architecture—same data written to both HDFS and local filesystem.
Introduction to Apache Sqoop core principles, use cases, and installation/configuration steps on a Hadoop cluster, helping you quickly get started with batch data migration between relational databases and Hadoop.
Complete example demonstrating Sqoop importing MySQL table data to HDFS, covering core parameter explanations, the MapReduce parallelism mechanism, and execution result verification.
Detailed explanation of three ways Sqoop imports partial data from MySQL to HDFS by condition: custom query, specified columns, and WHERE-condition filtering, with applicable scenarios.
Demonstrates Sqoop importing MySQL data directly into a Hive table, and exporting Hive data back to MySQL, covering the usage of key parameters like --hive-import and --create-hive-table.
Introduce Sqoop's --incremental append incremental import mechanism, and deeply explain CDC (Change Data Capture) core concepts, capture method comparisons...
Introduction to ZooKeeper core concepts, the Leader/Follower/Observer role division, ZAB protocol principles, and a demonstration of 3-node cluster installation and configuration.
Deep dive into zoo.cfg core parameter meanings, the myid file configuration conventions, and a demonstration of the 3-node cluster startup process and Leader election result verification.
Deep dive into ZooKeeper's four ZNode node types, ZXID transaction ID structure, and one-time trigger Watcher monitoring mechanism principles and practice.
Complete analysis of Watcher registration-trigger-notification flow from client, WatchManager to ZooKeeper server, and zkCli command line practice demonstrating node CRUD...
Use ZkClient library to operate ZooKeeper via Java code, complete practical examples of session establishment, persistent node CRUD, child node change monitoring...
This is article 31 in the Big Data series. Deep analysis of ZooKeeper Leader election mechanism and ZAB (ZooKeeper Atomic Broadcast) protocol implementation principles.
This is article 32 in the Big Data series. Demonstrates how to implement fair distributed lock using ZooKeeper ephemeral sequential nodes, with complete Java code.
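The article's lock code is Java; as a hedged illustration of the same recipe, here is a minimal Python sketch using the kazoo client (the hostnames and lock path are assumptions):

```python
from kazoo.client import KazooClient

# Connection string assumed to match the series' three-node cluster naming.
zk = KazooClient(hosts="h121:2181,h122:2181,h123:2181")
zk.start()

# kazoo's Lock recipe creates an ephemeral sequential znode under the lock
# path; the client holding the lowest sequence number owns the lock (fair,
# FIFO ordering), and each waiter watches only its predecessor znode to
# avoid a thundering herd on release.
lock = zk.Lock("/locks/demo", "client-1")
with lock:  # blocks until this client's znode has the lowest sequence number
    print("critical section: only one client at a time")

zk.stop()
```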
Comprehensive analysis of the HBase distributed database overall architecture, including ZooKeeper coordination, the HMaster management node, and HRegionServer data nodes.
Step-by-step configure HBase single node environment, explain hbase-env.sh, hbase-site.xml key parameters, complete integration with Hadoop HDFS and ZooKeeper cluster.
This is article 35 in the Big Data series. Complete HBase distributed cluster deployment on three-node Hadoop + ZooKeeper cluster.
HBase Shell commands: create table, Put/Get/Scan/Delete operations, explain HBase data model with practical examples.
Using HBase Java Client API to implement table creation, insert, delete, Get query, full table scan, and range scan.
Introduction to Redis: in-memory data structure store, key-value database, with comparison to traditional databases and typical use cases.
Install Redis 6.2.9 from source on Ubuntu, configure redis.conf for daemon mode, start redis-server and verify connection via redis-cli.
Comprehensive explanation of the five Redis data types: String, List, Set, Sorted Set, and Hash. Includes common commands, underlying characteristics, and typical usage scenarios.
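As a quick taste of those five types, a minimal redis-py sketch (connection parameters and key names assumed):

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("page:views", 42)                            # String
r.lpush("queue:jobs", "job1", "job2")              # List (left push)
r.sadd("tags:post:1", "redis", "nosql")            # Set (duplicates ignored)
r.zadd("leaderboard", {"alice": 100, "bob": 85})   # Sorted Set: member -> score
r.hset("user:1", mapping={"name": "alice", "age": "30"})  # Hash

print(r.zrevrange("leaderboard", 0, 2, withscores=True))  # top scores first
```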
Deep dive into three advanced Redis data types: Bitmap, Geo (GeoHash, Z-order curve, Base32 encoding), and Stream, with common commands and practical examples.
Detailed explanation of the Redis Pub/Sub working mechanism, its three weak-guarantee flaws (no persistence, no acknowledgment, no retry), and alternative solutions in production.
Window operations integrate data from multiple batches over a longer time range by setting window length and slide duration.
This article introduces two Spark Streaming integration methods with Kafka: Receiver Approach and Direct Approach.
When Spark Streaming integrates with Kafka, Offset management is key to ensuring data processing continuity and consistency.
Offset is used to mark message position in Kafka partition. Proper management can achieve at-least-once or even exactly-once data processing semantics.
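A hedged kafka-python sketch of manual offset management: committing only after processing gives at-least-once (a crash before commit replays the batch; exactly-once additionally needs idempotent or transactional writes downstream). The topic, broker, and handler are placeholders:

```python
from kafka import KafkaConsumer

def handle(value: bytes) -> None:
    # placeholder processing step; replace with real ETL logic
    print(value)

consumer = KafkaConsumer(
    "orders",                         # assumed topic name
    bootstrap_servers="h121:9092",    # assumed broker address
    group_id="etl-group",
    enable_auto_commit=False,         # take over offset management ourselves
    auto_offset_reset="earliest",
)

for msg in consumer:
    handle(msg.value)
    consumer.commit()   # commit only after success: at-least-once semantics
```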
Systematic explanation of the Redis Lua script EVAL command syntax and the differences between redis.call and redis.pcall.
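For instance, a minimal redis-py EVAL sketch (the key name and TTL are illustrative); note that redis.call raises errors back to the client, while redis.pcall would capture them inside the script:

```python
import redis

r = redis.Redis()

# Atomically increment a counter and set a TTL on first use: a classic
# rate-limiter building block that must run as one atomic unit.
script = """
local v = redis.call('INCR', KEYS[1])
if v == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return v
"""
# eval(script, numkeys, key1, ..., arg1, ...)
print(r.eval(script, 1, "rate:limit:user:1", 60))
```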
Detailed explanation of Redis slow query log configuration parameters (slowlog-log-slower-than, slowlog-max-len), core commands.
Apache Flink is an open-source big data stream processing framework, supporting efficient computation of unbounded stream and bounded batch data.
Definition: Stream processing means real-time processing of continuously flowing data streams.
Flink's runtime architecture adopts typical Master/Slave pattern with clear division of responsibilities among core components.
Flink provides multiple installation modes to suit different scenarios.
Deploying Flink in YARN mode requires completing a series of environment configuration and cluster management operations.
A Flink program is composed of DataSource, Transformation, and Sink. DataSource provides diverse data input methods including file systems, message queues, databases, and custom sources.
Systematic comparison of Redis two persistence solutions: RDB snapshot and AOF log — configuration methods, trigger mechanisms, pros and cons, AOF rewrite mechanism.
In-depth analysis of Redis RDB persistence mechanism, covering trigger methods, BGSAVE execution flow, configuration parameters, file structure, and comparison with AOF.
Non-Parallel Source is a Flink source operation with a fixed parallelism of 1. It runs as a single instance regardless of cluster scale, ensuring data is emitted in order from one task.
RichSourceFunction and RichParallelSourceFunction are enhanced source functions suitable for scenarios requiring complex logic and resource management.
Flink provides rich operators for DataStream to support flexible data stream processing in different scenarios.
Flink's Sink is the final output endpoint for data stream processing, used to write processed results to external systems or storage media.
In Apache Flink, JDBC Sink is an important data output component that allows writing stream or batch processed data to relational databases through JDBC connections.
Apache Flink's DataSet API is the core programming interface for Flink batch processing, specifically designed for processing static, bounded datasets.
Comprehensive analysis of Redis memory control mechanisms, including maxmemory configuration, three key expiration deletion strategies (lazy/active/scheduled).
This is article 48 in the Big Data series. This article provides an in-depth analysis of Redis communication protocol RESP and Reactor-based event-driven architecture.
Flink's Window mechanism is the core bridge between unbounded stream processing and bounded batch-style computation.
Sliding window is a generalized form of the fixed (tumbling) window, achieving dynamic window movement through a slide interval. It is defined by two key parameters: window size and slide interval.
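A pure-Python illustration (not the Flink API) of how those two parameters determine which windows an event falls into:

```python
# When slide < size, every event belongs to size/slide overlapping windows.
def sliding_windows(ts, size, slide):
    """Return (start, end) pairs of all windows containing timestamp ts."""
    last_start = ts - (ts % slide)   # start of the latest window covering ts
    windows = []
    start = last_start
    while start > ts - size:         # walk back while ts is still inside
        windows.append((start, start + size))
        start -= slide
    return windows

# A 10s window sliding every 5s: each event lands in 10/5 = 2 windows.
print(sliding_windows(ts=12, size=10, slide=5))  # [(10, 20), (5, 15)]
```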
Watermark is a special marker used to tell Flink the progress of events in the data stream.
When using event-time based windows, Flink relies on Watermark to decide when to trigger window computation.
A Flink program consists of multiple Operators (Source, Transformation, Sink).
Based on whether intermediate state is needed, Flink computation can be divided into stateful and stateless: Stateless computation like Map, Filter, FlatMap.
Systematic overview of the five most common Redis cache problems in high-concurrency scenarios: cache penetration, cache breakdown, cache avalanche, hot key, and big key.
Redis optimistic lock in practice: WATCH/MULTI/EXEC mechanism explained, Lua scripts for atomic operations, SETNX+EXPIRE distributed lock from basics to Redisson...
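A hedged redis-py sketch of the WATCH/MULTI/EXEC pattern from that article (the stock key is illustrative):

```python
import redis

r = redis.Redis(decode_responses=True)
r.set("stock:item:1", 10)

def decrement_stock(key):
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)            # WATCH: begin optimistic check
                stock = int(pipe.get(key))
                if stock <= 0:
                    pipe.unwatch()
                    return False
                pipe.multi()               # queue commands as a transaction
                pipe.decr(key)
                pipe.execute()             # aborts if key changed since WATCH
                return True
            except redis.WatchError:
                continue                   # conflict: retry read-modify-write

print(decrement_stock("stock:item:1"))
```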
Broadcast State is an important mechanism in Apache Flink that supports dynamic logic updates in streaming applications.
State Storage Methods: MemoryStateBackend: Stores state in TaskManager's Java memory. Fast but limited (5MB per state default, 10MB per task).
ManagedOperatorState is used to manage non-keyed state, achieving state consistency when operators recover from faults or scale.
Basic Concept of Parallelism In Apache Flink, Parallelism refers to the number of parallel tasks that can run simultaneously for each operator during execution.
Flink CEP (Complex Event Processing) is a core component of Apache Flink, specifically designed for processing complex event streams.
Flink CEP timeout event extraction is a key step in stream processing, used to capture partially matched events that exceed the window time (within) during pattern matching.
This is article 51 in the Big Data series, covering Redis high availability architecture: master-slave replication, Sentinel mode, and distributed lock design.
Systematic introduction to Kafka core architecture: Topic/Partition/Replica model, ISR mechanism, zero-copy optimization, message format and typical use cases.
Flink CEP (Complex Event Processing) is an extension library provided by Apache Flink for real-time complex event processing.
An engineering-oriented quick start for Flink SQL: modern dependencies (the blink planner is no longer used) and a minimum runnable example (MRE).
Scenario: you want high-concurrency, low-latency OLAP without adopting an entire Hadoop/lakehouse stack.
Installing ClickHouse on Ubuntu via the officially recommended keyring + signed-by method, starting it with systemd, and running a self-check.
Using a three-node cluster (h121/122/123) as the example, first complete a cluster connectivity self-check via the system.clusters table.
Scenario: trading off among small datasets, temporary tables, log landing, and multi-table combined reads, where defaulting to MergeTree is often using a cannon to kill a mosquito.
Deep dive into Kafka's three core components: Producer partitioning strategy and the ACK mechanism, Broker Leader/Follower architecture, and Consumer Group partition assignment and rebalancing.
Introduction to the core differences between Kafka 2.x and 3.x, detailed cluster installation steps, ZooKeeper configuration, Broker parameter settings, and how KRaft mode replaces ZooKeeper.
ClickHouse MergeTree key mechanisms: batched writes form parts, merged in the background, with two part formats (Compact/Wide).
ClickHouse MergeTree storage and query path: column files (*.bin), the sparse primary index (primary.idx), and mark files (.mrk/.mrk2).
Covers Kafka daily operations: daemon startup, shell topic management commands, and Java client programming (complete Producer/Consumer code) with key configuration parameters.
Detailed guide on integrating Kafka in Spring Boot projects, including dependency configuration, KafkaTemplate sync/async message sending, and complete @KafkaListener consumer configuration.
Step-by-step Apache Spark distributed computing environment setup, covering download and extraction, environment variables, and slaves/spark-env.sh configuration.
Scenario: Solve two common "quasi-real-time detail table" requirements: deduplication/update and key-based summing.
ClickHouse external data source engine guide: DDL templates, key parameters, and read/write pipelines for ENGINE=HDFS, ENGINE=MySQL, ENGINE=Kafka, and distributed table configuration.
ReplicatedMergeTree relies on ZooKeeper to coordinate replication between multiple instances.
Replica refers to storing the same data on different physical nodes in a distributed system. Its core idea is to improve system reliability through data redundancy.
ClickHouse is a columnar database for OLAP (Online Analytical Processing), favored in big data analysis for its high-speed data processing.
Deep analysis of Kafka Producer initialization, message interception, serialization, partition routing, buffered batch sending, and ACK confirmation: the complete send chain.
Deep dive into Kafka message serialization and partition routing, including complete code for a custom Serializer and Partitioner, mastering precise message routing and efficient serialization.
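That article's code is Java; as a hedged Python analogue, kafka-python exposes the same two extension points. The broker address and routing rule here are assumptions:

```python
import json
import zlib
from kafka import KafkaProducer

def route_by_key(key_bytes, all_partitions, available_partitions):
    # Hypothetical routing rule: a stable CRC32 hash of the serialized key,
    # so the same key always maps to the same partition (per-key ordering).
    if key_bytes is None:
        return available_partitions[0]
    return all_partitions[zlib.crc32(key_bytes) % len(all_partitions)]

producer = KafkaProducer(
    bootstrap_servers="h121:9092",                        # assumed broker
    key_serializer=lambda k: k.encode("utf-8"),           # custom serializers
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    partitioner=route_by_key,                             # custom partitioner
)
producer.send("orders", key="region-east", value={"amount": 99})
producer.flush()
```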
Apache Kudu is an open-source storage engine developed by Cloudera and contributed to Apache Software Foundation.
Apache Kudu's Master/TabletServer architecture, the RowSet (MemRowSet/DiskRowSet) write/read path, MVCC, and the role of Raft consensus in replication and failover.
Apache Kudu Docker Compose quick deployment solution on Ubuntu 22.04 cloud host, covering Kudu Master and Tablet Server components.
A Java client (kudu-client 1.4.0) connects to Apache Kudu with multiple Masters (example ports 7051/7151/7251) and completes the full table-creation flow.
Complete runnable example for Kudu, based on Flink 1.11.1 (Scala 2.12)/Java 11 and kudu-client 1.17.0 (2025 test).
Introduction to the Kafka 0.10 Producer interceptor mechanism, covering the onSend and onAcknowledgement interception points, interceptor chain execution order, and error isolation.
Detailed explanation of the Kafka Consumer Group consumption model, partition assignment strategies, the heartbeat keep-alive mechanism, and tuning practices for key parameters.
Apache Druid real-time OLAP practice: suitable for event detail with time as primary key, sub-second aggregation and high-concurrency self-service analysis.
Scenario: Quickly experience Apache Druid 30.0.0 locally/single-machine, verify real-time and historical queries and console access.
Scenario: 2C4G/2C2G three-node mixed deployment of Druid 30.0.0 alongside Kafka/HDFS/MySQL. Conclusion: it can run on low-spec machines, but the core is DirectMemory and processing-buffer sizing.
Low-memory cluster practice for Apache Druid 30.0.0 on three nodes: provides JVM parameters and runtime.properties tuning.
Deep dive into Kafka Topic, Partition, Consumer Group core mechanisms, covering custom deserialization, offset management and rebalance optimization configuration.
Comprehensive introduction to Kafka Topic operations, including kafka-topics.sh commands, replica assignment strategy principles, and KafkaAdminClient Java API core usage.
Complete practice of Apache Druid real-time Kafka ingestion, using network-traffic JSON as the example and completing ingestion through the Druid console's Streaming/Kafka wizard.
Apache Druid component responsibilities and deployment points from 0.13.0 to the present (2025): the Coordinator manages Segment distribution across Historical nodes.
Apache Druid data storage and high-performance query path: from DataSource/Chunk/Segment layering to columnar storage, Roll-up pre-aggregation, and Bitmap indexes.
Scala Kafka Producer writes order/click data to Kafka Topic (example topic: druid2), continuous ingestion in Druid through Kafka Indexing Service.
Deep dive into the Kafka replica mechanism, including ISR sync set maintenance, the Leader election process, and unclean-election trade-offs between consistency and availability.
Systematic explanation of how Kafka achieves Exactly-Once semantics through idempotent producers and transactions, covering PID/sequence number principle...
This is article 65 in the Big Data series, deeply analyzing Kafka's log storage mechanism.
This is article 66 in the Big Data series, deeply analyzing Kafka's underlying I/O optimization technologies achieving extremely high throughput.
Background, evolution and engineering practice of Apache Kylin, focusing on MOLAP solution implementation path for massive data analysis.
Complete deployment record of Apache Kylin 3.1.1 on Hadoop 2.9.2, Hive 2.3.9, HBase 1.3.1, and Spark 2.4.5 (without-hadoop build).
Apache Kylin is an open-source distributed analysis engine, focused on providing real-time OLAP (Online Analytical Processing) capabilities for big data.
Scenario: using an e-commerce sales fact table, pre-compute aggregations on Kylin to accelerate queries along the "date" dimension.
Systematic overview of big data processing engine evolution from MapReduce to Spark to Flink, analyzing Spark's in-memory computing model, unified ecosystem, and core components.
Apache Kylin 4.0 Cube modeling and query acceleration method: Complete star modeling with fact tables and dimension tables, design dimensions and measures.
Using the date field of a Hive partitioned table as the Partition Date Column, split the Cube into multiple Segments and build incrementally by range to avoid recomputing historical data.
Apache Kylin Segment merge practice tutorial, covering manual MERGE Job flow, continuous Segment requirements, Auto Merge multi-level threshold strategy...
Cuboid pruning optimization: When there are many dimensions, Cuboid count grows exponentially, causing long build time and storage expansion.
Covers Aggregation Group, Mandatory Dimension, Hierarchy Dimension, and Joint Dimension usage trade-offs, and explains the impact of dictionary encoding and RowKey ordering.
Kafka→Kylin real-time OLAP pipeline, providing minute-level aggregation queries for common 2025 business scenarios (e-commerce transactions, user behavior...
This is article 69 in the Big Data series, deeply analyzing RDD, Spark's core data abstraction, its five key features and design principles.
This is article 70 in the Big Data series, comprehensively explaining Spark RDD's three creation methods and practical usage of common Transformation operators.
This article introduces the core capabilities and common practices of Elasticsearch 8.x, Logstash 8.x, and Kibana 8.x.
Elasticsearch is a distributed full-text search engine supporting single-node and cluster deployment. Generally, small companies can start with single-node mode.
Elasticsearch (ES 7.x/8.x) minimum examples for index creation, document CRUD, query by ID, and _search, with response samples and screenshots to quickly run through the basics.
Elasticsearch 7.3.0 three-node cluster deployment practice tutorial, covering directory creation and permission settings.
Introduction to Elasticsearch-Head plugin and Kibana 7.3.0 installation and connectivity points, covering Chrome extension quick access.
This article explains Elasticsearch index CRUD operations and IK analyzer config, covering versions 7.3.0 and 8.15.0.
After creating an index, you need to define constraints on its fields, known as field mapping.
In-depth explanation of core Query DSL usage in Elasticsearch 7.3, focusing on the differences and pitfalls of match, match_phrase, and query_string.
This is article 71 in the Big Data series, introducing Spark cluster core architecture, deployment mode comparisons, and static/dynamic resource management strategies.
This article demonstrates Elasticsearch term-level queries including term, terms, range, exists, prefix, regexp, fuzzy, ids queries, and bool compound queries.
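For reference, a sketch of such a bool query body as a plain Python dict (the index name and the commented client call are assumptions):

```python
# filter clauses skip scoring and are cacheable; must clauses contribute score.
query = {
    "bool": {
        "must": [{"match": {"title": "elasticsearch"}}],
        "filter": [
            {"term": {"status": "published"}},          # exact, not analyzed
            {"range": {"price": {"gte": 10, "lte": 50}}},
        ],
    }
}

# e.g. with elasticsearch-py 8.x (index name assumed):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# resp = es.search(index="products", query=query)
```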
This article details practical usage of Elasticsearch Filter DSL, covering filter query, sort pagination, highlight display and batch operations.
Covers complete practice of Metrics Aggregations and Bucket Aggregations, applicable to common Elasticsearch 7.x / 8.x versions in 2025.
This article details the complete flow for index and document CRUD operations using Elasticsearch 7.3.0 and RestHighLevelClient.
This is article 72 in the Big Data series, systematically reviewing Spark RDD Action operators.
This article deeply analyzes Elasticsearch's inverted index principle based on Lucene, and document read/write flow.
Article details core mechanism of Elasticsearch near real-time search, including Lucene Segment, Memory Buffer, File System Cache, Refresh, Flush and Translog.
Explains why refresh produces a growing number of small segments, and how background segment merging combines them into larger ones while cleaning out deleted documents.
This article details the core data structures of the Elasticsearch inverted index: the Terms Dictionary, Posting List, FST (Finite State Transducer), and SkipList, and how they accelerate lookups.
Breaks down how Elasticsearch concurrency conflicts (read-modify-write inventory deduction) cause write overwrites, and gives an engineering solution using ES optimistic concurrency control.
Doc values: an on-disk columnar data structure generated at indexing time, optimized for sorting, aggregation, and script value access.
Logstash 7 getting started tutorial, covering stdin/file collection, sincedb mechanism and start_position effect conditions, with error quick reference table
Logstash Input plugin comparison, breaking down the technical differences between the JDBC Input and Syslog collection pipelines, their applicable scenarios, and key configs.
Implement distributed WordCount using Spark + Scala and Spark + Java, detailed RDD five-step processing flow, Maven project configuration and spark-submit command.
Deep dive into Spark RDD programming through two classic cases: Monte Carlo distributed Pi estimation, and mutual-friends analysis in social networks with two approaches.
Filter is responsible for parsing, transforming, and filtering events; multiple filters execute in their configured order.
Output is the final stage of Logstash pipeline, responsible for outputting processed data to target system.
Configure Nginx log_format with JSON escaping to output a structured access log (containing @timestamp, request_time, status, request_uri, ua, and other fields).
Filebeat collects Nginx access.log to Kafka, and Logstash consumes, parses embedded JSON by field conditions, enriches metadata, and writes structured logs to Elasticsear...
Master / Data / Coordinating node responsibilities, production role-isolation strategies, and capacity-planning calculations.
Scenario: Offline sync MySQL/HDFS/Hive/OTS/ODPS and other heterogeneous data sources, batch migration and data warehouse ETL.
This is article 75 in the Big Data series, building on basic WordCount with text preprocessing and database persistence to create a near-production word-frequency pipeline.
This is article 76 in the Big Data series, systematically reviewing Spark process communication mechanism, serialization strategy and RDD execution principle.
Tez is an efficient data processing framework running in the Hadoop ecosystem, designed to optimize batch processing and interactive queries.
In a bar, ten almost identical glasses of wine sit on the counter. The boss proposes a game: win and you drink for free; lose and you pay three times the price of the wine.
KNN (K-Nearest Neighbors) algorithm: from Euclidean distance calculation, distance sorting, and top-k voting to function encapsulation, with reproducible Python code.
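A minimal NumPy sketch of those steps on toy data (not the article's dataset):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # 1) Euclidean distance from x to every training sample
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # 2) indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # 3) majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.array([[1, 1], [1, 2], [8, 8], [9, 8]])
y = np.array(["a", "a", "b", "b"])
print(knn_predict(X, y, np.array([2, 1]), k=3))  # -> 'a'
```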
Since being initiated in 2007 by David Cournapeau, scikit-learn (sklearn) has become one of the most important machine learning libraries in the Python ecosystem.
Random train/test splits make evaluation metrics unstable; the engineering solution is K-Fold Cross Validation.
In scikit-learn pipelines, distance-based models like KNN are highly sensitive to inconsistent feature scales. Split first, fit MinMaxScaler only on the training set, then transform both splits.
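A sketch combining both points with scikit-learn (the dataset and k are illustrative): putting the scaler inside a Pipeline means each CV fold fits MinMaxScaler only on its own training split, avoiding leakage.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ("scale", MinMaxScaler()),             # fitted per fold on train data only
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv)
print(scores.mean(), scores.std())         # stable estimate vs a single split
```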
Detailed explanation of the Spark Checkpoint execution flow, its core differences from persist/cache, partitioner strategies, and best practices for iterative algorithms and long-running jobs.
Detailed explanation of Spark broadcast variable working principle, configuration parameters and best practices.
Tree models are a widely used family of supervised learning algorithms, applicable to both classification and regression problems.
Scenario: use information entropy/information gain to explain why a decision tree picks a particular column to split on, and reproduce the "best split column" in Python.
Decision tree is a tree-structured supervised learning model, commonly used for classification and regression tasks.
Complete flow of DecisionTreeClassifier on load_wine dataset from data splitting, model evaluation to decision tree visualization (2026 version).
Common parameters for decision tree pruning (pre-pruning) in engineering: max_depth, min_samples_leaf, min_samples_split, max_features, and min_impurity_decrease.
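A minimal sketch wiring those parameters into scikit-learn's DecisionTreeClassifier (the values are illustrative, not recommendations):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(
    max_depth=4,                 # cap the depth of the tree
    min_samples_split=10,        # don't split nodes smaller than this
    min_samples_leaf=5,          # every leaf keeps at least 5 samples
    max_features="sqrt",         # consider a feature subset per split
    min_impurity_decrease=0.01,  # split only if impurity drops enough
    random_state=42,
).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```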
Confusion matrix (TP, FP, FN, TN) with unified metrics: Accuracy, Precision, Recall (Sensitivity), F1 Measure, ROC curve, AUC value, and practical business interpretation...
Comprehensive explanation of the four core components of a Spark Standalone cluster, the application submission flow, SparkContext internal architecture, and Shuffle evolution history.
Systematic introduction to SparkSQL's evolution history, the core abstractions DataFrame/Dataset, Catalyst optimizer principles, and practical usage of multi-data-source integration.
Linear regression core chain: unify the prediction function in matrix form as y = Xw, treating the parameter vector w as the only unknown.
Hand-writing multivariate linear regression with pandas DataFrame and NumPy matrix multiplication.
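A hedged sketch of that hand-written approach on synthetic data, solving the least-squares problem behind y = Xw via the normal equation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # synthetic design matrix
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

# Closed form: w = (X^T X)^{-1} X^T y. lstsq solves the same least-squares
# problem in a numerically stabler way than forming the explicit inverse.
w_direct = np.linalg.inv(X.T @ X) @ X.T @ y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_direct, w_lstsq)                 # both close to [2.0, -1.0, 0.5]
```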
When using scikit-learn for linear regression, how to handle multicollinearity in least squares method.
Ridge Regression and Lasso Regression are two commonly used regularization methods for linear regression, addressing overfitting and multicollinearity in machine learning.
Logistic Regression (LR) is an important classification algorithm in machine learning.
As C gradually increases, regularization strength decreases, and model performance on both training and test sets trends upward until around C=0.8.
This is article 81 in the Big Data series, comprehensively introducing Spark's three core data abstractions' features, use cases and mutual conversions.
This is article 82 in the Big Data series, systematically introducing SparkSQL Transformation and Action operators with complete test cases.
When using Logistic Regression in Scikit-Learn, max_iter controls maximum iterations affecting model convergence speed and accuracy.
K-Means clustering algorithm, comparing supervised vs unsupervised learning (whether labels Y are needed).
Scenario: Hand-write K-Means using NumPy/Pandas, perform 3-class clustering on Iris.txt and output centroids with clustering results.
K-Means clustering provides an engineering workflow that is 'verifiable, reproducible, and debuggable': first use 2D testSet dataset for algorithm verification.
Scenario: Using sklearn for KMeans clustering, want to explain centroids/loss and use metrics for K selection.
KMeans n_clusters selection method: compute silhouette_score and silhouette_samples over a range of candidate cluster counts.
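A sketch of that selection loop (the candidate range and dataset are assumed): fit KMeans for each candidate k and keep the k with the highest mean silhouette.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # higher is better
```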
Comprehensive guide to SparkSQL core usage including DataFrame API operations, SQL query syntax, lateral view explode, and Hive integration via enableHiveSupport for metastore access.
This is article 84 in the Big Data series, deeply analyzing SparkSQL kernel's Join strategy auto-selection logic and SQL parsing optimization flow.
Scenario: Single-machine deployment of Prometheus 2.53.2, pull node_exporter metrics from multiple hosts and verify Targets status.
Common Prometheus monitoring deployment: install node_exporter-1.8.2 on Rocky Linux to expose host metrics, integrate it into the Prometheus scrape config, and visualize in Grafana.
For ops/devs still using CentOS/RHEL (including compatible distributions) in 2026, provides a Grafana 11.3.0 (grafana-enterprise-11.3.0-1.x86_64.rpm) installation guide.
In 1988, IBM first introduced the concept of "Information Warehouse" when facing increasingly scattered enterprise information systems and growing data silo problems.
Scenario: the more data marts departments build on their own, the more definitions diverge and interfaces disconnect, forming data silos and exploding data query costs.
In data warehouse architecture, Fact Table is the core table structure that stores business process metric values or facts.
This is article 85 in the Big Data series, introducing the architecture and evolution background of Spark's two generations of streaming frameworks.
Comprehensive explanation of three Spark Streaming basic data sources: file stream directory monitoring, Socket TCP ingestion, RDD queue stream for testing simulation.
Complete guide to building offline data warehouses: data tracking, metric system design, theme analysis, and standardization practices for e-commerce teams.
Offline Data Warehouse (Offline DW) overall architecture design and implementation: framework selection comparison between the Apache community edition and commercial distributions.
Scenario: use startup logs/event logs in the offline data warehouse to count new users, active users (DAU/WAU/MAU), and retention.
Flume 1.9.0 tuning guide for offline data warehouse log collection to HDFS, covering batch parameters, channel capacity and transaction sizing, JVM heap tuning...
Systematically review Spark Streaming DStream stateless transformation operators and the advanced transform operation, demonstrating three implementation approaches for blacklist filtering.
In-depth explanation of Spark Streaming stateful computing: window operation parameter configuration, reduceByKeyAndWindow hot-word statistics, and updateStateByKey full-state computation.
Using TAILDIR Source to monitor multiple directories (start/event), with filegroups headers marking each source with a logtype.
Apache Flume offline log collection using Taildir Source and a custom Interceptor to extract JSON timestamps, mark headers, and route HDFS partitions by event time.
Engineering implementation of the ODS (Operational Data Store) layer in an offline data warehouse: the minimal closed loop of Hive external tables plus daily partition loading.
This is article 89 in the Big Data series, deeply comparing two core modes of Spark Streaming integration with Kafka, focusing on Direct mode production practices.
JSON data processing in Hive offline data warehouse, covering three most common needs: 1) Extract array fields from JSON strings and explode-expand in SQL
This article introduces using Hive to build an offline data warehouse for calculating active members (daily/weekly/monthly).
The offline data warehouse computes "new members" daily, providing a consistently defined data foundation for subsequent "member retention" analysis.
Systematic introduction to Apache Flink's origin, core features, and architecture components: JobManager, TaskManager, Dispatcher responsibilities, unified stream-batch p...
Implementation of "member retention" in the offline data warehouse: the DWS layer uses the dws_member_retention_day table, joining new-member and startup detail tables to compute retention.
The landing path for exporting Hive ADS-layer tables to MySQL in the offline data warehouse, with the typical DataX solution: hdfsreader -> mysqlwriter.
This article demonstrates a complete offline data warehouse pipeline from log collection to member metric analysis.
Apache Flink is a distributed stream processing framework widely used for real-time data computing scenarios.
Detailed explanation of the three Flink deployment modes on a YARN cluster: Session, Application, and Per-Job, plus Hadoop dependency configuration and the YARN resource application flow.
This is a practical verification article for member theme and advertising business pipeline based on Hadoop + Hive + HDFS + DataX + MySQL.
This article introduces completing parsing, cleaning, and detail modeling from ODS to DWD for offline data warehouse based on advertising events in tracking logs.
Field reference for the ad tracking log: action: user behavior (0 = impression; 1 = click after impression; 2 = purchase). duration: stay duration. shopid: merchant id. eventtype: "ad". adtype: creative format (1 = JPG; 2 = PNG).
This is article 93 in the Big Data series, introducing Flink DataStream API core concepts and program structure.
Comprehensive analysis of Flink Window mechanism: tumbling windows, sliding windows, session windows, Watermark principle and generation strategies, late data processing...
Using Flume Agent to collect event logs and write to HDFS, then use Hive scripts to complete ODS and DWD layer data loading by date.
Complete solution for exporting Hive ADS layer data to MySQL using DataX.
Focusing on three main metrics (order count, product count, payment amount), with breakdown analysis by sales region and product type (3-level category).
Flink stateful computation explanation: Keyed State, Operator State, Checkpoint configuration, Savepoint backup and recovery, production environment practices.
Scenario: three core e-commerce transaction tables are loaded daily and incrementally into the offline warehouse ODS layer, partitioned by dt. Conclusion: DataX with MySQLReader + HDFSWriter.
Sync MySQL data to specified HDFS directory via DataX, then create ODS external tables in Hive with unified dt string partitioning.
Slowly Changing Dimensions (SCD) are dimension attributes that change slowly over time in the real world (slow relative to fact tables, whose data changes much faster).
userinfo (partitioned table): userid, mobile, regdate; holds daily changed data (modified + new) plus historical data (first day). userhis (zipper table): the same fields plus two additional validity-period fields.
This article provides a practical guide to Hive zipper tables for preserving order history states at low cost in offline data warehouses, supporting daily backtracking an...
This article continues the zipper table practice, focusing on order history state incremental refresh.
First, determine which are fact tables and which are dimension tables: green indicates fact tables, gray indicates dimension tables.
The order table is a periodic fact table; zipper tables can be used to retain order status history. Order detail tables are regular fact tables.
Apache Airflow is an open-source task scheduling and workflow management platform, primarily used for developing, debugging, and monitoring data pipelines.
Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks.
Flink Broadcast State explanation: BroadcastState principles, dynamic rule updates, state partitioning and memory management, demonstrating the connection of broadcast and non-broadcast streams.
Flink State Backend detailed explanation: HashMapStateBackend, EmbeddedRocksDBStateBackend selection, memory configuration and performance tuning.
Linux systems use the cron (crond) service for scheduled tasks, which is enabled by default.
Flink memory model detailed explanation: Network Buffer Pool, Task Heap, State Backend memory allocation, GC tuning and backpressure handling.
Flink parallelism detailed explanation: Operator Chaining, Slot allocation strategy, parallelism settings and resource scheduling principle.
Metadata, in its narrowest sense, refers to data that describes other data.
Flink CEP detailed explanation: pattern sequence, individual patterns, combined patterns, matching skip strategies and practical cases.
Metadata (MetaData) in the narrow sense refers to data that describes data.
Apache Griffin is an open-source big data quality solution that supports both batch and streaming data quality detection.
Livy is a REST interface for Apache Spark designed to simplify Spark job submission and management, especially in big data processing scenarios.
Apache Griffin is an open-source data quality management framework designed to help organizations monitor and improve data quality in big data environments.
Real-time data processing capability has become a key competitive factor for enterprises.
A real-time data warehouse differs from traditional batch-processing warehouses by emphasizing low latency and high throughput.
Alibaba B2B's cross-region business between domestic sellers and overseas buyers drove the need for data synchronization between Hangzhou and US data centers.
Canal is an open-source tool for MySQL database binlog incremental subscription and consumption.
MySQL's Binary Log (binlog) is a log file type in MySQL that records all change operations performed on the database (excluding SELECT and SHOW queries).
Canal is an open-source data synchronization tool from Alibaba for MySQL database incremental log parsing and synchronization.
This article introduces Alibaba's open-source Canal tool, which implements Change Data Capture (CDC) by parsing MySQL binlog.
In internet companies, common ODS data includes business log data (Log) and business DB data.
Linear Regression is an analytical method that uses a regression equation to model the relationship between one or more independent variables and a dependent variable.
Kafka is a distributed streaming platform for high-throughput message passing. In ETL processes, Kafka serves as a data message queue or stream processing source.
Migrating the original MySQL area table to HBase: transform it into region ID, region name, city ID, city name, province ID, and province name fields, and write them to HBase.
Logistic Regression is a classification model in machine learning. Despite having "regression" in its name, it is a classification algorithm.
Linear regression uses regression equations to model relationships between independent and dependent variables.
This article introduces the basic principles, application scenarios, and Spark MLlib implementation of logistic regression.
This article introduces the basic concepts and classification principles of decision trees.
This article systematically introduces decision tree pre-pruning and post-pruning principles, compares core differences between three mainstream algorithms.
This article systematically introduces ensemble learning methods in machine learning.
Introduces the differences between Bagging and Boosting in machine learning, and the GBDT (Gradient Boosting Decision Tree) algorithm principles.
This article introduces the principles and applications of gradient boosting tree (GBDT) algorithm.
GBDT practical case study walking through the complete process from residual calculation to regression tree construction and iterative training.