Tag: Flink
44 articles
AI Investigation #50: Big Data Evolution - Two Decades of Architectural Change from Hadoop to Flink
Two decades of big data evolution: from MapReduce batch processing (2006), through Spark in-memory computing (2013), to Flink real-time stream computing (2019).
Big Data 268 - Real-time Warehouse ODS Layer: Writing Kafka Dimension Tables into DIM
Kafka is a distributed streaming platform for high-throughput message passing. In ETL processes, Kafka serves as a data message queue or stream processing source.
Big Data 269 - Real-time Warehouse DIM, DW and ADS: Scala Pipelines to HBase
Syncing the original MySQL area table to HBase: flatten the area table into region ID, region name, city ID, city name, province ID, and province name, then write the result to HBase.
Big Data #266: Canal Integration with Kafka - Real-time Data Sync
This article introduces Alibaba's open-source Canal tool, which implements Change Data Capture (CDC) by parsing MySQL binlog.
Big Data 267 - Real-Time Warehouse ODS: Lambda and Kappa Architecture
In internet companies, common ODS data includes business log data (Log) and business DB data.
Big Data 261 - Real-Time Warehouse Business Table Structure
A real-time data warehouse differs from a traditional batch-processing data warehouse by emphasizing low latency and high throughput.
Flink CEP: Complex Event Processing & Pattern Matching
Flink CEP detailed explanation: pattern sequence, individual patterns, combined patterns, matching skip strategies and practical cases.
Flink Memory Management: Network Buffer, State Backend & Memory Model
Flink memory model detailed explanation: Network Buffer Pool, Task Heap, State Backend memory allocation, GC tuning and backpressure handling.
Big Data 99 - Flink Parallelism: Operator Chaining, Slot and Resource Scheduling
Flink parallelism detailed explanation: Operator Chaining, Slot allocation strategy, parallelism settings and resource scheduling principle.
Big Data 96 - Flink Broadcast State: BroadcastState Practice and Rule Updates
Flink Broadcast State explained: BroadcastState principles, dynamic rule updates, state partitioning and memory management, with a demonstration of connecting broadcast and non-broadcast streams.
Big Data 97 - Flink State Backend: State Storage and Performance Optimization
Flink State Backend detailed explanation: HashMapStateBackend, EmbeddedRocksDBStateBackend selection, memory configuration and performance tuning.
Big Data 95 - Flink State and Checkpoint: State Management, Fault Tolerance and Savepoints
Flink stateful computation explanation: Keyed State, Operator State, Checkpoint configuration, Savepoint backup and recovery, production environment practices.
Big Data 93 - Flink Streaming Introduction: DataStream API and Program Structure
This is article 93 in the Big Data series, introducing Flink DataStream API core concepts and program structure.
Flink Window and Watermark: Time Windows, Tumbling/Sliding/Session
Comprehensive analysis of Flink's Window mechanism: tumbling, sliding, and session windows, Watermark principles and generation strategies, and late data processing.
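As a quick illustration of the tumbling-window mechanism this article covers, the window an event falls into is determined by its timestamp alone via modular arithmetic. A minimal Python sketch (not Flink API; a simplified version assuming non-negative timestamps):

```python
def tumbling_window(timestamp_ms: int, size_ms: int, offset_ms: int = 0):
    """Return (start, end) of the tumbling window containing the event.

    Sketch of the window-start arithmetic; assumes non-negative
    timestamps and offset. Not Flink API.
    """
    start = timestamp_ms - ((timestamp_ms - offset_ms) % size_ms)
    return start, start + size_ms

# An event at t=125_000 ms with 60 s tumbling windows lands in [120_000, 180_000)
print(tumbling_window(125_000, 60_000))  # (120000, 180000)
```

Because the window boundaries are pure functions of the timestamp, every parallel instance assigns events to the same windows without coordination.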
Big Data 91 - Flink Installation & Deployment: Local, Standalone and YARN Modes
Apache Flink is a distributed stream processing framework widely used for real-time data computing scenarios.
Flink on YARN Deployment: Environment Preparation, Resource Manager
Detailed explanation of the three Flink deployment modes on a YARN cluster (Session, Application, and Per-Job), Hadoop dependency configuration, and YARN resource requests.
Big Data 90 - Apache Flink Introduction: Unified Stream-Batch Real-Time Computing
A systematic introduction to Apache Flink's origin, core features, and architecture components: JobManager, TaskManager, and Dispatcher responsibilities, plus unified stream-batch processing.
Big Data 148 - Flink Write to Kudu: Custom Sink Full Practice
A complete, runnable example of writing to Kudu, based on Flink 1.11.1 (Scala 2.12) / Java 11 and kudu-client 1.17.0 (tested in 2025).
Big Data 131 - Flink CEP Practice: 24 Hours ≥5 Transactions & 10 Minutes Unpaid Detection Cases
Flink CEP (Complex Event Processing) is an extension library provided by Apache Flink for real-time complex event processing.
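The first case in this article (at least 5 transactions within 24 hours) boils down to a per-key count over a sliding time range. A self-contained Python sketch of that detection logic (the class and method names are illustrative, not the Flink CEP Pattern API):

```python
from collections import defaultdict, deque

DAY_MS = 24 * 60 * 60 * 1000

class HighFrequencyDetector:
    """Flag a user once they make >= threshold transactions within window_ms.

    Plain-Python sketch of the detection rule behind the CEP case;
    not Flink CEP API.
    """
    def __init__(self, threshold: int = 5, window_ms: int = DAY_MS):
        self.threshold = threshold
        self.window_ms = window_ms
        self.events = defaultdict(deque)  # user_id -> ascending timestamps (ms)

    def on_transaction(self, user_id: str, ts_ms: int) -> bool:
        q = self.events[user_id]
        q.append(ts_ms)
        # Evict timestamps that fell out of the 24 h window.
        while q and q[0] <= ts_ms - self.window_ms:
            q.popleft()
        return len(q) >= self.threshold

det = HighFrequencyDetector()
hits = [det.on_transaction("u1", i * 3_600_000) for i in range(6)]
print(hits)  # [False, False, False, False, True, True]
```

In Flink CEP the same rule would be expressed declaratively with a `Pattern` plus `within(Time.hours(24))`; the sketch just makes the underlying sliding-count semantics explicit.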
Big Data 132 - Flink SQL Quick Start | Table API + SQL in 3 Minutes with toChangelogStream New Syntax
An engineering-focused quick start for Flink SQL: modern dependencies (no separate Blink planner artifact required) and a minimal runnable example (MRE).
Flink CEP Deep Dive: Complex Event Processing Complete Guide
Flink CEP (Complex Event Processing) is a core component of Apache Flink, specifically designed for processing complex event streams.
Flink CEP Timeout Event Extraction: Complete Guide with Matched and Timed-out Events
Flink CEP timeout event extraction is a key step in stream processing, used to capture partially matched events that exceed the pattern's window time (within) during matching.
Flink StateBackend Deep Dive: Memory, Fs, RocksDB and Operator State
ManagedOperatorState is used to manage non-keyed state, achieving state consistency when operators recover from faults or scale.
Flink Parallelism Deep Dive: From Concepts to Best Practices
Basic concept: in Apache Flink, parallelism refers to the number of parallel tasks each operator runs simultaneously during execution.
Big Data 125 - Flink Broadcast State: Dynamic Logic Updates in Real-Time Streaming
Broadcast State is an important mechanism in Apache Flink that supports dynamic logic updates in streaming applications.
Big Data 126 - Flink State Backend: Memory, Fs, RocksDB and Performance Differences
State storage options: MemoryStateBackend keeps state in the TaskManager's Java heap; it is fast but limited (5 MB per state by default, 10 MB per task).
Flink Parallelism Setting Priority: Principles, Configuration and Tuning
A Flink program consists of multiple Operators (Source, Transformation, Sink).
Big Data 124 - Flink State: Keyed State, Operator State and KeyGroups
Based on whether intermediate state is needed, Flink computations divide into stateful and stateless; stateless computations include Map, Filter, and FlatMap.
Big Data 121 - Flink Time Semantics: EventTime, ProcessingTime, IngestionTime & Watermarks
A Watermark is a special marker that tells Flink how far event time has progressed in the data stream.
Big Data 122 - Flink Watermark Guide: Event Time, Out-of-Order Data and Late Events
When using event-time based windows, Flink relies on Watermark to decide when to trigger window computation.
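To make this trigger rule concrete: with bounded out-of-orderness, the watermark trails the maximum event time seen, and a window fires once the watermark passes its end. A small Python sketch of the rule (not Flink API; names are illustrative):

```python
class BoundedOutOfOrdernessWatermark:
    """Toy model of a bounded-out-of-orderness watermark generator.

    Sketch of the rule: watermark = max event time seen so far
    minus the allowed out-of-orderness. Not Flink API.
    """
    def __init__(self, max_out_of_orderness_ms: int):
        self.delay = max_out_of_orderness_ms
        self.max_ts = float("-inf")

    def on_event(self, event_ts_ms: int) -> float:
        self.max_ts = max(self.max_ts, event_ts_ms)
        return self.max_ts - self.delay  # current watermark

    def window_fires(self, window_end_ms: int) -> bool:
        # An event-time window [start, end) fires once the watermark
        # reaches end - 1: no earlier event is expected anymore.
        return self.max_ts - self.delay >= window_end_ms - 1

wm = BoundedOutOfOrdernessWatermark(2_000)
wm.on_event(9_500)              # watermark = 7_500
print(wm.window_fires(10_000))  # False: window [0, 10_000) still waiting
wm.on_event(12_100)             # watermark = 10_100
print(wm.window_fires(10_000))  # True: the window can now fire
```

Note how the out-of-order event budget (2 s here) directly trades latency for completeness: a larger delay tolerates later events but postpones every window firing.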
Flink Window Complete Guide: Tumbling, Sliding, Session
Flink's Window mechanism is the core bridge between stream processing and batch processing in its unified architecture.
Flink Sliding Window Deep Dive: Principles, Use Cases and Implementation
A sliding window is a generalized form of the fixed (tumbling) window: introducing a slide interval lets windows move and overlap. It is defined by two key parameters: window size and slide interval.
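A sliding window is defined by two parameters, window size and slide interval, and those alone determine which windows an event belongs to. A minimal Python sketch of the assignment rule (not Flink API):

```python
def assign_sliding_windows(ts: int, size: int, slide: int):
    """List the (start, end) sliding windows containing an event at ts.

    Sketch of the assignment rule, assuming ts >= 0 and size % slide == 0;
    with overlap, each event lands in size // slide windows. Not Flink API.
    """
    last_start = ts - (ts % slide)  # latest window starting at or before ts
    windows = []
    start = last_start
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return windows

# ts=7 with size=10, slide=5 falls into two overlapping windows
print(assign_sliding_windows(7, 10, 5))  # [(5, 15), (0, 10)]
```

The `size // slide` overlap factor is also why sliding windows multiply state: each element is buffered once per window it belongs to.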
Flink JDBC Sink Deep Dive: MySQL Real-time Write, Batch Output and Retry
In Apache Flink, JDBC Sink is an important data output component that allows writing stream or batch processed data to relational databases through JDBC connections.
Flink Batch Processing DataSet API: Use Cases, Code Examples and Core Operators
Apache Flink's DataSet API is the core programming interface for Flink batch processing, specifically designed for processing static, bounded datasets.
Big Data 115 - Flink DataStream Transformation: Map, FlatMap and Filter
Flink provides rich operators for DataStream to support flexible data stream processing in different scenarios.
Big Data 116 - Flink Sink Usage Guide: Types, Fault Tolerance Semantics & Scenarios
Flink's Sink is the final output endpoint for data stream processing, used to write processed results to external systems or storage media.
Flink Source Operator Deep Dive: Non-Parallel Source Principles
A Non-Parallel Source is a source operation in Flink with a fixed parallelism of 1: regardless of cluster scale it runs as a single instance, ensuring records are produced by one task.
Flink SourceFunction to RichSourceFunction: Enhanced Source Lifecycle and Resource Management
RichSourceFunction and RichParallelSourceFunction are enhanced source functions suitable for scenarios requiring complex logic and resource management.
Big Data 111 - Flink on YARN Deployment: Environment Variables, Configuration & Resource Requests
Deploying Flink in YARN mode requires completing a series of environment configuration and cluster management operations.
Flink DataStream API: DataSource, Transformation and Sink Components
A DataStream program has three components: DataSource, Transformation, and Sink. DataSource provides diverse data input methods, including file systems, message queues, databases, and custom sources.
Flink Architecture Deep Dive: JobManager, TaskManager and Client
Flink's runtime architecture adopts typical Master/Slave pattern with clear division of responsibilities among core components.
Big Data 110 - Flink Installation and Deployment Guide: Local, Standalone and YARN
Flink provides multiple installation modes to suit different scenarios.
Apache Flink Deep Dive: From Origin to Technical Features
Apache Flink is an open-source big data stream processing framework, supporting efficient computation of unbounded stream and bounded batch data.
Big Data 108 - Flink Stream-Batch Integration: Concepts & WordCount Practice
Definition: Stream processing means real-time processing of continuously flowing data streams.