Tag: Flink
44 articles
AI Investigation #50: Big Data Evolution - Two Decades of Architectural Change from Hadoop to Flink
Two decades of big data evolution: from MapReduce batch processing (2006), through Spark in-memory computing (2013), to Flink real-time stream computing (2019).
Big Data 268 - Real-time Warehouse ODS Layer: Writing Kafka Dimension Tables into DIM
Kafka is a distributed streaming platform for high-throughput message passing. In ETL processes, Kafka serves as a data message queue or stream processing source.
Big Data 269 - Real-time Warehouse DIM, DW and ADS: Scala Pipelines to HBase
Syncing the original MySQL area table to HBase: flatten the area table into region ID, region name, city ID, city name, province ID, and province name, then write the result to HBase.
Big Data #266: Canal Integration with Kafka - Real-time Data Sync
This article introduces Alibaba's open-source Canal tool, which implements Change Data Capture (CDC) by parsing MySQL binlog.
Big Data 267 - Real-Time Warehouse ODS: Lambda and Kappa Architecture
In internet companies, common ODS data includes business log data (Log) and business DB data.
Big Data 261 - Real-Time Warehouse Business Table Structure
A real-time data warehouse differs from a traditional batch-processing data warehouse by emphasizing low latency and high throughput.
Flink CEP: Complex Event Processing & Pattern Matching
Flink CEP detailed explanation: pattern sequence, individual patterns, combined patterns, matching skip strategies and practical cases.
Flink Memory Management: Network Buffer, State Backend & Memory Model
Flink memory model detailed explanation: Network Buffer Pool, Task Heap, State Backend memory allocation, GC tuning and backpressure handling.
Big Data 99 - Flink Parallelism: Operator Chaining, Slot and Resource Scheduling
Flink parallelism detailed explanation: Operator Chaining, Slot allocation strategy, parallelism settings and resource scheduling principle.
Big Data 96 - Flink Broadcast State: BroadcastState Practice and Rule Updates
Flink Broadcast State explained: BroadcastState principles, dynamic rule updates, state partitioning and memory management, with a demonstration of connecting broadcast and non-broadcast streams.
Big Data 97 - Flink State Backend: State Storage and Performance Optimization
Flink State Backend detailed explanation: HashMapStateBackend, EmbeddedRocksDBStateBackend selection, memory configuration and performance tuning.
Big Data 95 - Flink State and Checkpoint: State Management, Fault Tolerance and Savepoints
Flink stateful computation explanation: Keyed State, Operator State, Checkpoint configuration, Savepoint backup and recovery, production environment practices.
Big Data 93 - Flink Streaming Introduction: DataStream API and Program Structure
This is article 93 in the Big Data series, introducing Flink DataStream API core concepts and program structure.
Flink Window and Watermark: Time Windows, Tumbling/Sliding/Session
Comprehensive analysis of Flink's Window mechanism: tumbling, sliding, and session windows, Watermark principles and generation strategies, and late data processing.
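As a quick illustration of the tumbling-window mechanism this article covers, the window an event falls into is determined by its timestamp alone via modular arithmetic. A minimal Python sketch (not Flink API; a simplified version assuming non-negative timestamps):

```python
def tumbling_window(timestamp_ms: int, size_ms: int, offset_ms: int = 0):
    """Return (start, end) of the tumbling window containing the event.

    Sketch of the window-start arithmetic; assumes non-negative
    timestamps and offset. Not Flink API.
    """
    start = timestamp_ms - ((timestamp_ms - offset_ms) % size_ms)
    return start, start + size_ms

# An event at t=125_000 ms with 60 s tumbling windows lands in [120_000, 180_000)
print(tumbling_window(125_000, 60_000))  # (120000, 180000)
```

Because the window boundaries are pure functions of the timestamp, every parallel instance assigns events to the same windows without coordination.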
Big Data 91 - Flink Installation & Deployment: Local, Standalone and YARN Modes
Apache Flink is a distributed stream processing framework widely used for real-time data computing scenarios.
Flink on YARN Deployment: Environment Preparation, Resource Manager
Detailed explanation of the three Flink deployment modes on a YARN cluster (Session, Application, and Per-Job), Hadoop dependency configuration, and YARN resource requests.
Big Data 90 - Apache Flink Introduction: Unified Stream-Batch Real-Time Computing
A systematic introduction to Apache Flink's origin, core features, and architecture components: JobManager, TaskManager, and Dispatcher responsibilities, plus unified stream-batch processing.
Big Data 148 - Flink Write to Kudu: Custom Sink Full Practice
A complete, runnable example of writing to Kudu, based on Flink 1.11.1 (Scala 2.12) / Java 11 and kudu-client 1.17.0 (tested in 2025).
Big Data 131 - Flink CEP Practice: 24 Hours ≥5 Transactions & 10 Minutes Unpaid Detection Cases
Flink CEP (Complex Event Processing) is an extension library provided by Apache Flink for real-time complex event processing.
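The first case in this article (at least 5 transactions within 24 hours) boils down to a per-key count over a sliding time range. A self-contained Python sketch of that detection logic (the class and method names are illustrative, not the Flink CEP Pattern API):

```python
from collections import defaultdict, deque

DAY_MS = 24 * 60 * 60 * 1000

class HighFrequencyDetector:
    """Flag a user once they make >= threshold transactions within window_ms.

    Plain-Python sketch of the detection rule behind the CEP case;
    not Flink CEP API.
    """
    def __init__(self, threshold: int = 5, window_ms: int = DAY_MS):
        self.threshold = threshold
        self.window_ms = window_ms
        self.events = defaultdict(deque)  # user_id -> ascending timestamps (ms)

    def on_transaction(self, user_id: str, ts_ms: int) -> bool:
        q = self.events[user_id]
        q.append(ts_ms)
        # Evict timestamps that fell out of the 24 h window.
        while q and q[0] <= ts_ms - self.window_ms:
            q.popleft()
        return len(q) >= self.threshold

det = HighFrequencyDetector()
hits = [det.on_transaction("u1", i * 3_600_000) for i in range(6)]
print(hits)  # [False, False, False, False, True, True]
```

In Flink CEP the same rule would be expressed declaratively with a `Pattern` plus `within(Time.hours(24))`; the sketch just makes the underlying sliding-count semantics explicit.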
Big Data 132 - Flink SQL Quick Start | Table API + SQL in 3 Minutes with toChangelogStream New Syntax
An engineering-focused quick start for Flink SQL: modern dependencies (no separate Blink planner artifact required) and a minimal runnable example (MRE).
Flink CEP Deep Dive: Complex Event Processing Complete Guide
Flink CEP (Complex Event Processing) is a core component of Apache Flink, specifically designed for processing complex event streams.
Flink CEP Timeout Event Extraction: Complete Guide with Matched and Timed-out Events
Flink CEP timeout event extraction is a key step in stream processing, used to capture partially matched events that exceed the pattern's window time (within) during matching.
Flink StateBackend Deep Dive: Memory, Fs, RocksDB and Operator State
ManagedOperatorState is used to manage non-keyed state, achieving state consistency when operators recover from faults or scale.
Flink Parallelism Deep Dive: From Concepts to Best Practices
Basic concept: in Apache Flink, parallelism refers to the number of parallel tasks each operator runs simultaneously during execution.
Big Data 125 - Flink Broadcast State: Dynamic Logic Updates in Real-Time Streaming
Broadcast State is an important mechanism in Apache Flink that supports dynamic logic updates in streaming applications.
Big Data 126 - Flink State Backend: Memory, Fs, RocksDB and Performance Differences
State storage options: MemoryStateBackend keeps state in the TaskManager's Java heap; it is fast but limited (5 MB per state by default, 10 MB per task).
Flink Parallelism Setting Priority: Principles, Configuration and Tuning
A Flink program consists of multiple Operators (Source, Transformation, Sink).
Big Data 124 - Flink State: Keyed State, Operator State and KeyGroups
Based on whether intermediate state is needed, Flink computations divide into stateful and stateless; stateless computations include Map, Filter, and FlatMap.
Big Data 121 - Flink Time Semantics: EventTime, ProcessingTime, IngestionTime & Watermarks
A Watermark is a special marker that tells Flink how far event time has progressed in the data stream.
Big Data 122 - Flink Watermark Guide: Event Time, Out-of-Order Data and Late Events
When using event-time based windows, Flink relies on Watermark to decide when to trigger window computation.
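To make this trigger rule concrete: with bounded out-of-orderness, the watermark trails the maximum event time seen, and a window fires once the watermark passes its end. A small Python sketch of the rule (not Flink API; names are illustrative):

```python
class BoundedOutOfOrdernessWatermark:
    """Toy model of a bounded-out-of-orderness watermark generator.

    Sketch of the rule: watermark = max event time seen so far
    minus the allowed out-of-orderness. Not Flink API.
    """
    def __init__(self, max_out_of_orderness_ms: int):
        self.delay = max_out_of_orderness_ms
        self.max_ts = float("-inf")

    def on_event(self, event_ts_ms: int) -> float:
        self.max_ts = max(self.max_ts, event_ts_ms)
        return self.max_ts - self.delay  # current watermark

    def window_fires(self, window_end_ms: int) -> bool:
        # An event-time window [start, end) fires once the watermark
        # reaches end - 1: no earlier event is expected anymore.
        return self.max_ts - self.delay >= window_end_ms - 1

wm = BoundedOutOfOrdernessWatermark(2_000)
wm.on_event(9_500)              # watermark = 7_500
print(wm.window_fires(10_000))  # False: window [0, 10_000) still waiting
wm.on_event(12_100)             # watermark = 10_100
print(wm.window_fires(10_000))  # True: the window can now fire
```

Note how the out-of-order event budget (2 s here) directly trades latency for completeness: a larger delay tolerates later events but postpones every window firing.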
Flink Window Complete Guide: Tumbling, Sliding, Session
Flink's Window mechanism is the core bridge between stream processing and batch processing in its unified architecture.
Flink Sliding Window Deep Dive: Principles, Use Cases and Implementation
A sliding window is a generalized form of the fixed (tumbling) window: introducing a slide interval lets windows move and overlap. It is defined by two key parameters: window size and slide interval.
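A sliding window is defined by two parameters, window size and slide interval, and those alone determine which windows an event belongs to. A minimal Python sketch of the assignment rule (not Flink API):

```python
def assign_sliding_windows(ts: int, size: int, slide: int):
    """List the (start, end) sliding windows containing an event at ts.

    Sketch of the assignment rule, assuming ts >= 0 and size % slide == 0;
    with overlap, each event lands in size // slide windows. Not Flink API.
    """
    last_start = ts - (ts % slide)  # latest window starting at or before ts
    windows = []
    start = last_start
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return windows

# ts=7 with size=10, slide=5 falls into two overlapping windows
print(assign_sliding_windows(7, 10, 5))  # [(5, 15), (0, 10)]
```

The `size // slide` overlap factor is also why sliding windows multiply state: each element is buffered once per window it belongs to.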
Flink JDBC Sink Deep Dive: MySQL Real-time Write, Batch Output and Retry
In Apache Flink, JDBC Sink is an important data output component that allows writing stream or batch processed data to relational databases through JDBC connections.
Flink Batch Processing DataSet API: Use Cases, Code Examples and Core Operators
Apache Flink's DataSet API is the core programming interface for Flink batch processing, specifically designed for processing static, bounded datasets.
Big Data 115 - Flink DataStream Transformation: Map, FlatMap and Filter
Flink provides rich operators for DataStream to support flexible data stream processing in different scenarios.
Big Data 116 - Flink Sink Usage Guide: Types, Fault Tolerance Semantics & Scenarios
Flink's Sink is the final output endpoint for data stream processing, used to write processed results to external systems or storage media.
Flink Source Operator Deep Dive: Non-Parallel Source Principles
A Non-Parallel Source is a source operation in Flink with a fixed parallelism of 1: regardless of cluster scale it runs as a single instance, ensuring records are produced by one task.
Flink SourceFunction to RichSourceFunction: Enhanced Source Lifecycle and Resource Management
RichSourceFunction and RichParallelSourceFunction are enhanced source functions suitable for scenarios requiring complex logic and resource management.
Big Data 111 - Flink on YARN Deployment: Environment Variables, Configuration & Resource Requests
Deploying Flink in YARN mode requires completing a series of environment configuration and cluster management operations.
Flink DataStream API: DataSource, Transformation and Sink Components
A DataStream program has three components: DataSource, Transformation, and Sink. DataSource provides diverse data input methods, including file systems, message queues, databases, and custom sources.
Flink Architecture Deep Dive: JobManager, TaskManager and Client
Flink's runtime architecture adopts typical Master/Slave pattern with clear division of responsibilities among core components.
Big Data 110 - Flink Installation and Deployment Guide: Local, Standalone and YARN
Flink provides multiple installation modes to suit different scenarios.
Apache Flink Deep Dive: From Origin to Technical Features
Apache Flink is an open-source big data stream processing framework, supporting efficient computation of unbounded stream and bounded batch data.
Big Data 108 - Flink Stream-Batch Integration: Concepts & WordCount Practice
Definition: Stream processing means real-time processing of continuously flowing data streams.