Tag: Flink
44 articles
AI Investigation #50: Big Data Evolution - Two Decades of Architecture Transformation from Hadoop Batch to Flink Real-time Computing
Two decades of big data evolution: from 2006 MapReduce batch processing to 2013 Spark in-memory computing, to 2019 Flink real-time computing. Architecture evolved from monolithic Hadoop to YARN multi-engine, then to cloud-native Kubernetes.
Big Data #268: Real-time Warehouse ODS Layer - Writing Kafka Dimension Tables to DIM
Writing dimension tables (DIM) from Kafka typically involves reading real-time or batch data from Kafka topics and updating dimension tables based on the data...
Big Data #269: Real-time Warehouse DIM, DW and ADS Layer Processing
DW (Data Warehouse layer) is built from DWD, DWS, and DIM layer data, completing data architecture and integration, establishing consistent dimensions, and...
Big Data #266: Canal Integration with Kafka - Real-time Data Warehouse
This article introduces Alibaba's open-source Canal tool, which implements Change Data Capture (CDC) by parsing MySQL binlog. Demonstrates how to integrate...
Realtime Warehouse - ODS Lambda Architecture Kappa Architecture Core Concepts
In internet companies, common ODS data includes business log data (Log) and business DB data. For business DB data, collecting data from relational databases...
Realtime Warehouse - Business Database Table Structure: Trade Orders, Order Products, Product Categories, Merchant Stores, Regional Organization Tables
A realtime data warehouse is a data warehouse system that differs from traditional batch-processing data warehouses by emphasizing low latency, high throughput,...
Flink CEP: Complex Event Processing & Pattern Matching
A detailed explanation of Flink CEP: pattern sequences, individual patterns, combined patterns, after-match skip strategies and practical cases.
Flink Memory Management: Network Buffer, State Backend & GC Tuning
A detailed explanation of the Flink memory model: Network Buffer Pool, Task Heap, State Backend memory allocation, GC tuning and backpressure handling.
Flink Parallelism: Operator Chaining, Slot & Resource Scheduling
A detailed explanation of Flink parallelism: Operator Chaining, Slot allocation strategy, parallelism settings and resource scheduling principles.
Flink Broadcast State: BroadcastState Practice & Rule Updates
An explanation of Flink Broadcast State: how BroadcastState works, dynamic rule updates, state partitioning and memory management, with cases demonstrating how to join a broadcast stream with a non-broadcast stream.
Flink State Backend: State Storage & Performance Optimization
A detailed explanation of the Flink State Backend: choosing between HashMapStateBackend and EmbeddedRocksDBStateBackend, memory configuration and performance tuning.
Flink State and Checkpoint: State Management, Fault Tolerance & Savepoint
An explanation of stateful computation in Flink: Keyed State, Operator State, Checkpoint configuration, Savepoint backup and recovery, and production environment practices.
Flink Streaming Introduction: DataStream API & Program Structure
A getting-started guide to the Flink DataStream API: program execution flow, environment acquisition, data source definition, operator chaining and execution mode details, with a WordCount case demonstrating how to develop a stream processing program.
Flink Window and Watermark: Time Windows, Tumbling/Sliding, Session Windows & Late Data Processing
Comprehensive analysis of Flink Window mechanism: tumbling windows, sliding windows, session windows, Watermark principle and generation strategies, late data processing mechanism.
Flink Installation & Deployment: Local, Standalone, YARN Modes
Complete tutorial for Apache Flink installation and deployment in three modes: Local, Standalone cluster, and YARN integration, including environment configuration, parameter tuning, and common issue solutions.
Flink on YARN Deployment: Environment Preparation, Resource Application & Job Submission
Detailed explanation of three Flink deployment modes on YARN cluster: Session, Application, Per-Job modes, Hadoop dependency configuration, YARN resource application and job submission process.
Apache Flink Introduction: Unified Stream-Batch Real-Time Computing Engine
Systematic introduction to Apache Flink's origin, core features, and architecture components: JobManager, TaskManager, Dispatcher responsibilities, unified stream-batch processing model, and comparison with Spark Streaming for technology selection.
Flink Write to Kudu Practice: Custom Sink Full Process (Flink 1.11/Kudu 1.17/Java 11)
Complete runnable example for Kudu, based on Flink 1.11.1 (Scala 2.12)/Java 11 and kudu-client 1.17.0 (2025 test). Through RichSinkFunction custom sink,...
Flink CEP Practice: 24 Hours ≥5 Transactions & 10 Minutes Unpaid Detection Cases
An in-depth explanation of the Flink CEP (Complex Event Processing) mechanism, using real cases to illustrate its application principles and practical...
Flink SQL Quick Start | Table API + SQL in 3 Minutes with toChangelogStream
Engineering perspective to quickly run Flink SQL: Provides modern dependencies (no longer using blink planner), minimum runnable example (MRE), Table API and...
Flink CEP Deep Dive: Complex Event Processing Complete Guide
Flink CEP is the core component for real-time analysis of complex event streams in Flink, providing a complete pattern matching framework, supporting...
Flink CEP Timeout Event Extraction: Complete Guide with Malicious Login Detection Case
Flink CEP timeout event extraction is a key step in stream processing, used to capture partially matched events that exceed the window time (within) during pattern...
Flink StateBackend Deep Dive: Memory, Fs, RocksDB & OperatorState Management
Managed Operator State is used to manage non-keyed state, keeping state consistent when operators recover from failures or are rescaled. Developers can use it by implementing the CheckpointedFunction interface, which supports two data structures: ListState and BroadcastState.
Flink Parallelism Deep Dive: From Concepts to Best Practices
In Flink, Parallelism is the core parameter measuring task concurrent processing capability, determining the number of tasks that can run simultaneously for...
Flink Broadcast State: Dynamic Logic Updates in Real-time Stream Computing
Broadcast State is an important mechanism in Apache Flink that supports dynamic logic updates in streaming applications, widely used in real-time risk control,...
Flink State Backend: Memory, Fs, RocksDB & Performance Differences
State Storage (State Backend) is the core mechanism for implementing stateful stream computing in Flink, determining data reliability, performance and fault...
Flink Parallelism Setting Priority: Principles, Configuration & Best Practices
A Flink program consists of multiple Operators (Source, Transformation, Sink). An Operator is executed by multiple parallel Tasks (threads), and the number of...
Flink State: Keyed State, Operator State & KeyGroups Working Principles
Based on whether intermediate state is needed, Flink computation can be divided into stateful and stateless: Stateless computation like Map, Filter, FlatMap...
Flink Time Semantics: EventTime, ProcessingTime, IngestionTime & Watermark Mechanism
Watermark is a special marker used to tell Flink the progress of events in the data stream. Simply put, Watermark is the 'current time' estimated by Flink in...
Flink Watermark Complete Guide: Event Time Window, Out-of-Order & Late Data
Flink's Watermark mechanism is one of the most core concepts in event time window computation, used for handling out-of-order events and ensuring accurate...
Flink Window Complete Guide: Tumbling, Sliding, Session
Flink's Window mechanism is the core bridge between stream processing and unified batch processing architecture. Flink treats batch as a special case of stream processing, using time windows (Tumbling, Sliding, Session) and count windows to split infinite streams into finite datasets.
Flink Sliding Window Deep Dive: Principles, Use Cases & Implementation Examples
Sliding Window is one of the core mechanisms in Apache Flink stream processing, more flexible than fixed windows, widely used in real-time monitoring, anomaly...
Flink JDBC Sink Deep Dive: MySQL Real-time Write, Batch Optimization & Best Practices
JDBC Sink is one of the most commonly used data output components, often used to write stream and batch processing results to relational databases like MySQL,...
Flink Batch Processing DataSet API: Use Cases, Code Examples & Optimization Mechanisms
Flink's DataSet API is the core programming interface for batch processing, designed for processing static, bounded datasets, supporting TB to PB scale big...
Flink DataStream Transformation: Map, FlatMap, Filter to Window Complete Guide
Flink provides rich operators for DataStream to support flexible data stream processing in different scenarios. Common operators include Map, FlatMap and...
Flink Sink Usage Guide: Types, Fault Tolerance Semantics & Use Cases
Flink's Sink is the final output endpoint for data stream processing, used to write processed results to external systems or storage media. It is the endpoint of streaming applications, determining how data is saved, transmitted or consumed.
Flink Source Operator Deep Dive: Non-Parallel Source Principles & Use Cases
Non-Parallel Source is a source operation in Flink with fixed parallelism of 1. It can only run in a single instance regardless of cluster scale, ensuring tasks are processed sequentially.
Flink SourceFunction to RichSourceFunction: Enhanced Source Functions & Practical Examples
RichSourceFunction and RichParallelSourceFunction are enhanced source functions suitable for scenarios requiring complex logic and resource management.
Flink on YARN Deployment: Environment Variables, Configuration & Resource Application
Deploying Flink in YARN mode requires completing a series of environment configuration and cluster management operations. First, configure environment...
Flink DataStream API: DataSource, Transformation & Sink Complete Guide
The DataStream API is organized around three parts: DataSource, Transformation and Sink. DataSource provides diverse data input methods, including file systems, message queues, databases and custom data sources.
Flink Architecture Deep Dive: JobManager, TaskManager & Core Roles Overview
Flink's runtime architecture adopts a typical Master/Slave pattern with a clear division of responsibilities among core components. JobManager as Master is...
Flink Installation & Deployment Guide: Local, Standalone & YARN Modes
Flink provides multiple installation modes to suit different scenarios. Local mode is suitable for personal learning and small-scale debugging with simple...
Apache Flink Deep Dive: From Origin to Technical Features
Apache Flink is an open-source big data stream processing framework, supporting efficient computation of unbounded stream and bounded batch data. With 'unified...
Flink Stream-Batch Integration Introduction: Concept Analysis & WordCount Code Practice
Apache Flink supports both stream processing and batch processing. Stream processing is suitable for real-time data like sensors, logs or trading streams,...