Tag: Big Data (大数据)
284 articles
AI Research #131: Java 17/21/25 Complete Comparison
Java 17 (2021), Java 21 (2023), Java 25 (2025) language and JVM changes, covering Virtual Threads (Project Loom), Records/Pattern Matching (Project Amber),...
AI Investigation #54: Big Data Industry Applications and Technology Selection Trends
Big data has achieved deep integration in finance, e-commerce, internet, communications, manufacturing, healthcare, education and other industries, becoming the core engine for business innovation.
AI Investigation #53: Big Data Talent Landscape - Experience Distribution, Growth Paths and Industry Trends
The talent structure in the big data industry shows characteristics of youth and rapid growth. The 25-30 age group is the main force, while 30-35 year-olds are gradually becoming the core strength.
AI Investigation #52: Big Data Technology Landscape - Lakehouse, Data Mesh, Serverless and Emerging Trends
Big data technology is undergoing a new wave of transformation. Lakehouse architecture combines the advantages of data lakes and data warehouses. Data Mesh...
AI Investigation #51: Big Data Technology Evolution - Obsolete Frameworks and Architectures
Big data technology evolution: MapReduce replaced by Spark, Storm replaced by Flink, Pig/Hive gradually phased out. This article analyzes why these technologies were eliminated and the technical reasoning behind the evolution.
AI Investigation #50: Big Data Evolution - Two Decades of Architecture Transformation from Hadoop Batch to Flink Real-time Computing
Two decades of big data evolution: from 2006 MapReduce batch processing to 2013 Spark in-memory computing, to 2019 Flink real-time computing. Architecture evolved from monolithic Hadoop to YARN multi-engine, then to cloud-native Kubernetes.
AI Research #49: Big Data Survey Report - Development History from Concept Birth to Diversified Ecosystem (1997-2025)
Big data development began in 1997, when NASA proposed the concept. From 2003 to 2006, Google published its three landmark papers on GFS, MapReduce, and Bigtable, leading the distributed computing revolution. Hadoop was born in 2005 and became an Apache top-level project in 2008, forming a complete ecosystem.
Spark MLlib GBDT Case Study: Residual Calculation to Regression Tree Iteration
GBDT practical case study walking through the complete process from residual calculation to regression tree construction and iterative training. Covers GBDT...
Spark MLlib: Bagging vs Boosting Differences and GBDT Gradient Boosting Trees
Introduces the differences between Bagging and Boosting in machine learning, and the GBDT (Gradient Boosting Decision Tree) algorithm principles. Main content:...
Spark MLlib GBDT Algorithm: Gradient Boosting Principles, Negative Gradient and XGBoost
This article introduces the principles and applications of gradient boosting tree (GBDT) algorithm. First explains boosting tree basic concept through simple examples, then details algorithm flow including negative gradient calculation, regression tree fitting, and model update steps.
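To make the negative-gradient step concrete: under squared loss the negative gradient is simply the residual y - F(x), so one boosting round fits a regression tree to residuals and adds it to the model. A minimal pure-Python sketch (the toy data and the single-split `fit_stump` helper are illustrative, not from the article):

```python
# One gradient-boosting round for squared loss (toy example).
# Under L = (y - F)^2 / 2, the negative gradient equals the residual y - F.

def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (stump) to the residuals by brute force."""
    best = None
    for threshold in xs:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.0, 3.1, 2.9]

f0 = sum(ys) / len(ys)                   # initial model F0: the global mean
residuals = [y - f0 for y in ys]         # negative gradient under squared loss
tree = fit_stump(xs, residuals)          # fit a regression tree to residuals
lr = 1.0
preds = [f0 + lr * tree(x) for x in xs]  # model update: F1 = F0 + lr * tree
```

Each further round would recompute residuals against the updated model and fit another tree, which is the iteration the article walks through.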
Spark MLlib Ensemble Learning: Random Forest, Bagging and Boosting Explained
This article systematically introduces ensemble learning methods in machine learning. Main content includes: 1) Basic definition and classification of ensemble...
Spark MLlib Decision Tree Pruning: Pre-pruning, Post-pruning and ID3 vs C4.5 vs CART
This article systematically introduces decision tree pre-pruning and post-pruning principles, compares core differences between three mainstream algorithms...
Spark MLlib Decision Tree: Classification Principles, Gini Coefficient and Entropy
This article introduces the basic concepts and classification principles of decision trees. Decision tree is a non-linear...
Spark MLlib Logistic Regression: Input Function, Sigmoid, Loss and Diabetes Prediction
This article introduces the basic principles, application scenarios, and implementation in Spark MLlib of logistic regression. Logistic regression is an efficient binary classification algorithm widely used in fields such as ad click-through rate prediction and spam email identification.
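The two building blocks named here, the Sigmoid function and the log loss it is trained against, are small enough to write out. A minimal sketch (illustrative, not the article's Spark MLlib code):

```python
import math

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Binary cross-entropy for one sample: the loss logistic regression minimizes."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

p = sigmoid(0.0)       # a score of 0 maps to probability 0.5
loss = log_loss(1, p)  # ln 2 ≈ 0.693 for a 50/50 prediction of a positive
```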
Spark MLlib Linear Regression: Scenarios, Loss Function and Gradient Descent Optimization
Linear regression uses regression equations to model relationships between independent and dependent variables. This article covers regression scenarios (house...
Big Data #268: Real-time Warehouse ODS Layer - Writing Kafka Dimension Tables to DIM
Writing dimension tables (DIM) from Kafka typically involves reading real-time or batch data from Kafka topics and updating dimension tables based on the data...
Big Data #269: Real-time Warehouse DIM, DW and ADS Layer Processing
DW (Data Warehouse layer) is built from DWD, DWS, and DIM layer data, completing data architecture and integration, establishing consistent dimensions, and...
Spark MLlib Logistic Regression: Sigmoid, Loss Function and Diabetes Prediction Case
Logistic regression is a classification model in machine learning — an efficient binary classification algorithm widely used in ad click-through rate...
Big Data #266: Canal Integration with Kafka - Real-time Data Warehouse
This article introduces Alibaba's open-source Canal tool, which implements Change Data Capture (CDC) by parsing MySQL binlog. Demonstrates how to integrate...
Realtime Warehouse - ODS: Lambda and Kappa Architecture Core Concepts
In internet companies, common ODS data includes business log data (Log) and business DB data. For business DB data, collecting data from relational databases...
Spark MLlib Linear Regression: Scenarios, Loss Function and Gradient Descent
Linear Regression is an analytical method that uses regression equations to model the relationship between one or more independent variables and a dependent...
Canal Deployment: Installation, Service Startup and Common Issues
Canal is an open-source data synchronization tool from Alibaba for MySQL database incremental log parsing and synchronization. It simulates the MySQL slave...
Canal Working Principle: Workflow and MySQL Binlog Introduction
Canal is an open-source tool for MySQL database binlog incremental subscription and consumption, primarily used for data synchronization and distributed...
MySQL Binlog Deep Dive: Storage Directory, Change Records and Canal Configuration
MySQL's Binary Log (binlog) is a log file type in MySQL that records all change operations performed on the database (excluding SELECT and SHOW queries). It is...
Canal Data Sync: Introduction, Background, Principles and Advantages
Alibaba B2B's cross-region business between domestic sellers and overseas buyers drove the need for data synchronization between Hangzhou and US data centers.
Realtime Warehouse - Business Database Table Structure: Trade Orders, Order Products, Product Categories, Merchant Stores, Regional Organization Tables
Realtime data warehouse is a data warehouse system that differs from traditional batch processing data warehouses by emphasizing low latency, high throughput,...
Real-time Data Warehouse: Background, Architecture, Requirements and Technology Selection
Real-time data processing capability has become a key competitive factor for enterprises. Initially, each new requirement spawned a separate real-time task,...
Apache Griffin Configuration: pom.xml, sparkProperties and Build Startup
Apache Griffin is an open-source data quality management framework designed to help organizations monitor and improve data quality in big data environments.
Big Data #258: Griffin with Livy - Architecture, Installation and Hadoop/Hive Configuration
Livy is a REST interface for Apache Spark designed to simplify Spark job submission and management, especially in big data processing scenarios. Its main...
Big Data #257: Data Quality Monitoring - Monitoring Methods and Griffin Architecture
Apache Griffin is an open-source big data quality solution that supports both batch and streaming data quality detection. It can measure data assets from...
Flink CEP: Complex Event Processing & Pattern Matching
Flink CEP detailed explanation: pattern sequence, individual patterns, combined patterns, matching skip strategies and practical cases.
Big Data #256: Atlas Installation - Service Startup, Web Access and Hive Lineage Import
Metadata (MetaData) in the narrow sense refers to data that describes data. Broadly, all information beyond business data used to maintain system operation can...
Big Data #255: Atlas Data Warehouse Metadata Management - Data Lineage and Metadata
Atlas is a metadata framework for the Hadoop platform: a set of scalable core governance services enabling enterprises to effectively meet compliance...
Flink Memory Management: Network Buffer, State Backend & GC Tuning
Flink memory model detailed explanation: Network Buffer Pool, Task Heap, State Backend memory allocation, GC tuning and backpressure handling.
Flink Parallelism: Operator Chaining, Slot & Resource Scheduling
Flink parallelism detailed explanation: Operator Chaining, Slot allocation strategy, parallelism settings and resource scheduling principle.
Airflow Core Trade Task Scheduling Integration for Offline Data Warehouse
Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks. Originally developed by Airbnb...
Airflow Core Concepts: DAG, Operators, Tasks and Python Script Writing
Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks. Originally developed by Airbnb...
Airflow Crontab Scheduling: Introduction, Task Integration and Getting Started
Linux systems use the cron (crond) service for scheduled tasks, which is enabled by default. Linux also provides the crontab command for user-level task...
Offline Data Warehouse ADS Layer and Airflow Task Scheduling System
Apache Airflow is an open-source task scheduling and workflow management platform, primarily used for developing, debugging, and monitoring data pipelines.
Apache Airflow Installation and Deployment for Offline Data Warehouse
Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks. Originally developed by Airbnb...
Flink Broadcast State: BroadcastState Practice & Rule Updates
Flink Broadcast State explanation: BroadcastState principle, dynamic rule updates, state partitioning and memory management, demonstrating broadcast stream and non-broadcast stream join through cases.
Flink State Backend: State Storage & Performance Optimization
Flink State Backend detailed explanation: HashMapStateBackend, EmbeddedRocksDBStateBackend selection, memory configuration and performance tuning.
Offline Data Warehouse - Hive Order Zipper Table: Incremental Refresh Implementation
This article continues the zipper table practice, focusing on order history state incremental refresh. It explains how to use ODS daily incremental table + DWD zipper table to preserve order historical states at low cost while supporting daily backtracking and change analysis.
Offline Data Warehouse Dimension Tables: Product Category, Region Organization, Product Information
First determines fact tables vs dimension tables: green indicates fact tables, gray indicates dimension tables. Dimension table processing strategies vary by...
Offline Data Warehouse DWD and DWS Layer: Table Creation and ETL Scripts
The order table is a periodic fact table; zipper tables can be used to retain order status history. Order detail tables are regular fact tables. Order statuses...
Offline Data Warehouse - Hive Zipper Table Practice: Order History State Incremental Refresh
This article provides a practical guide to Hive zipper tables for preserving order history states at low cost in offline data warehouses, supporting daily backtracking and change analysis. It covers incremental refresh of order status changes using 2020 order data as a case study.
Offline Data Warehouse - Hive Zipper Table Practice: Initialization, Incremental Update, Rollback Script
This article provides a practical guide to Hive zipper table implementation for offline data warehouse modeling, covering initial loading, daily incremental updates, historical version chain closing, Shell scheduling scripts, and rollback recovery logic.
Offline Data Warehouse - Hive Zipper Table Getting Started: SCD Types, Table Creation and Loading
This article systematically covers Slowly Changing Dimensions (SCD), detailing the core differences between SCD Type 0, 1, 2, 3, 4, and 6, and explains the applicable boundaries of snapshot tables and zipper tables in Hive offline data warehouse scenarios.
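The close-and-open mechanics of an SCD Type 2 zipper table can be sketched outside Hive. A toy Python model (the `apply_increment` helper and the sentinel end date are illustrative, not the article's SQL):

```python
OPEN_END = "9999-12-31"  # sentinel end_date marking the currently valid version

def apply_increment(chain, day_rows, dt):
    """Apply one day's changed rows (key -> new value) to an SCD Type 2 chain.

    chain: list of dicts with key, value, start_date, end_date.
    For each changed key, close the open version and open a new one at dt.
    """
    changed = dict(day_rows)
    out = []
    for row in chain:
        if row["end_date"] == OPEN_END and row["key"] in changed:
            row = {**row, "end_date": dt}  # close the current version at dt
        out.append(row)
    for key, value in changed.items():
        out.append({"key": key, "value": value,
                    "start_date": dt, "end_date": OPEN_END})
    return out

chain = [{"key": "order1", "value": "created",
          "start_date": "2020-01-01", "end_date": OPEN_END}]
chain = apply_increment(chain, {"order1": "paid"}, "2020-01-02")
current = [r for r in chain if r["end_date"] == OPEN_END]  # latest state only
```

History stays queryable: filtering on `start_date <= d < end_date` reconstructs any day's snapshot, which is exactly the backtracking property zipper tables buy.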
Offline Data Warehouse: Hive ODS Layer Table Creation and Partition Loading Practice
Sync MySQL data to specified HDFS directory via DataX, then create ODS external tables in Hive with unified dt string partitioning. Enables fast queries of raw transaction records within 7 days, demonstrating core characteristics of ODS layer.
Offline Data Warehouse: E-commerce Core Transaction Incremental Import (DataX - HDFS - Hive Partition)
Using DataX (MySQLReader + HDFSWriter) to extract daily incremental data from MySQL order tables, order detail tables, and product information tables into...
Offline Data Warehouse Practice: E-commerce Core Transaction Data Model & MySQL Source Table Design
Focusing on three main metrics: order count, product count, payment amount, breakdown analysis dimensions by sales region and product type (3-level category).
Flink State and Checkpoint: State Management, Fault Tolerance & Savepoint
Flink stateful computation explanation: Keyed State, Operator State, Checkpoint configuration, Savepoint backup and recovery, production environment practices.
Offline Data Warehouse Advertising Business Hive ADS Practice: DataX Export HDFS Partition Table to MySQL
Complete solution for exporting Hive ADS layer data to MySQL using DataX. Covers ADS loading, DataX configuration, MySQL table creation, Shell script parameterized execution, and common error diagnosis and fix checklist.
Offline Data Warehouse Advertising Business: Flume Import Logs to HDFS, Complete Hive ODS/DWD Layer Loading
Using Flume Agent to collect event logs and write to HDFS, then use Hive scripts to complete ODS and DWD layer data loading by date. Content covers Flume Agent's Source, Channel, Sink basic structure, log file upload, Flume startup command, HDFS write verification.
Offline Data Warehouse Advertising Business Hive Analysis Practice: ADS Click-Through Rate, Purchase Rate and Top100 Ranking
Implementation of advertising impression, click, purchase hourly statistics based on Hive offline data warehouse, completing CTR, CVR and advertising effect...
Flink Streaming Introduction: DataStream API & Program Structure
Flink DataStream API getting started guide, program execution flow, environment acquisition, data source definition, operator chaining and execution mode details, demonstrating stream processing program development through WordCount case.
Flink Window and Watermark: Time Windows, Tumbling/Sliding, Session Windows & Late Data Processing
Comprehensive analysis of Flink Window mechanism: tumbling windows, sliding windows, session windows, Watermark principle and generation strategies, late data processing mechanism.
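Flink generates watermarks internally, but the bounded-out-of-orderness rule (watermark = maximum event time seen minus the allowed lateness) is easy to illustrate. A toy Python sketch of the idea, not Flink's API:

```python
class BoundedOutOfOrderness:
    """Toy watermark generator: watermark = max event time seen - max_delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_ts = float("-inf")

    def on_event(self, event_ts):
        # Out-of-order (smaller) timestamps never move the watermark backwards.
        self.max_ts = max(self.max_ts, event_ts)
        return self.current_watermark()

    def current_watermark(self):
        return self.max_ts - self.max_delay

wm = BoundedOutOfOrderness(max_delay=2)
marks = [wm.on_event(ts) for ts in [10, 12, 11, 15]]  # event 11 arrives late
# A window ending at time 13 would only fire once the watermark reaches 13,
# i.e. after the event with timestamp 15 is seen.
```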
Offline Data Warehouse Hive Advertising Business Practice: ODS→DWD Event Parsing, Advertising Detail and Conversion Analysis
Hive offline data warehouse advertising business practice, combined with typical pipeline of Flume + Hive + UDF + Parquet, demonstrates how to map raw event...
Offline Data Warehouse Member Metrics Verification, DataX Export and Advertising Business ODS/DWD/ADS Full Process
Offline data warehouse practice based on Hadoop + Hive + HDFS + DataX + MySQL, covering member metrics testing (active/new/retention), HDFS export, DataX sync to MySQL, and advertising business ODS/DWD/ADS full process modeling.
Offline Data Warehouse Practice: Flume+HDFS+Hive Building ODS/DWD/DWS/ADS Member Analysis Pipeline
Demonstrates a complete pipeline from log collection to member metric analysis, covering Flume Taildir monitoring, HDFS partition storage, Hive external table loading, ODS/DWD/DWS/ADS layered processing, supporting active members, new members, member retention metrics calculation.
Flink Installation & Deployment: Local, Standalone, YARN Modes
Complete tutorial for Apache Flink installation and deployment in three modes: Local, Standalone cluster, and YARN integration, including environment configuration, parameter tuning, and common issue solutions.
Flink on YARN Deployment: Environment Preparation, Resource Application & Job Submission
Detailed explanation of three Flink deployment modes on YARN cluster: Session, Application, Per-Job modes, Hadoop dependency configuration, YARN resource application and job submission process.
Offline Data Warehouse Hive ADS Export MySQL DataX Practice: Configuration and Pitfalls
The landing path for exporting Hive ADS layer tables to MySQL in an offline data warehouse. Gives the typical DataX solution (hdfsreader -> mysqlwriter), focusing on DataX JSON configuration and common error fixes.
Offline Data Warehouse Retention Rate Implementation: DWS Detail Modeling + ADS Aggregation
Implementation of 'Member Retention' in offline data warehouse: DWS layer uses dws_member_retention_day table to join new member and startup detail tables to...
Offline Data Warehouse Hive New Member & Retention: DWS Detail + ADS Summary One-Pass
The offline data warehouse calculates 'new members' daily, providing a consistently defined data foundation for the subsequent 'member retention' metric. It uses the full member table (with first-seen dt) as the deduplication anchor; the DWS layer outputs new-member details and the ADS layer outputs the new-member count.
Apache Flink Introduction: Unified Stream-Batch Real-Time Computing Engine
Systematic introduction to Apache Flink's origin, core features, and architecture components: JobManager, TaskManager, Dispatcher responsibilities, unified stream-batch processing model, and comparison with Spark Streaming for technology selection.
Offline Data Warehouse Hive Practice: DWD to DWS Daily/Weekly/Monthly Active Member ADS Metrics Implementation
This article introduces using Hive to build an offline data warehouse for calculating active members (daily/weekly/monthly). Covers the complete flow from DWD...
Hive ODS Layer JSON Parsing: UDF Array Extraction, explode and JsonSerDe
JSON data processing in Hive offline data warehouse, covering three most common needs: 1) Extract array fields from JSON strings and explode-expand in SQL; 2)...
Hive ODS Layer Practice: External Table Partition Loading and JSON Parsing
Engineering implementation of the ODS (Operational Data Store) layer in an offline data warehouse: the minimal closed loop of Hive external table + daily...
Spark Streaming Integration with Kafka: Receiver and Direct Mode Complete Analysis
Detailed explanation of two Spark Streaming integration modes with Kafka: Receiver-based high-level API vs Direct mode architecture differences, offset management, Exactly-Once semantics guarantee, and complete Scala code implementation.
Flume Taildir + Custom Interceptor: Extract JSON Timestamp to Write HDFS Daily Partitions
An engineering implementation of Apache Flume's offline log collection chain: use a Taildir Source to monitor multiple directories and multiple...
Flume 1.9.0 Custom Interceptor: TAILDIR Multi-directory Collection with logtime/logtype HDFS Partitioning
Using TAILDIR Source to monitor multiple directories (start/event), using filegroups headers to mark different sources with logtype; then using custom...
Flume Optimization for Offline Data Warehouse: batchSize, Channel, Compression and OOM Fix
Flume 1.9.0 engineering optimization in offline data warehouse (log collection→HDFS) scenarios, giving actionable value ranges and trade-off principles for key...
Spark DStream Transformation Operators: map, reduceByKey, transform Practice
Systematically review Spark Streaming DStream stateless transformation operators and transform advanced operations, demonstrating three implementation approaches for blacklist filtering: leftOuterJoin, SQL, and broadcast variables.
Spark Streaming Window Operations & State Tracking: updateStateByKey and mapWithState
In-depth explanation of Spark Streaming stateful computing: window operation parameter configuration, reduceByKeyAndWindow hot word statistics, updateStateByKey full-state maintenance and mapWithState incremental optimization, with complete Scala code.
Offline Data Warehouse Member Metrics Practice
Offline data warehouse member metrics practice with Flume, covering new members, active members, and retention metrics.
How to Build an Offline Data Warehouse: Tracking → Metric System → Theme Analysis, Full Chain Guide
Complete guide to building offline data warehouses: data tracking, metric system design, theme analysis, and standardization practices for e-commerce teams.
Offline Data Warehouse Architecture Selection and Cluster Sizing: Apache vs CDH/HDP, Full Component List + Naming Standards
Offline Data Warehouse (Offline DW) overall architecture design and implementation method: Framework selection comparison between Apache community version and...
Offline Data Warehouse Layering: ODS/DWD/DWS/DIM/ADS Architecture Guide
When implementing an Offline Data Warehouse in an enterprise, the two most common problems are: data silos caused by data mart expansion, and duplicate...
Offline Data Warehouse Modeling Practice
Offline data warehouse modeling practice, covering fact tables, dimension tables, fact types, and snowflake/galaxy models.
Spark Streaming Introduction: From DStream to Structured Streaming Evolution
Introduction to Spark's two generations of real-time computing frameworks: DStream micro-batch processing model's architecture and limitations, and how Structured Streaming solves EventTime processing and API consistency issues through unbounded table model and Catalyst optimization.
Spark Streaming Data Sources: File Stream, Socket, RDD Queue Stream
Comprehensive explanation of three Spark Streaming basic data sources: file stream directory monitoring, Socket TCP ingestion, RDD queue stream for testing simulation, with complete Scala code examples.
Grafana 11.3.0 Installation & Startup: YUM Install RPM, systemd Management, Login & Common Pitfalls
For ops/devs still running CentOS/RHEL (or compatible distributions) in 2026, provides a Grafana 11.3.0 (grafana-enterprise-11.3.0-1.x86_64.rpm) direct YUM...
Data Warehouse Introduction: Four Characteristics, OLTP vs OLAP Differences & Enterprise Data Warehouse Architecture
An engineering-practice view (2026) of core data warehouse concepts and implementation concerns: starting from enterprise data silos, explaining the four...
Prometheus 2.53.2 Installation & Configuration Practice: Scrape Targets, Exporter, Alert Chain & Common Troubleshooting
Prometheus 2.53.2 (still common in existing environments in 2025/2026) provides a reusable deployment process: download and extract binary on monitoring...
Prometheus Node Exporter 1.8.2 + Pushgateway 1.10.0: Download, Start, Integration & Pitfalls
A common Prometheus monitoring deployment scenario: install node_exporter-1.8.2 on Rocky Linux (CentOS/RHEL compatible) to expose host metrics, integrate with Prometheus...
sklearn KMeans Key Attributes & Evaluation: cluster_centers_, inertia_, Silhouette Score for K Selection
A 2026 guide to scikit-learn (sklearn) KMeans, explaining its three most commonly used attributes: cluster_centers_ (cluster centers), inertia_ (within-cluster sum of squares),...
KMeans n_clusters Selection: Silhouette Score Practice + init/n_init/random_state Version Pitfalls (scikit-learn 1.4+)
KMeans n_clusters selection method: calculate silhouette_score and silhouette_samples on candidate cluster numbers (e.g., 2/4/6/8), determine optimal k by...
SparkSQL Statements: DataFrame Operations, SQL Queries & Hive Integration
Comprehensive guide to SparkSQL core usage including DataFrame API operations, SQL query syntax, lateral view explode, and Hive integration via enableHiveSupport for metadata and table operations.
SparkSQL Kernel: Five Join Strategies & Catalyst Optimizer Analysis
Deep dive into SparkSQL's five Join execution strategies (BHJ, SHJ, SMJ, Cartesian, BNLJ) selection conditions and use cases, along with the complete processing flow of Catalyst optimizer from SQL parsing to code generation.
Python Hand-Written K-Means Clustering on Iris Dataset: From Distance Function to Iterative Convergence & Pitfalls
Python K-Means clustering implementation: using NumPy broadcasting to compute squared Euclidean distance (distEclud), initializing centroids via uniform...
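The broadcasting trick behind `distEclud` fits in a few lines of NumPy (the function name follows the summary; the sample points and centroids are illustrative):

```python
import numpy as np

def dist_eclud(points, centroids):
    """Squared Euclidean distance from every point to every centroid.

    points: (n, d), centroids: (k, d) -> (n, k) distance matrix.
    Broadcasting: (n, 1, d) - (1, k, d) expands to (n, k, d).
    """
    diff = points[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=2)

points = np.array([[0.0, 0.0], [3.0, 4.0]])
centroids = np.array([[0.0, 0.0], [3.0, 0.0]])
d = dist_eclud(points, centroids)
labels = d.argmin(axis=1)  # assignment step of one K-Means iteration
```

The update step (recomputing each centroid as the mean of its assigned points) then alternates with this assignment until the labels stop changing.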
K-Means Clustering Practice: Self-Implemented Algorithm Verification + sklearn KMeans Parameters/labels_/fit_predict Quick Guide
K-Means clustering provides an engineering workflow that is 'verifiable, reproducible, and debuggable': first use 2D testSet dataset for algorithm verification...
Scikit-Learn Logistic Regression Implementation: max_iter, Classification Method & Multiclass Optimization
When using Logistic Regression in Scikit-Learn, max_iter controls the maximum number of iterations, which affects convergence speed and accuracy. If training doesn't...
K-Means Clustering Guide: From Unsupervised Concepts to Inertia, K Selection & Pitfalls
K-Means clustering algorithm, comparing supervised vs unsupervised learning (whether labels Y are needed), with engineering applications in customer...
Deep Understanding of Logistic Regression & Gradient Descent Optimization Algorithm
Logistic Regression (LR) is an important classification algorithm in machine learning, widely used in binary classification tasks like sentiment analysis,...
How to Implement Logistic Regression in Scikit-Learn and Regularization Detailed (L1 and L2)
As C gradually increases, regularization strength decreases, and model performance on both training and test sets trends upward until around C=0.8, when training...
SparkSQL Core Abstractions: RDD, DataFrame, Dataset & SparkSession
Deep comparison of Spark's three data abstractions RDD, DataFrame, Dataset features and use cases, introduction to SparkSession unified entry, and demonstration of mutual conversion methods between abstractions.
SparkSQL Operators: Transformation & Action Operations
Systematically review SparkSQL Transformation and Action operators, covering select, filter, join, groupBy, union operations, with practical test cases demonstrating usage and performance optimization.
How to Handle Multicollinearity: Common Problems & Solutions in Least Squares Linear Regression
When using scikit-learn for linear regression, how to handle multicollinearity in least squares method. Multicollinearity may cause instability in regression...
Ridge Regression and Lasso Regression: Differences, Applications and Selection Guide
Ridge Regression and Lasso Regression are two commonly used linear regression regularization methods for solving overfitting and multicollinearity in machine...
Linear Regression Machine Learning Perspective: Matrix Representation, SSE Loss & Least Squares
Linear regression's core chain: unify the prediction function as y=Xw in matrix form, treating the parameter vector w as the only unknown; use a loss function to characterize...
NumPy Matrix Multiplication Hand-written Multivariate Linear Regression: Normal Equation, SSE/MSE/RMSE & R²
Hand-written multivariate linear regression using pandas DataFrames and NumPy matrix multiplication. The core idea is to form the normal...
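The normal-equation solution and the SSE/MSE/RMSE and R² metrics named in this summary fit in a short NumPy example (toy noise-free data, illustrative only):

```python
import numpy as np

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 3.0, 4.0]])  # first column of ones = intercept term
y = np.array([6.0, 8.0, 9.0, 19.0])

# Normal equation: w = (X^T X)^{-1} X^T y, solved as a linear system
# rather than via an explicit matrix inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)

pred = X @ w
sse = ((y - pred) ** 2).sum()          # sum of squared errors
mse = sse / len(y)                     # mean squared error
rmse = mse ** 0.5                      # root mean squared error
r2 = 1 - sse / ((y - y.mean()) ** 2).sum()  # coefficient of determination
```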
sklearn Decision Tree Pruning Parameters: max_depth/min_samples_leaf to min_impurity_decrease
Common parameters for decision tree pruning (pre-pruning) in engineering: max_depth, min_samples_leaf, min_samples_split, max_features, min_impurity_decrease...
Confusion Matrix to ROC: Complete Review of Imbalanced Binary Classification Evaluation Metrics
The confusion matrix (TP, FP, FN, TN) establishes a unified vocabulary, then explains the business meaning of Accuracy, Precision, Recall (also called sensitivity), and the F1 Measure: Precision...
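The four metrics reduce to simple ratios of the confusion-matrix counts. A quick sketch showing why accuracy alone misleads on imbalanced data (the counts are illustrative):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives, how many are found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of P and R
    return accuracy, precision, recall, f1

# Imbalanced example: 990 negatives vs 10 positives. Accuracy looks great,
# but precision/recall expose that half the positives are missed.
acc, p, r, f1 = binary_metrics(tp=5, fp=5, fn=5, tn=985)
```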
Spark Standalone Mode: Architecture & Performance Tuning
Comprehensive explanation of Spark Standalone cluster four core components, application submission flow, SparkContext internal architecture, Shuffle evolution history and RDD optimization strategies.
SparkSQL Introduction: SQL & Distributed Computing Fusion
Systematic introduction to SparkSQL evolution history, core abstractions DataFrame/Dataset, Catalyst optimizer principle, and practical usage of multi-data source integration with Hive/HDFS.
Decision Tree from Split to Pruning: Information Gain/Gain Ratio, Continuous Variables & CART Essentials
The complete chain from 'split' to 'pruning': explains why decision trees usually use a greedy algorithm that yields a 'local optimum', and the differences in splitting criteria between...
sklearn Decision Tree Practice: criterion, Graphviz Visualization & Pruning to Prevent Overfitting
Complete flow of DecisionTreeClassifier on load_wine dataset from data splitting, model evaluation to decision tree visualization (2026 version). Focus on...
Decision Tree Model Detailed: Node Structure, Conditional Probability Perspective & Shannon Entropy Calculation
Decision Tree model systematic overview for classification tasks: three types of nodes (root/internal/leaf), recursive split flow from root to leaf, and...
Decision Tree Information Gain Detailed: Information Entropy, ID3 Feature Selection & Python Optimal Split Implementation
Decision tree information gain (Information Gain) explained: first uses information entropy (Entropy) to quantify impurity, then explains why, when splitting...
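Entropy and information gain are short formulas: gain is the parent's entropy minus the size-weighted entropy of the child groups. A minimal Python sketch (the labels are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted

parent = ["yes", "yes", "no", "no"]        # 50/50 mix: entropy = 1 bit
split = [["yes", "yes"], ["no", "no"]]     # a perfectly separating split
gain = information_gain(parent, split)     # gain = 1 bit, the maximum here
```

ID3 picks, at each node, the feature whose split maximizes this quantity.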
K-Fold Cross-Validation Practice: sklearn Look at Mean/Variance, Select More Stable KNN Hyperparameters
Random train/test split causes evaluation metrics to be unstable, and gives engineering solution: K-Fold Cross Validation. Through sklearn's cross_val_score to...
KNN Must Normalize First: Min-Max Correct Method, Data Leakage Pitfall & sklearn Implementation
In scikit-learn machine learning training pipeline, distance-based models like KNN are extremely sensitive to inconsistent feature scales: Euclidean distance...
Spark RDD Fault Tolerance: Checkpoint Principle & Best Practices
Detailed explanation of Spark Checkpoint execution flow, core differences with persist/cache, partitioner strategies, and best practices for iterative algorithms and long dependency chain scenarios.
Spark Broadcast Variables: Efficient Shared Read-Only Data
Detailed explanation of Spark broadcast variable working principle, configuration parameters and best practices, and performance optimization solution using broadcast to implement MapSideJoin instead of shuffle join.
KNN/K-Nearest Neighbors Algorithm Practice: Euclidean Distance + Voting Mechanism Handwritten Implementation, with Visualization & Tuning Points
KNN/K-Nearest Neighbors Algorithm: From Euclidean distance calculation, distance sorting, TopK voting to function encapsulation, giving reproducible Python...
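The distance-sort-vote loop described above fits in a dozen lines of standard-library Python (the function name and sample points are illustrative):

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Squared Euclidean distance is enough for ranking (no sqrt needed).
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_x, train_y)
    )
    top_k = [label for _, label in dists[:k]]   # TopK neighbors
    return Counter(top_k).most_common(1)[0][0]  # majority vote

train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["A", "A", "A", "B", "B", "B"]
label = knn_predict(train_x, train_y, (0.5, 0.5), k=3)  # → "A"
```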
scikit-learn KNN Practice: KNeighborsClassifier, kneighbors & Learning Curve to Select Optimal
From the unified estimator API (fit/predict/transform/score) to kneighbors for finding the K nearest neighbors of test samples, then using learning and parameter curves to select...
Apache Tez Practice: Hive on Tez Installation & Configuration, DAG Principles & Common Pitfalls
Apache Tez (example version Tez 0.9.x) as execution engine alternative to MapReduce on Hadoop2/YARN, providing DAG (Directed Acyclic Graph) execution model for...
Data Mining: From Wine Classification to Machine Learning Overview - Supervised/Unsupervised/Reinforcement Learning, Feature Space & Overfitting
2025's most commonly used machine learning concept framework: supervised learning (classification/regression), unsupervised learning (clustering/dimensionality...
Elasticsearch Cluster Planning & Tuning: Node Roles, Shard/Replica, Write & Search Optimization
Master / Data / Coordinating node responsibilities and production role isolation strategies, capacity planning calculations (JVM Heap 30-32GB limit, hot/cold data with disk/IO constraints, horizontal scaling path), plus shard and replica as core knobs for performance and reliability.
DataX 3.0 Architecture & Practice: Reader/Writer Plugin Model, Job/TaskGroup Scheduling, speed/errorLimit Config
DataX (DataX 3.0) is an offline data synchronization/data integration tool widely used and open-sourced within Alibaba, for enterprise-level heterogeneous data...
Spark Super WordCount: Text Cleaning & MySQL Persistence
Implement a complete production-ready word-frequency pipeline: lowercase conversion, punctuation removal, stop-word filtering, word-frequency counting, and finally efficient writes to MySQL via foreachPartition, comparing row-by-row insert vs partition batch write performance.
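The cleaning and counting stages of that pipeline can be sketched locally in plain Python (the article runs these as Spark transformations and persists to MySQL; the stop-word list and function names here are illustrative assumptions).

```python
# Local sketch of the cleaning + counting steps: lowercase → strip
# punctuation → filter stop words → count word frequencies.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of"}  # example stop-word list

def super_word_count(lines):
    counts = Counter()
    for line in lines:
        # lowercase conversion + punctuation removal
        cleaned = re.sub(r"[^a-z0-9\s]", " ", line.lower())
        # stop-word filtering + word-frequency counting
        counts.update(w for w in cleaned.split() if w not in STOP_WORDS)
    return counts

lines = ["Hello, Spark!", "hello the WORLD; spark and Flink."]
print(super_word_count(lines).most_common(3))
```

In the Spark version each step maps to an RDD operator (map/filter/reduceByKey), and the final Counter is replaced by a foreachPartition batch write so one JDBC connection serves a whole partition.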
Spark Serialization & RDD Execution Principle
Deep dive into Spark Driver-Executor process communication, Java/Kryo serialization selection, closure serialization problem troubleshooting, and RDD dependencies, Stage division and persistence storage levels.
Nginx JSON Logs to ELK: ZK+Kafka+Elasticsearch 7.3.0+Kibana Practice Setup
Configure Nginx log_format json to output structured access_log (containing @timestamp, request_time, status, request_uri, ua and other fields), start...
Filebeat → Kafka → Logstash → Elasticsearch Practice
Filebeat collects Nginx access.log and writes it to Kafka; Logstash consumes from Kafka, parses the JSON embedded in the message, routes by field (app/type) conditions, adds...
Logstash Filter Plugin Practice: grok Parsing Console & Nginx Logs (7.3.0 Config Reusable)
Article explains using grok in Logstash 7.3.0 environment to extract structured fields from console stdin and Nginx access logs (IP, time_local, method, request, status etc), and quickly verify parsing effect through stdout { codec => rubydebug }.
Logstash Output Plugin Practice: stdout/file/Elasticsearch Output Config & Tuning
Logstash Output plugin (Logstash 7.3.0) practical tutorial, covering stdout (rubydebug) for debugging, file output for local archiving, Elasticsearch output...
Logstash 7 Getting Started: stdin/file Collection, sincedb/start_position Mechanism & Troubleshooting
Logstash 7 getting started tutorial, covering stdin/file collection, sincedb mechanism and start_position effect conditions, with error quick reference table
Logstash JDBC vs Syslog Input: Principle, Scenario Comparison & Reusable Config (Based on Logstash 7.3.0)
Logstash Input plugin comparison, breaking down the technical differences between the JDBC Input and Syslog collection pipelines, applicable scenarios and key configs. JDBC...
Spark Scala WordCount Implementation
Implement distributed WordCount using Spark + Scala and Spark + Java, detailed RDD five-step processing flow, Maven project configuration and spark-submit command.
Spark Scala Practice: Pi Estimation & Mutual Friends
Deep dive into Spark RDD programming through two classic cases: Monte Carlo method distributed Pi estimation, and mutual friends analysis in social networks with two approaches, comparing Cartesian product vs data transformation performance differences.
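The Monte Carlo idea behind the Pi case is independent of Spark and can be sketched in a few lines of stdlib Python (the article distributes the sampling loop across RDD partitions; the function name and sample count below are illustrative).

```python
# Monte Carlo Pi estimation: sample points in the unit square and
# count those landing inside the quarter circle of radius 1.
import random

def estimate_pi(n_samples, seed=42):
    rng = random.Random(seed)  # fixed seed for reproducibility
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    # area ratio: quarter circle / unit square = pi / 4
    return 4.0 * inside / n_samples

print(estimate_pi(100_000))  # ≈ 3.14 (stochastic; accuracy grows with n)
```

The Spark version parallelizes the range, maps each element to a 0/1 hit, and sums with reduce, which is why the case is a good first exercise in distributing an embarrassingly parallel loop.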
Elasticsearch Concurrency Conflicts & Optimistic Lock, Distributed Data Consistency Analysis
Elasticsearch concurrency conflicts (inventory deduction read-modify-write): breaks down the cause of write overwrites and gives an engineering solution using ES optimistic...
Elasticsearch Doc Values Mechanism Detailed: Columnar Storage Supporting Sort/Aggregation/Script
A disk-based columnar data structure generated at indexing time, optimized for sorting, aggregation and script values; enabled by default for most supported types. Text fields don't provide doc values by default, so aggregation/sorting requires a keyword subfield or enabling fielddata.
Elasticsearch Segment Merge & Disk Directory Breakdown: Merge Policy, Force Merge, Shard File Structure
Explains why refresh causes an increase in small segments, how segment merge combines small segments into larger ones in the background and cleans up deleted documents, why too...
Elasticsearch Inverted Index Underlying Breakdown: Terms Dictionary, FST, SkipList & Lucene Index Files
The article details the core data structures of the Elasticsearch inverted index: Terms Dictionary, Posting List, FST (Finite State Transducer) and SkipList, and how they accelerate...
Elasticsearch Inverted Index & Read/Write Process Full Analysis: From Lucene Principles to Query/Fetch Practice
Article analyzes Elasticsearch inverted index principle based on Lucene, compares forward index vs inverted index differences, covering core concepts like...
Elasticsearch Near Real-time Search: Segment, Refresh, Flush, Translog Full Process Analysis
Article details core mechanism of Elasticsearch near real-time search, including Lucene Segment, Memory Buffer, File System Cache, Refresh, Flush and Translog...
Spark Action Operations Overview
Comprehensive introduction to Spark RDD Action operations, covering data collection, statistical aggregation, element retrieval, storage output categories, and detailed explanation of Key-Value RDD core operators like groupByKey, reduceByKey, join.
Elasticsearch Aggregation Practice: Metrics Aggregations & Bucket Aggregations Complete Usage & DSL Analysis
Covers complete practice of Metrics Aggregations and Bucket Aggregations, applicable to common Elasticsearch 7.x / 8.x versions in 2025. Article starts with...
Elasticsearch 7.3 Java Practice: Index & Document CRUD Full Process Examples
elasticsearch-rest-high-level-client implements index and document CRUD, including: creating an index in two ways (JSON and XContentBuilder), configuring shards and replicas, deleting an index, inserting a single document, querying a document by ID, and using match_all to query all data.
Elasticsearch Term Exact Query & Bool Combination Practice: range/regexp/fuzzy Full Examples
This article demonstrates Elasticsearch term-level queries including term, terms, range, exists, prefix, regexp, fuzzy, ids queries, and bool compound queries. Covers creating book index, inserting sample data, various query DSL examples and execution results.
Elasticsearch Filter DSL Full Practice: Filter Query, Sort Pagination, Highlight & Batch Operations
This article introduces the difference between Filter DSL and queries: Filter DSL doesn't calculate relevance scores and is specifically optimized for filter-scenario execution...
Elasticsearch Mapping & Document CRUD Practice (Based on 7.x/8.x)
This article details Elasticsearch 7.x/8.x mapping config and document CRUD operations, including index/field mapping creation, mapping properties (type, index, store, analyzer), document create, query, full/partial update, delete by ID or condition.
Elasticsearch Query DSL Practice: match/match_phrase/query_string/multi_match Full Analysis
In-depth explanation of core Query DSL usage in Elasticsearch 7.3, focusing on differences and pitfalls of match, match_phrase, query_string, multi_match and other full-text search statements in real business scenarios.
Spark Cluster Architecture & Deployment Modes
Deep dive into Spark cluster core components Driver, Cluster Manager, Executor responsibilities, comparison of Standalone, YARN, Kubernetes deployment modes, and static vs dynamic resource allocation strategies.
Elasticsearch-Head & Kibana 7.3.0 Practice: Installation Points, Connectivity & Common Pitfalls
Introduction to Elasticsearch-Head plugin and Kibana 7.3.0 installation and connectivity points, covering Chrome extension quick access, ES cluster health and...
Elasticsearch Index Operations & IK Analyzer Practice: 7.3/8.15 Full Process Quick Reference
Elasticsearch index create, existence check (single/multi/all), open/close/delete and health troubleshooting, as well as IK analyzer installation, ik_max_word/ik_smart analysis and Nginx hosting scheme for remote extended dictionary/stop words.
Elasticsearch Getting Started: Index/Document CRUD & Search Minimum Examples
Elasticsearch (ES 7.x/8.x) minimum examples: create an index, insert a document, query by ID, update, and the _search flow, with return samples and screenshots, helping readers complete an 'index/document CRUD' run-through in 3-10 minutes.
Elasticsearch 7.3.0 Three-Node Cluster Practice: Directory/Parameter/Startup to Online
Elasticsearch 7.3.0 three-node cluster deployment practice tutorial, covering directory creation and permission settings, system parameter config...
ELK Elastic Stack (ELK) Practice: Architecture Key Points, Indexing & Troubleshooting Checklist
Article introduces core capabilities and common practices of Elasticsearch 8.x, Logstash 8.x, Kibana 8.x, covering key aspects of centralized logging system: collection, transmission, indexing, shard/replica, query DSL, aggregation and ILM lifecycle management.
Elasticsearch Single Machine Cloud Server Deployment & Operation Detailed Process
Elasticsearch is a distributed full-text search engine that supports both single-node and cluster deployment. For many small-scale business scenarios, Single-Node Mode is sufficient.
Apache Kylin Cube7 Practice: Aggregation Group/RowKey/Encoding & Size Precision Comparison
Covers usage trade-offs of Aggregation Group, Mandatory Dimension, Hierarchy Dimension and Joint Dimension, and explains the impact of dictionary encoding, RowKey order and ShardBy sharding on build and query performance, with CubeStatsReader precision/sparsity readings and the RowKey/HBase storage model.
Apache Kylin 1.6 Streaming Cubing Practice: Kafka to Minute-level OLAP
Kafka→Kylin real-time OLAP pipeline, providing minute-level aggregation queries for common 2025 business scenarios (e-commerce transactions, user behavior, IoT monitoring).
Spark RDD Deep Dive: Five Key Features
Comprehensive analysis of Spark core data abstraction RDD's five key features (partitions, compute function, dependencies, partitioner, preferred locations), lazy evaluation, fault tolerance, and narrow/wide dependency principles.
Spark RDD Creation & Transformation Operations
Detailed explanation of three RDD creation methods (parallelize, textFile, transform from existing RDD), and usage of common Transformation operators like map, filter, flatMap, groupBy, sortBy with lazy evaluation principles.
Apache Kylin Segment Merge Practice: Manual/Auto Merge, Retention Policy & JDBC Examples
Apache Kylin Segment merge practice tutorial, covering manual MERGE Job flow, continuous Segment requirements, Auto Merge multi-level threshold strategy, Retention Threshold cleanup logic, deletion flow (Disable→Delete) and JDBC connection query examples.
Apache Kylin Cuboid Pruning Practice: Derived Dimensions & Expansion Rate Control
Cuboid pruning optimization: When there are many dimensions, Cuboid count grows exponentially, causing long build time and storage expansion. Engineering...
Apache Kylin Cube Practice: Complete Guide for Modeling, Build and Query Acceleration
Apache Kylin 4.0 Cube modeling and query acceleration method: Complete star modeling with fact tables and dimension tables, design dimensions and measures, use...
Apache Kylin Incremental Cube & Segment Practice: Daily Partition Incremental Build Guide
Using date field of Hive partitioned table as Partition Date Column, split Cube into multiple Segments, incrementally build by range to avoid repeated computation of historical data; also compare full build vs incremental build differences in query paths.
Apache Kylin Cube Practice: Hive Load & Pre-computation Acceleration (With Cuboid/Real-time OLAP, Kylin 4.x)
OLAP example: Generate dimension and fact data via Python, after Hive (wzk_kylin) load, design Cube in Kylin (dimensions/measures/Cuboids), and provide...
Apache Kylin Cube Practice: From Modeling to Build and Query (With Pitfalls & Optimization)
Apache Kylin (3.x/4.x) Cube setup and optimization: complete flow from DataSource → Model → Cube, covering dimension modeling, measure design, Cuboid...
From MapReduce to Spark: Big Data Computing Evolution
Systematic overview of big data processing engine evolution from MapReduce to Spark to Flink, analyzing Spark in-memory computing model, unified ecosystem and core components.
Apache Kylin Comprehensive Guide: MOLAP Architecture, Hive/Kafka Practice & Real-time OLAP
Background, evolution and engineering practice of Apache Kylin, focusing on MOLAP solution implementation path for massive data analysis. Core keywords: Apache...
Apache Kylin 3.1.1 Deployment on Hadoop 2.9/Hive 2.3/HBase 1.3 (With Pitfalls & Fixes)
Complete deployment record of Apache Kylin 3.1.1 on Hadoop 2.9.2, Hive 2.3.9, HBase 1.3.1, Spark 2.4.5 (without-hadoop, Scala 2.12) and three-node...
Kafka Storage Mechanism: Log Segmentation & Retention
Deep analysis of Kafka log storage architecture, including LogSegment design, sparse offset index and timestamp index principles, message lookup flow, and log retention and cleanup strategy configuration.
Kafka High Performance: Zero-Copy, mmap & Sequential Write
Deep dive into Kafka's three I/O technologies achieving high throughput: sendfile zero-copy, mmap memory mapping and page cache sequential write, revealing kernel-level optimization behind million messages per second.
Kafka Replica Mechanism: ISR & Leader Election
Deep dive into Kafka replica mechanism, including ISR sync node set maintenance, Leader election process, and unclean election trade-offs between consistency and availability.
Kafka Exactly-Once: Idempotence & Transactions
Systematic explanation of how Kafka achieves Exactly-Once semantics through idempotent producers and transactions, covering PID/sequence number principle, cross-partition transaction configuration and end-to-end EOS implementation.
Apache Druid Storage & Query Architecture: Segment/Chunk/Roll-up/Bitmap Explained
Apache Druid data storage and high-performance query path: from DataSource/Chunk/Segment layering, to columnar storage, Roll-up pre-aggregation, Bitmap...
Apache Druid + Kafka Real-time Analysis: JSON Flattening Ingestion & SQL Metrics Full Process
A Scala Kafka Producer writes order/click data to a Kafka Topic (example topic: druid2), with continuous ingestion into Druid through the Kafka Indexing Service. Since...
Apache Druid Real-time Kafka Ingestion: Complete Practice from Ingestion to Query
Complete practice of Apache Druid real-time Kafka ingestion, using network traffic JSON as example, completing data ingestion through Druid console's Streaming/Kafka wizard, parsing time column, setting dimensions and metrics, and verifying results with SQL.
Apache Druid Architecture & Component Responsibilities: Coordinator/Overlord/Historical Deep Dive
Apache Druid component responsibilities and deployment points from 0.13.0 to current (2025): Coordinator manages Historical node Segment...
Apache Druid Cluster Deployment [Part 1]: MySQL Metadata + HDFS Deep Storage & Low-Config Tuning
Apache Druid 30.0.0 deployable solution covering MySQL metadata storage (mysql-connector-java 8.0.19), HDFS deep storage and HDFS indexing-logs, plus Kafka...
Apache Druid Cluster Mode [Part 2]: Low-Memory Cluster Practice: JVM/DirectMemory & Startup Scripts
Low-memory cluster practice for Apache Druid 30.0.0 on three nodes: provides JVM parameters and runtime.properties key items for Broker/Historical/Router, and explains the relationship between off-heap memory and the processing buffer ratio.
Kafka Topic, Partition & Consumer: Rebalance Optimization
Deep dive into Kafka Topic, Partition, Consumer Group core mechanisms, covering custom deserialization, offset management and rebalance optimization configuration.
Kafka Topic Management: Commands & Java API
Comprehensive introduction to Kafka Topic operations, including kafka-topics.sh commands, replica assignment strategy principles, and KafkaAdminClient Java API core usage.
Apache Druid Real-time OLAP Architecture & Selection Points
Apache Druid real-time OLAP practice: suitable for event detail with time as primary key, sub-second aggregation and high-concurrency self-service analysis.
Apache Druid Single-Machine Deployment: Architecture Overview, Startup Checklist & Quick Troubleshooting
Apache Druid 30.0.0 for single-machine quick verification and engineering implementation, systematically reviewing Druid architecture (Coordinator, Historical,...
Flink Write to Kudu Practice: Custom Sink Full Process (Flink 1.11/Kudu 1.17/Java 11)
Complete runnable example for Kudu, based on Flink 1.11.1 (Scala 2.12)/Java 11 and kudu-client 1.17.0 (2025 test). Through RichSinkFunction custom sink,...
Kafka Producer Interceptor & Interceptor Chain
Introduction to Kafka 0.10 Producer interceptor mechanism, covering onSend and onAcknowledgement interception points, interceptor chain execution order and error isolation, with complete custom interceptor implementation.
Kafka Consumer: Consumption Flow, Heartbeat & Parameter Tuning
Detailed explanation of Kafka Consumer Group consumption model, partition assignment strategy, heartbeat keep-alive mechanism, and tuning practices for key parameters like session.timeout.ms, heartbeat.interval.ms, max.poll.interval.ms.
Apache Kudu Docker Quick Deployment: 3 Master/5 TServer Practice & Pitfalls Quick Reference
Apache Kudu Docker Compose quick deployment solution on Ubuntu 22.04 cloud host, covering Kudu Master and Tablet Server components,...
Java Access Apache Kudu: Table Creation to CRUD (Including KuduSession Flush Mode & Multi-Master Config)
Java client (kudu-client 1.4.0) connects to Apache Kudu with multiple Masters (example ports 7051/7151/7251), completes full process of table creation, insert,...
Apache Kudu: Real-time Write + OLAP Architecture, Performance & Integration
Apache Kudu in 2025 version and ecosystem integration: Latest Kudu 1.18.0 (2025/07) released, bringing segmented LRU Block Cache and RocksDB-based metadata...
Apache Kudu Architecture & Practice: RowSet, Partition & Raft Deep Dive
Apache Kudu's Master/TabletServer architecture, RowSet (MemRowSet/DiskRowSet) write/read path, MVCC, and Raft consensus role in replica and failover; provides...
ClickHouse MergeTree Partition/TTL, Materialized View, ALTER & system.parts Full Process Example
ClickHouse beginner and operations practice, based on real cluster (h121/h122/h123) demonstrating complete process from connection to database/table creation,...
Kafka Producer Message Sending Flow & Core Parameters
Deep analysis of Kafka Producer initialization, message interception, serialization, partition routing, buffer batch sending, ACK confirmation and complete sending chain, with key parameter tuning suggestions.
Kafka Serialization & Partitioning: Custom Implementation
Deep dive into Kafka message serialization and partition routing, including complete code for custom Serializer and Partitioner, mastering precise message routing and efficient transmission.
ClickHouse Replica Deep Dive: ReplicatedMergeTree + ZooKeeper from 0-1
ClickHouse replica full chain: ZK/Keeper preparation, macros configuration, ON CLUSTER consistent table creation, write deduplication & replication mechanism,...
ClickHouse Sharding × Replica × Distributed: ReplicatedMergeTree, Keeper, insert_quorum & Load Balancing
ClickHouse sharding × replica × Distributed architecture: Based on ReplicatedMergeTree + Distributed, using ON CLUSTER one-click table creation on 3-shard ×...
ClickHouse MergeTree Best Practices: Replacing Deduplication, Summing Aggregation, Partition Design & Materialized View Alternatives
ClickHouse two light aggregation engines ReplacingMergeTree and SummingMergeTree, combined with minimum runnable examples (MRE) and comparative queries,...
ClickHouse CollapsingMergeTree & External Data Sources: HDFS/MySQL/Kafka Integration
ClickHouse external data source engine minimum feasible solution: DDL templates, key parameters and read/write paths for ENGINE=HDFS, ENGINE=MySQL, ENGINE=Kafka.
ClickHouse MergeTree Practical Guide: Partition, Sparse Index & Merge Mechanism
ClickHouse MergeTree key mechanisms: batch writes form parts, background merge (Compact/Wide part formats), ORDER BY defines the sparse primary index,...
ClickHouse MergeTree Deep Dive: Partition Pruning × Sparse Primary Index × Marks × Compression
ClickHouse MergeTree storage and query path: column files (*.bin), sparse primary index (primary.idx), marker files (.mrk/.mrk2) and index_granularity...
Kafka Operations: Shell Commands & Java Client Examples
Covers Kafka daily operations: daemon startup, Shell topic management commands, and Java client programming (complete Producer/Consumer code) with key configuration parameters and ConsumerRebalanceListener usage.
Spring Boot Integration with Kafka
Detailed guide on integrating Kafka in Spring Boot projects, including dependency configuration, KafkaTemplate sync/async message sending, and complete @KafkaListener consumption practice.
Spark Distributed Environment Setup
Step-by-step Apache Spark distributed computing environment setup, covering download and extract, environment variable configuration, slaves/spark-env.sh core config adjustments, and complete multi-node cluster distribution and startup.
ClickHouse Cluster Connectivity Self-Check & Data Types Guide | Run ON CLUSTER in 10 Minutes
Using three-node cluster (h121/122/123) as example, first complete cluster connectivity self-check: system.clusters validation → ON CLUSTER create...
ClickHouse Table Engines: TinyLog/Log/StripeLog/Memory/Merge Selection Guide
Walks through ClickHouse table engines: TinyLog, Log, StripeLog, Memory and Merge principles, applicable scenarios and pitfalls, providing a reproducible minimum...
Kafka Components: Producer, Broker, Consumer Full Flow
Deep dive into Kafka's three core components: Producer partitioning strategy and ACK mechanism, Broker Leader/Follower architecture, Consumer Group partition assignment and offset management.
Kafka Installation: From ZooKeeper to KRaft Evolution
Introduction to Kafka 2.x vs 3.x core differences, detailed cluster installation steps, ZooKeeper configuration, Broker parameter settings, and how KRaft mode replaces ZooKeeper dependency.
ClickHouse Concepts & Basics | Why Fast? Columnar + Vectorized + MergeTree Comparison
For high-concurrency, low-latency OLAP scenarios, this article explains ClickHouse's underlying advantages (columnar+compression+vectorized, MergeTree family),...
ClickHouse Single Machine + Cluster Node Deployment Guide | Installation Configuration | systemd Management / config.d
Official recommended keyring + signed-by installation of ClickHouse on Ubuntu, start with systemd and self-check; provides single machine minimum example...
Flink CEP Practice: 24 Hours ≥5 Transactions & 10 Minutes Unpaid Detection Cases
Flink CEP (Complex Event Processing) complex event processing mechanism, combined with actual cases to deeply explain its application principles and practical...
Flink SQL Quick Start | Table API + SQL in 3 Minutes with toChangelogStream
Engineering perspective to quickly run Flink SQL: Provides modern dependencies (no longer using blink planner), minimum runnable example (MRE), Table API and...
Flink CEP Deep Dive: Complex Event Processing Complete Guide
Flink CEP is the core component for real-time analysis of complex event streams in Flink, providing a complete pattern matching framework, supporting...
Flink CEP Timeout Event Extraction: Complete Guide with Malicious Login Detection Case
Flink CEP timeout event extraction is a key step in stream processing, used to capture partial matches that exceed the window time (within) during pattern...
Redis High Availability: Master-Slave Replication & Sentinel
Deep dive into Redis high availability: master-slave replication, Sentinel automatic failover, and distributed lock design with Docker deployment examples.
Kafka Architecture: High-Throughput Distributed Messaging
Systematic introduction to Kafka core architecture: Topic/Partition/Replica model, ISR mechanism, zero-copy optimization, message format and typical use cases.
Flink StateBackend Deep Dive: Memory, Fs, RocksDB & OperatorState Management
ManagedOperatorState manages non-keyed state, achieving state consistency when operators recover from failures or rescale. Developers use it by implementing the CheckpointedFunction interface; it supports two data structures: ListState and BroadcastState.
Flink Parallelism Deep Dive: From Concepts to Best Practices
In Flink, Parallelism is the core parameter measuring task concurrent processing capability, determining the number of tasks that can run simultaneously for...
Flink Broadcast State: Dynamic Logic Updates in Real-time Stream Computing
Broadcast State is an important mechanism in Apache Flink that supports dynamic logic updates in streaming applications, widely used in real-time risk control,...
Flink State Backend: Memory, Fs, RocksDB & Performance Differences
State Storage (State Backend) is the core mechanism for implementing stateful stream computing in Flink, determining data reliability, performance and fault...
Flink Parallelism Setting Priority: Principles, Configuration & Best Practices
A Flink program consists of multiple Operators (Source, Transformation, Sink). An Operator is executed by multiple parallel Tasks (threads), and the number of...
Flink State: Keyed State, Operator State & KeyGroups Working Principles
Based on whether intermediate state is needed, Flink computation can be divided into stateful and stateless: Stateless computation like Map, Filter, FlatMap...
Redis Cache Problems: Penetration, Breakdown, Avalanche, Hot Key and Big Key
Systematic overview of the five most common Redis cache problems in high-concurrency scenarios: cache penetration, cache breakdown, cache avalanche, hot key, and big key. Analyzes the root cause of each problem and provides actionable solutions.
Redis Distributed Lock: Optimistic Lock, WATCH and SETNX with Lua and Java
Redis optimistic lock in practice: WATCH/MULTI/EXEC mechanism explained, Lua scripts for atomic operations, SETNX+EXPIRE distributed lock from basics to Redisson, with complete Java code examples.
Flink Time Semantics: EventTime, ProcessingTime, IngestionTime & Watermark Mechanism
Watermark is a special marker used to tell Flink the progress of events in the data stream. Simply put, Watermark is the 'current time' estimated by Flink in...
Flink Watermark Complete Guide: Event Time Window, Out-of-Order & Late Data
Flink's Watermark mechanism is one of the most core concepts in event time window computation, used for handling out-of-order events and ensuring accurate...
Flink Window Complete Guide: Tumbling, Sliding, Session
Flink's Window mechanism is the core bridge between stream processing and unified batch processing architecture. Flink treats batch as a special case of stream processing, using time windows (Tumbling, Sliding, Session) and count windows to split infinite streams into finite datasets.
Flink Sliding Window Deep Dive: Principles, Use Cases & Implementation Examples
Sliding Window is one of the core mechanisms in Apache Flink stream processing, more flexible than fixed windows, widely used in real-time monitoring, anomaly...
Flink JDBC Sink Deep Dive: MySQL Real-time Write, Batch Optimization & Best Practices
JDBC Sink is one of the most commonly used data output components, often used to write stream and batch processing results to relational databases like MySQL,...
Flink Batch Processing DataSet API: Use Cases, Code Examples & Optimization Mechanisms
Flink's DataSet API is the core programming interface for batch processing, designed for processing static, bounded datasets, supporting TB to PB scale big...
Redis Memory Management: Key Expiration and Eviction Policies
Comprehensive analysis of Redis memory control mechanisms, including maxmemory configuration, three key expiration deletion strategies (lazy/active/scheduled), and 8 memory eviction policies with applicable scenarios and selection guidance.
Redis Communication Internals: RESP Protocol and Reactor Event-Driven Model
Deep dive into Redis communication internals: RESP serialization protocol five data types, Pipeline batch processing mode, and how the epoll-based Reactor single-threaded event-driven architecture supports Redis high-concurrency processing capability.
Flink DataStream Transformation: Map, FlatMap, Filter to Window Complete Guide
Flink provides rich operators for DataStream to support flexible data stream processing in different scenarios. Common operators include Map, FlatMap and...
Flink Sink Usage Guide: Types, Fault Tolerance Semantics & Use Cases
Flink's Sink is the final output endpoint for data stream processing, used to write processed results to external systems or storage media. It is the endpoint of streaming applications, determining how data is saved, transmitted or consumed.
Flink Source Operator Deep Dive: Non-Parallel Source Principles & Use Cases
Non-Parallel Source is a source operation in Flink with parallelism fixed at 1. It runs as a single instance regardless of cluster size, ensuring elements are processed sequentially.
Flink SourceFunction to RichSourceFunction: Enhanced Source Functions & Practical Examples
RichSourceFunction and RichParallelSourceFunction are enhanced source functions suitable for scenarios requiring complex logic and resource management.
Flink on YARN Deployment: Environment Variables, Configuration & Resource Application
Deploying Flink in YARN mode requires completing a series of environment configuration and cluster management operations. First, configure environment...
Flink DataStream API: DataSource, Transformation & Sink Complete Guide
Covers the three pillars of the DataStream API: DataSource, Transformation and Sink. DataSource provides diverse data input methods including file systems, message queues, databases and custom data sources.
Redis Persistence: RDB vs AOF Comparison and Production Strategy
Systematic comparison of Redis two persistence solutions: RDB snapshot and AOF log — configuration methods, trigger mechanisms, pros and cons, AOF rewrite mechanism, and recommended strategies for production environments.
Redis RDB Persistence: Snapshot Principles, Configuration and Trade-offs
In-depth analysis of Redis RDB persistence mechanism, covering trigger methods, BGSAVE execution flow, configuration parameters, file structure, and comparison with AOF, helping you make informed persistence decisions in production environments.
Flink Architecture Deep Dive: JobManager, TaskManager & Core Roles Overview
Flink's runtime architecture adopts typical Master/Slave pattern with clear division of responsibilities among core components. JobManager as Master is...
Flink Installation & Deployment Guide: Local, Standalone & YARN Modes
Flink provides multiple installation modes to suit different scenarios. Local mode is suitable for personal learning and small-scale debugging with simple...
Apache Flink Deep Dive: From Origin to Technical Features
Apache Flink is an open-source big data stream processing framework, supporting efficient computation of unbounded stream and bounded batch data. With 'unified...
Flink Stream-Batch Integration Introduction: Concept Analysis & WordCount Code Practice
Apache Flink supports both stream processing and batch processing. Stream processing is suitable for real-time data like sensors, logs or trading streams,...
Redis Lua Scripts: EVAL, redis.call and Atomic Operations in Practice
Systematic explanation of Redis Lua script EVAL command syntax, differences between redis.call and redis.pcall, and four typical practical cases: atomic counter, CAS (Compare-And-Swap), batch operations, and distributed lock implementation using Lua scripts.
Redis Slow Query Log and Performance Tuning in Production
Detailed explanation of Redis slow query log configuration parameters (slowlog-log-slower-than, slowlog-max-len), core commands, and production-grade performance tuning strategies including data structure optimization, Pipeline usage, and monitoring system setup.
Spark Streaming Kafka Consumption: Offset Acquisition, Storage & Recovery Details
When Spark Streaming integrates with Kafka, Offset management is key to ensuring data processing continuity and consistency. Offset marks message position in...
Spark Streaming Integration with Kafka: Offset Management Mechanism Details & Best Practices
Offset is used to mark message position in Kafka partition. Proper management can achieve at-least-once or even exactly-once data processing semantics. By persisting Offset, application can resume consumption from last processed position during fault recovery, avoiding message loss or duplication.
Spark Streaming Stateful Transformations: Window Operations & State Tracking with Cases
Window operations integrate data from multiple batches over a longer time range by setting window length and slide duration. Cases demonstrate reduceByWindow...
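The window arithmetic summarized above (window length and slide duration over batches) can be shown without Spark; the sketch below is plain Java, not the DStream `reduceByWindow` API — a window of 3 batches sliding by 1 batch, maintained incrementally by subtracting the value that leaves the window:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sliding-window sum: window length = 3 batches, slide = 1 batch.
public class SlidingWindowSum {
    private final Deque<Integer> window = new ArrayDeque<>();
    private final int windowLength;
    private int sum = 0;

    public SlidingWindowSum(int windowLength) { this.windowLength = windowLength; }

    // Feed one batch's value; returns the current windowed sum.
    public int addBatch(int batchValue) {
        window.addLast(batchValue);
        sum += batchValue;
        if (window.size() > windowLength) {
            sum -= window.removeFirst(); // oldest batch slides out of the window
        }
        return sum;
    }

    public static void main(String[] args) {
        SlidingWindowSum w = new SlidingWindowSum(3);
        System.out.println(w.addBatch(1)); // 1
        System.out.println(w.addBatch(2)); // 3
        System.out.println(w.addBatch(3)); // 6
        System.out.println(w.addBatch(4)); // 9 (2+3+4: batch 1 slid out)
    }
}
```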
Spark Streaming Integration with Kafka: Receiver and Direct Approaches with Code Cases
This article introduces two Spark Streaming integration methods with Kafka: Receiver Approach and Direct Approach. Receiver uses Executor-based Receiver to...
Redis Advanced Data Types: Bitmap, Geo and Stream
Deep dive into Redis three advanced data types: Bitmap, Geo (GeoHash, Z-order curve, Base32 encoding), and Stream message stream, with common commands and practical examples.
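The Bitmap type mentioned above stores flags as individual bits under one key; `java.util.BitSet` is the closest in-JVM analogue. The sketch below mirrors SETBIT/GETBIT/BITCOUNT semantics (the sign-in key layout and user ids are illustrative, not from the article):

```java
import java.util.BitSet;

// Daily sign-in tracking, Bitmap-style: bit index = user id.
public class BitmapDemo {
    public static void main(String[] args) {
        BitSet signIns = new BitSet();
        signIns.set(1001);   // like SETBIT signin:2024-01-01 1001 1
        signIns.set(1005);
        System.out.println(signIns.get(1001));     // true  (GETBIT ... 1001)
        System.out.println(signIns.get(1002));     // false (user 1002 did not sign in)
        System.out.println(signIns.cardinality()); // 2     (BITCOUNT: users signed in)
    }
}
```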
Redis Pub/Sub: Mechanism, Weak Transaction and Risks
Detailed explanation of Redis Pub/Sub working mechanism, its three "weak transaction" reliability flaws (no persistence, no acknowledgment, no retry), and alternative solutions in production.
Redis Single Node and Cluster Installation
Install Redis 6.2.9 from source on Ubuntu, configure redis.conf for daemon mode, start redis-server and verify connection via redis-cli.
Redis Five Data Types: Complete Command Reference and Practical Scenarios
Comprehensive explanation of Redis five data types: String, List, Set, Sorted Set, and Hash. Includes common commands, underlying implementation details, and typical usage scenarios with complete command examples.
HBase Java API: Complete CRUD Code with Table Creation, Insert, Delete and Scan
Using HBase Java Client API to implement table creation, insert, delete, Get query, full table scan, and range scan. Includes complete Maven dependencies and runnable code examples covering all common HBase operations.
Redis Introduction: Features and Architecture
Introduction to Redis: in-memory data structure store, key-value database, with comparison to traditional databases and typical use cases.
HBase Cluster Deployment and High Availability Configuration
Complete HBase distributed cluster deployment: configure RegionServer on multiple nodes, HMaster high availability, integrate with ZooKeeper for coordination, with start/stop scripts and verification steps.
HBase Shell CRUD Operations and Data Model
HBase Shell commands: create table, Put/Get/Scan/Delete operations, explain HBase data model with practical examples.
HBase Overall Architecture: HMaster, HRegionServer and Data Model
Comprehensive analysis of HBase distributed database overall architecture, including ZooKeeper coordination, HMaster management node, HRegionServer data node, Region storage unit, and four-dimensional data model, suitable for big data architecture selection reference.
HBase Single Node Configuration: hbase-env and hbase-site.xml Details
Step-by-step configure HBase single node environment, explain hbase-env.sh, hbase-site.xml key parameters, complete integration with Hadoop HDFS and ZooKeeper cluster.
ZooKeeper Leader Election and ZAB Protocol Principles
Deep dive into ZooKeeper's Leader election mechanism and ZAB (ZooKeeper Atomic Broadcast) protocol, covering initial election process, message broadcast three phases, fault recovery strategy, and production deployment suggestions.
ZooKeeper Distributed Lock Java Implementation Details
Implement a distributed lock based on ZooKeeper ephemeral sequential nodes, with complete Java code covering lock competition, predecessor-node monitoring, CountDownLatch synchronization, and the full recursive-retry flow.
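The ordering logic behind the ephemeral-sequential-node lock can be sketched without a real ZooKeeper client. The in-memory model below (node names like lock-0000000001 are what ZooKeeper would generate; here a plain sequence counter stands in) shows the two rules the article's implementation relies on: the smallest sequence number holds the lock, and each waiter watches only its immediate predecessor:

```java
import java.util.TreeMap;

// In-memory sketch of ZooKeeper's sequential-node lock ordering
// (illustrative model, not the ZooKeeper API).
public class ZkLockOrder {
    private final TreeMap<Integer, String> nodes = new TreeMap<>(); // seq -> client
    private int nextSeq = 0;

    public int createSequentialNode(String client) {
        nodes.put(nextSeq, client);
        return nextSeq++;
    }

    // The smallest sequence number owns the lock.
    public boolean holdsLock(int seq) {
        return nodes.firstKey() == seq;
    }

    // The node a waiter should watch: the largest sequence below its own,
    // avoiding a "herd effect" where every waiter wakes on each release.
    public Integer predecessorOf(int seq) {
        return nodes.lowerKey(seq);
    }

    public void release(int seq) {
        nodes.remove(seq); // deleting the node fires the watch set on it
    }

    public static void main(String[] args) {
        ZkLockOrder zk = new ZkLockOrder();
        int a = zk.createSequentialNode("A");
        int b = zk.createSequentialNode("B");
        System.out.println(zk.holdsLock(a));     // true: A has the smallest sequence
        System.out.println(zk.predecessorOf(b)); // 0: B watches A's node only
        zk.release(a);
        System.out.println(zk.holdsLock(b));     // true: B acquires after A releases
    }
}
```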
ZooKeeper Watcher Principle and Command Line Practice Guide
Complete analysis of the Watcher registration-trigger-notification flow from the client and its WatchManager to the ZooKeeper server, with zkCli command-line practice demonstrating node CRUD and monitoring.
ZooKeeper Java API Practice: Node CRUD and Monitoring
Use ZkClient library to operate ZooKeeper via Java code, complete practical examples of session establishment, persistent node CRUD, child node change monitoring, and data change monitoring.
ZooKeeper Cluster Configuration Details and Startup Verification
Deep dive into zoo.cfg core parameter meanings, explain myid file configuration specifications, demonstrate 3-node cluster startup process and Leader election result verification.
ZooKeeper ZNode Data Structure and Watcher Mechanism Details
Deep dive into ZooKeeper's four ZNode node types, ZXID transaction ID structure, and one-time trigger Watcher monitoring mechanism principles and practice.
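The ZXID structure mentioned above is a 64-bit transaction id: the high 32 bits are the Leader epoch and the low 32 bits are a counter within that epoch. A small bit-arithmetic sketch (helper names are illustrative) makes the consequence concrete — any transaction from a newer epoch compares greater than every transaction from an older one:

```java
// Splitting and comparing ZXIDs: high 32 bits = epoch, low 32 bits = counter.
public class ZxidDemo {
    static long makeZxid(int epoch, int counter) {
        return ((long) epoch << 32) | (counter & 0xFFFFFFFFL);
    }
    static int epochOf(long zxid)   { return (int) (zxid >>> 32); }
    static int counterOf(long zxid) { return (int) zxid; }

    public static void main(String[] args) {
        long zxid = makeZxid(3, 17);         // epoch 3, 17th transaction
        System.out.println(epochOf(zxid));   // 3
        System.out.println(counterOf(zxid)); // 17
        // A newer epoch always wins, regardless of the old epoch's counter:
        System.out.println(makeZxid(4, 0) > makeZxid(3, Integer.MAX_VALUE)); // true
    }
}
```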
Sqoop Incremental Import and CDC Change Data Capture Principles
Introduce Sqoop's --incremental append incremental import mechanism, and deeply explain CDC (Change Data Capture) core concepts, capture method comparisons, and modern solutions like Flink CDC, Debezium.
ZooKeeper Distributed Coordination Framework Introduction and Cluster Deployment
Introduction to ZooKeeper core concepts, Leader/Follower/Observer role division, ZAB protocol principles, and demonstration of 3-node cluster installation and configuration process.
Sqoop Partial Import: --query, --columns, --where Three Filtering Methods
Detailed explanation of three ways Sqoop imports partial data from MySQL to HDFS by condition: custom query, specify columns, WHERE condition filtering, with applicable scenarios and precautions.
Sqoop and Hive Integration: MySQL ↔ Hive Bidirectional Data Migration
Demonstrates Sqoop importing MySQL data directly into a Hive table and exporting Hive data back to MySQL, covering the use of key parameters such as --hive-import and --create-hive-table.
Sqoop Data Migration ETL Tool Introduction and Installation
Introduction to Apache Sqoop core principles, use cases, and installation configuration steps on Hadoop cluster, helping quickly get started with batch data migration between MySQL and HDFS/Hive.
Sqoop Practice: MySQL Full Data Import to HDFS
Complete example demonstrating Sqoop importing MySQL table data to HDFS, covering core parameter explanations, MapReduce parallel mechanism, and execution result verification.
Flume Collect Hive Logs to HDFS
Use a Flume exec source to tail Hive log files in real time, buffer events in a memory channel, and configure an HDFS sink with time-based partitioning so log data lands in HDFS automatically.
Flume Dual Sink: Write Logs to Both HDFS and Local File
Using Flume's replicating channel selector and a three-agent cascade architecture, write the same log data to both HDFS and a local file, serving offline analysis and real-time backup needs at once.
Apache Flume Architecture and Core Concepts
Introduction to Apache Flume positioning, core components (Source, Channel, Sink), event model and common data flow topologies, and installation configuration methods.
Flume Hello World: NetCat Source + Memory Channel + Logger Sink
Flume's simplest Hello World case: a netcat source listening on a port, a memory channel for buffering, and a logger sink printing to the console, demonstrating the complete Source→Channel→Sink data flow.
Hive Metastore Three Modes and Remote Deployment
Detailed explanation of Hive Metastore's embedded, local, and remote deployment modes, and complete steps to configure high-availability remote Metastore on three-node cluster.
HiveServer2 Configuration and Beeline Remote Connection
Introduction to HiveServer2 architecture and role, configure Hadoop proxy user and WebHDFS, implement cross-node JDBC remote access to Hive via Beeline client.
Hive DDL and DML Operations
Systematic explanation of Hive DDL (database/table creation, internal and external tables) and DML (data loading, insertion, query) operations, with complete HiveQL examples and configuration optimization.
Hive HQL Advanced: Data Import/Export and Query Practice
Deep dive into Hive's multiple data import methods (LOAD/INSERT/External Table/Sqoop), data export methods, and practical usage of HQL query operations like aggregation, filtering, and sorting.
MapReduce JOIN Four Implementation Strategies
Deep dive into the principles and Java implementations of four MapReduce JOIN strategies: Reduce-Side Join, Map-Side Join, Semi-Join, and Bloom Join, with analysis of their applicable scenarios and performance characteristics.
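The Bloom Join strategy named above rests on one data structure: a Bloom filter built from the small table's join keys and shipped to every mapper, so rows from the large table whose key is definitely absent can be dropped before the shuffle. A minimal sketch (the two hash functions and sizes are illustrative, not the article's implementation):

```java
import java.util.BitSet;

// Minimal Bloom filter for the Bloom Join idea.
public class BloomJoinFilter {
    private static final int SIZE = 1 << 16;
    private final BitSet bits = new BitSet(SIZE);

    private int h1(String k) { return Math.floorMod(k.hashCode(), SIZE); }
    private int h2(String k) { return Math.floorMod(k.hashCode() * 31 + 7, SIZE); }

    public void add(String key)           { bits.set(h1(key)); bits.set(h2(key)); }
    public boolean mightContain(String k) { return bits.get(h1(k)) && bits.get(h2(k)); }

    public static void main(String[] args) {
        BloomJoinFilter f = new BloomJoinFilter();
        f.add("user_42"); // build phase: keys from the small table
        System.out.println(f.mightContain("user_42")); // true: keep row for the join
        // false means "definitely absent" -> the mapper skips the row entirely;
        // true may be a false positive, so the reducer still verifies the key.
    }
}
```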
Hive Introduction: Architecture and Cluster Installation
Introduction to Hive data warehouse core concepts, architecture components and pros/cons, with detailed steps to install and configure Hive 2.3.9 on three-node Hadoop cluster.
HDFS Java Client Practice: Upload/Download Files, Directory Scan
Using the Hadoop HDFS Java client API for file operations: Maven dependency configuration, the core FileSystem/Path/Configuration classes, and implementations of file upload, download, delete, directory listing, and progress-bar display.
Java Implementation MapReduce WordCount Complete Code
Implement Hadoop MapReduce WordCount from scratch: a detailed explanation of Hadoop's serialization mechanism, writing the Mapper, Reducer, and Driver components, Maven project configuration, and complete code for local and cluster runs.
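The Mapper/Reducer logic of WordCount can be compressed into plain Java without Hadoop types, as a sketch of what the three components compute (this mirrors the map-emit and reduce-sum phases, not the Driver setup or Hadoop serialization):

```java
import java.util.Map;
import java.util.TreeMap;

// WordCount in plain Java: map splits lines into (word, 1) pairs,
// grouping and summing play the role of shuffle + reduce.
public class WordCountSketch {
    static Map<String, Integer> wordCount(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys, like shuffle's ordering
        for (String line : lines) {                    // map phase: one (word, 1) per token
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // reduce phase: sum per key
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = { "hello world", "hello hadoop" };
        System.out.println(wordCount(lines)); // {hadoop=1, hello=2, world=1}
    }
}
```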
HDFS Distributed File System Read/Write Principle
Deep dive into HDFS architecture: NameNode, DataNode, Client roles, Block storage mechanism, file read/write process (Pipeline write and nearest read), and HDFS basic commands.
HDFS CLI Practice Complete Command Guide
Complete HDFS CLI practice: hadoop fs common commands including directory operations, file upload/download, permission management, with three-node cluster live demo.
Hadoop Cluster WordCount Distributed Computing Practice
Complete WordCount execution on Hadoop cluster: upload files to HDFS, submit MapReduce job, view running status through YARN UI, verify true distributed computing.
Hadoop JobHistoryServer Configuration and Log Aggregation
Configure Hadoop JobHistoryServer to record MapReduce job execution history, enable YARN log aggregation, view job details and logs via Web UI.
Hadoop Cluster SSH Passwordless Login Configuration and Distribution Script
Complete guide for Hadoop three-node cluster SSH passwordless login: generate RSA keys, distribute public keys, write rsync cluster distribution script, including pitfall notes and /etc/hosts configuration points.
Hadoop Cluster Startup and Web UI Verification
Complete startup process for Hadoop three-node cluster: format NameNode, start HDFS and YARN, verify cluster status via Web UI, including start-dfs.sh and start-yarn.sh usage.
Basic Environment Setup: Hadoop Cluster
Detailed tutorial on setting up Hadoop cluster environment on 3 cloud servers (2C4G configuration), including HDFS, MapReduce, YARN components introduction, Java and Hadoop environment configuration steps.
Hadoop Cluster XML Configuration Details
Detailed explanation of Hadoop cluster three-node XML configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, including NameNode, DataNode, ResourceManager configuration instructions.