Tag: Big Data (大数据)
284 articles
AI Research #131: Java 17/21/25 Complete Comparison
Java 17 (2021), Java 21 (2023), Java 25 (2025) language and JVM changes, covering Virtual Threads (Project Loom), Records/Pattern Matching (Project Amber),...
AI Investigation #54: Big Data Industry Applications and Technology Selection Trends
Big data has achieved deep integration in finance, e-commerce, internet, communications, manufacturing, healthcare, education and other industries, becoming the core engine for business innovation.
AI Investigation #53: Big Data Talent Landscape - Experience Distribution, Growth Paths and Industry Trends
The talent structure in the big data industry shows characteristics of youth and rapid growth. The 25-30 age group is the main force, while 30-35 year-olds are gradually becoming the core strength.
AI Investigation #52: Big Data Technology Landscape - Lakehouse, Data Mesh, Serverless and Emerging Trends
Big data technology is undergoing a new wave of transformation. Lakehouse architecture combines the advantages of data lakes and data warehouses. Data Mesh...
AI Investigation #51: Big Data Technology Evolution - Obsolete Frameworks and Architectures
Big data technology evolution: MapReduce replaced by Spark, Storm replaced by Flink, Pig/Hive gradually phased out. This article analyzes why these technologies were eliminated and the technical reasoning behind the evolution.
AI Investigation #50: Big Data Evolution - Two Decades of Architecture Transformation from Hadoop Batch to Flink Real-time Computing
Two decades of big data evolution: from 2006 MapReduce batch processing to 2013 Spark in-memory computing, to 2019 Flink real-time computing. Architecture evolved from monolithic Hadoop to YARN multi-engine, then to cloud-native Kubernetes.
AI Research #49: Big Data Survey Report - Development History from Concept Birth to Diversified Ecosystem (1997-2025)
Big data development began in 1997, when NASA proposed the concept. From 2003 to 2006, Google published its three landmark papers on GFS, MapReduce, and Bigtable, leading the distributed computing revolution. Hadoop was born in 2005 and became an Apache top-level project in 2008, forming a complete ecosystem.
Spark MLlib GBDT Case Study: Residual Calculation to Regression Tree Iteration
GBDT practical case study walking through the complete process from residual calculation to regression tree construction and iterative training. Covers GBDT...
Spark MLlib: Bagging vs Boosting Differences and GBDT Gradient Boosting Trees
Introduces the differences between Bagging and Boosting in machine learning, and the GBDT (Gradient Boosting Decision Tree) algorithm principles. Main content:...
Spark MLlib GBDT Algorithm: Gradient Boosting Principles, Negative Gradient and XGBoost
This article introduces the principles and applications of gradient boosting tree (GBDT) algorithm. First explains boosting tree basic concept through simple examples, then details algorithm flow including negative gradient calculation, regression tree fitting, and model update steps.
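To make the negative-gradient step concrete: under squared loss the negative gradient is simply the residual y - F(x), so one boosting round fits a regression tree to residuals and adds it to the model. A minimal pure-Python sketch (the toy data and the single-split `fit_stump` helper are illustrative, not from the article):

```python
# One gradient-boosting round for squared loss (toy example).
# Under L = (y - F)^2 / 2, the negative gradient equals the residual y - F.

def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (stump) to the residuals by brute force."""
    best = None
    for threshold in xs:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.0, 3.1, 2.9]

f0 = sum(ys) / len(ys)                   # initial model F0: the global mean
residuals = [y - f0 for y in ys]         # negative gradient under squared loss
tree = fit_stump(xs, residuals)          # fit a regression tree to residuals
lr = 1.0
preds = [f0 + lr * tree(x) for x in xs]  # model update: F1 = F0 + lr * tree
```

Each further round would recompute residuals against the updated model and fit another tree, which is the iteration the article walks through.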
Spark MLlib Ensemble Learning: Random Forest, Bagging and Boosting Explained
This article systematically introduces ensemble learning methods in machine learning. Main content includes: 1) Basic definition and classification of ensemble...
Spark MLlib Decision Tree Pruning: Pre-pruning, Post-pruning and ID3 vs C4.5 vs CART
This article systematically introduces decision tree pre-pruning and post-pruning principles, compares core differences between three mainstream algorithms...
Spark MLlib Decision Tree: Classification Principles, Gini Coefficient and Entropy
This article introduces the basic concepts and classification principles of decision trees. Decision tree is a non-linear...
Spark MLlib Logistic Regression: Input Function, Sigmoid, Loss and Diabetes Prediction
This article introduces the basic principles, application scenarios, and implementation in Spark MLlib of logistic regression. Logistic regression is an efficient binary classification algorithm widely used in fields such as ad click-through rate prediction and spam email identification.
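The two building blocks named here, the Sigmoid function and the log loss it is trained against, are small enough to write out. A minimal sketch (illustrative, not the article's Spark MLlib code):

```python
import math

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Binary cross-entropy for one sample: the loss logistic regression minimizes."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

p = sigmoid(0.0)       # a score of 0 maps to probability 0.5
loss = log_loss(1, p)  # ln 2 ≈ 0.693 for a 50/50 prediction of a positive
```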
Spark MLlib Linear Regression: Scenarios, Loss Function and Gradient Descent Optimization
Linear regression uses regression equations to model relationships between independent and dependent variables. This article covers regression scenarios (house...
Big Data #268: Real-time Warehouse ODS Layer - Writing Kafka Dimension Tables to DIM
Writing dimension tables (DIM) from Kafka typically involves reading real-time or batch data from Kafka topics and updating dimension tables based on the data...
Big Data #269: Real-time Warehouse DIM, DW and ADS Layer Processing
DW (Data Warehouse layer) is built from DWD, DWS, and DIM layer data, completing data architecture and integration, establishing consistent dimensions, and...
Spark MLlib Logistic Regression: Sigmoid, Loss Function and Diabetes Prediction Case
Logistic regression is a classification model in machine learning — an efficient binary classification algorithm widely used in ad click-through rate...
Big Data #266: Canal Integration with Kafka - Real-time Data Warehouse
This article introduces Alibaba's open-source Canal tool, which implements Change Data Capture (CDC) by parsing MySQL binlog. Demonstrates how to integrate...
Realtime Warehouse - ODS: Lambda and Kappa Architecture Core Concepts
In internet companies, common ODS data includes business log data (Log) and business DB data. For business DB data, collecting data from relational databases...
Spark MLlib Linear Regression: Scenarios, Loss Function and Gradient Descent
Linear Regression is an analytical method that uses regression equations to model the relationship between one or more independent variables and a dependent...
Canal Deployment: Installation, Service Startup and Common Issues
Canal is an open-source data synchronization tool from Alibaba for MySQL database incremental log parsing and synchronization. It simulates the MySQL slave...
Canal Working Principle: Workflow and MySQL Binlog Introduction
Canal is an open-source tool for MySQL database binlog incremental subscription and consumption, primarily used for data synchronization and distributed...
MySQL Binlog Deep Dive: Storage Directory, Change Records and Canal Configuration
MySQL's Binary Log (binlog) is a log file type in MySQL that records all change operations performed on the database (excluding SELECT and SHOW queries). It is...
Canal Data Sync: Introduction, Background, Principles and Advantages
Alibaba B2B's cross-region business between domestic sellers and overseas buyers drove the need for data synchronization between Hangzhou and US data centers.
Realtime Warehouse - Business Database Table Structure: Trade Orders, Order Products, Product Categories, Merchant Stores, Regional Organization Tables
Realtime data warehouse is a data warehouse system that differs from traditional batch processing data warehouses by emphasizing low latency, high throughput,...
Real-time Data Warehouse: Background, Architecture, Requirements and Technology Selection
Real-time data processing capability has become a key competitive factor for enterprises. Initially, each new requirement spawned a separate real-time task,...
Apache Griffin Configuration: pom.xml, sparkProperties and Build Startup
Apache Griffin is an open-source data quality management framework designed to help organizations monitor and improve data quality in big data environments.
Big Data #258: Griffin with Livy - Architecture, Installation and Hadoop/Hive Configuration
Livy is a REST interface for Apache Spark designed to simplify Spark job submission and management, especially in big data processing scenarios. Its main...
Big Data #257: Data Quality Monitoring - Monitoring Methods and Griffin Architecture
Apache Griffin is an open-source big data quality solution that supports both batch and streaming data quality detection. It can measure data assets from...
Flink CEP: Complex Event Processing & Pattern Matching
Flink CEP detailed explanation: pattern sequence, individual patterns, combined patterns, matching skip strategies and practical cases.
Big Data #256: Atlas Installation - Service Startup, Web Access and Hive Lineage Import
Metadata (MetaData) in the narrow sense refers to data that describes data. Broadly, all information beyond business data used to maintain system operation can...
Big Data #255: Atlas Data Warehouse Metadata Management - Data Lineage and Metadata
Atlas is a metadata framework for the Hadoop platform: a set of scalable core governance services enabling enterprises to effectively meet compliance...
Flink Memory Management: Network Buffer, State Backend & GC Tuning
Flink memory model detailed explanation: Network Buffer Pool, Task Heap, State Backend memory allocation, GC tuning and backpressure handling.
Flink Parallelism: Operator Chaining, Slot & Resource Scheduling
Flink parallelism detailed explanation: Operator Chaining, Slot allocation strategy, parallelism settings and resource scheduling principle.
Airflow Core Trade Task Scheduling Integration for Offline Data Warehouse
Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks. Originally developed by Airbnb...
Airflow Core Concepts: DAG, Operators, Tasks and Python Script Writing
Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks. Originally developed by Airbnb...
Airflow Crontab Scheduling: Introduction, Task Integration and Getting Started
Linux systems use the cron (crond) service for scheduled tasks, which is enabled by default. Linux also provides the crontab command for user-level task...
Offline Data Warehouse ADS Layer and Airflow Task Scheduling System
Apache Airflow is an open-source task scheduling and workflow management platform, primarily used for developing, debugging, and monitoring data pipelines.
Apache Airflow Installation and Deployment for Offline Data Warehouse
Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks. Originally developed by Airbnb...
Flink Broadcast State: BroadcastState Practice & Rule Updates
Flink Broadcast State explanation: BroadcastState principle, dynamic rule updates, state partitioning and memory management, demonstrating broadcast stream and non-broadcast stream join through cases.
Flink State Backend: State Storage & Performance Optimization
Flink State Backend detailed explanation: HashMapStateBackend, EmbeddedRocksDBStateBackend selection, memory configuration and performance tuning.
Offline Data Warehouse - Hive Order Zipper Table: Incremental Refresh Implementation
This article continues the zipper table practice, focusing on order history state incremental refresh. It explains how to use ODS daily incremental table + DWD zipper table to preserve order historical states at low cost while supporting daily backtracking and change analysis.
Offline Data Warehouse Dimension Tables: Product Category, Region Organization, Product Information
First determines fact tables vs dimension tables: green indicates fact tables, gray indicates dimension tables. Dimension table processing strategies vary by...
Offline Data Warehouse DWD and DWS Layer: Table Creation and ETL Scripts
The order table is a periodic fact table; zipper tables can be used to retain order status history. Order detail tables are regular fact tables. Order statuses...
Offline Data Warehouse - Hive Zipper Table Practice: Order History State Incremental Refresh
This article provides a practical guide to Hive zipper tables for preserving order history states at low cost in offline data warehouses, supporting daily backtracking and change analysis. It covers incremental refresh of order status changes using 2020 order data as a case study.
Offline Data Warehouse - Hive Zipper Table Practice: Initialization, Incremental Update, Rollback Script
This article provides a practical guide to Hive zipper table implementation for offline data warehouse modeling, covering initial loading, daily incremental updates, historical version chain closing, Shell scheduling scripts, and rollback recovery logic.
Offline Data Warehouse - Hive Zipper Table Getting Started: SCD Types, Table Creation and Loading
This article systematically covers Slowly Changing Dimensions (SCD), detailing the core differences between SCD Type 0, 1, 2, 3, 4, and 6, and explains the applicable boundaries of snapshot tables and zipper tables in Hive offline data warehouse scenarios.
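The close-and-open mechanics of an SCD Type 2 zipper table can be sketched outside Hive. A toy Python model (the `apply_increment` helper and the sentinel end date are illustrative, not the article's SQL):

```python
OPEN_END = "9999-12-31"  # sentinel end_date marking the currently valid version

def apply_increment(chain, day_rows, dt):
    """Apply one day's changed rows (key -> new value) to an SCD Type 2 chain.

    chain: list of dicts with key, value, start_date, end_date.
    For each changed key, close the open version and open a new one at dt.
    """
    changed = dict(day_rows)
    out = []
    for row in chain:
        if row["end_date"] == OPEN_END and row["key"] in changed:
            row = {**row, "end_date": dt}  # close the current version at dt
        out.append(row)
    for key, value in changed.items():
        out.append({"key": key, "value": value,
                    "start_date": dt, "end_date": OPEN_END})
    return out

chain = [{"key": "order1", "value": "created",
          "start_date": "2020-01-01", "end_date": OPEN_END}]
chain = apply_increment(chain, {"order1": "paid"}, "2020-01-02")
current = [r for r in chain if r["end_date"] == OPEN_END]  # latest state only
```

History stays queryable: filtering on `start_date <= d < end_date` reconstructs any day's snapshot, which is exactly the backtracking property zipper tables buy.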
Offline Data Warehouse: Hive ODS Layer Table Creation and Partition Loading Practice
Sync MySQL data to specified HDFS directory via DataX, then create ODS external tables in Hive with unified dt string partitioning. Enables fast queries of raw transaction records within 7 days, demonstrating core characteristics of ODS layer.
Offline Data Warehouse: E-commerce Core Transaction Incremental Import (DataX - HDFS - Hive Partition)
Using DataX (MySQLReader + HDFSWriter) to extract daily incremental data from MySQL order tables, order detail tables, and product information tables into...
Offline Data Warehouse Practice: E-commerce Core Transaction Data Model & MySQL Source Table Design
Focusing on three main metrics: order count, product count, payment amount, breakdown analysis dimensions by sales region and product type (3-level category).
Flink State and Checkpoint: State Management, Fault Tolerance & Savepoint
Flink stateful computation explanation: Keyed State, Operator State, Checkpoint configuration, Savepoint backup and recovery, production environment practices.
Offline Data Warehouse Advertising Business Hive ADS Practice: DataX Export HDFS Partition Table to MySQL
Complete solution for exporting Hive ADS layer data to MySQL using DataX. Covers ADS loading, DataX configuration, MySQL table creation, Shell script parameterized execution, and common error diagnosis and fix checklist.
Offline Data Warehouse Advertising Business: Flume Import Logs to HDFS, Complete Hive ODS/DWD Layer Loading
Using Flume Agent to collect event logs and write to HDFS, then use Hive scripts to complete ODS and DWD layer data loading by date. Content covers Flume Agent's Source, Channel, Sink basic structure, log file upload, Flume startup command, HDFS write verification.
Offline Data Warehouse Advertising Business Hive Analysis Practice: ADS Click-Through Rate, Purchase Rate and Top100 Ranking
Implementation of advertising impression, click, purchase hourly statistics based on Hive offline data warehouse, completing CTR, CVR and advertising effect...
Flink Streaming Introduction: DataStream API & Program Structure
Flink DataStream API getting started guide, program execution flow, environment acquisition, data source definition, operator chaining and execution mode details, demonstrating stream processing program development through WordCount case.
Flink Window and Watermark: Time Windows, Tumbling/Sliding, Session Windows & Late Data Processing
Comprehensive analysis of Flink Window mechanism: tumbling windows, sliding windows, session windows, Watermark principle and generation strategies, late data processing mechanism.
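Flink generates watermarks internally, but the bounded-out-of-orderness rule (watermark = maximum event time seen minus the allowed lateness) is easy to illustrate. A toy Python sketch of the idea, not Flink's API:

```python
class BoundedOutOfOrderness:
    """Toy watermark generator: watermark = max event time seen - max_delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_ts = float("-inf")

    def on_event(self, event_ts):
        # Out-of-order (smaller) timestamps never move the watermark backwards.
        self.max_ts = max(self.max_ts, event_ts)
        return self.current_watermark()

    def current_watermark(self):
        return self.max_ts - self.max_delay

wm = BoundedOutOfOrderness(max_delay=2)
marks = [wm.on_event(ts) for ts in [10, 12, 11, 15]]  # event 11 arrives late
# A window ending at time 13 would only fire once the watermark reaches 13,
# i.e. after the event with timestamp 15 is seen.
```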
Offline Data Warehouse Hive Advertising Business Practice: ODS→DWD Event Parsing, Advertising Detail and Conversion Analysis
Hive offline data warehouse advertising business practice, combined with typical pipeline of Flume + Hive + UDF + Parquet, demonstrates how to map raw event...
Offline Data Warehouse Member Metrics Verification, DataX Export and Advertising Business ODS/DWD/ADS Full Process
Offline data warehouse practice based on Hadoop + Hive + HDFS + DataX + MySQL, covering member metrics testing (active/new/retention), HDFS export, DataX sync to MySQL, and advertising business ODS/DWD/ADS full process modeling.
Offline Data Warehouse Practice: Flume+HDFS+Hive Building ODS/DWD/DWS/ADS Member Analysis Pipeline
Demonstrates a complete pipeline from log collection to member metric analysis, covering Flume Taildir monitoring, HDFS partition storage, Hive external table loading, ODS/DWD/DWS/ADS layered processing, supporting active members, new members, member retention metrics calculation.
Flink Installation & Deployment: Local, Standalone, YARN Modes
Complete tutorial for Apache Flink installation and deployment in three modes: Local, Standalone cluster, and YARN integration, including environment configuration, parameter tuning, and common issue solutions.
Flink on YARN Deployment: Environment Preparation, Resource Application & Job Submission
Detailed explanation of three Flink deployment modes on YARN cluster: Session, Application, Per-Job modes, Hadoop dependency configuration, YARN resource application and job submission process.
Offline Data Warehouse Hive ADS Export MySQL DataX Practice: Configuration and Pitfalls
The landing path for exporting Hive ADS layer tables to MySQL in an offline data warehouse. Gives the typical DataX solution (hdfsreader -> mysqlwriter), focusing on DataX JSON configuration and common error fixes.
Offline Data Warehouse Retention Rate Implementation: DWS Detail Modeling + ADS Aggregation
Implementation of 'Member Retention' in offline data warehouse: DWS layer uses dws_member_retention_day table to join new member and startup detail tables to...
Offline Data Warehouse Hive New Member & Retention: DWS Detail + ADS Summary One-Pass
The offline data warehouse calculates 'new members' daily, providing a consistently defined data foundation for the subsequent 'member retention' metric. It uses the full member table (with first-seen dt) as the deduplication anchor; the DWS layer outputs new-member details and the ADS layer outputs the new-member count.
Apache Flink Introduction: Unified Stream-Batch Real-Time Computing Engine
Systematic introduction to Apache Flink's origin, core features, and architecture components: JobManager, TaskManager, Dispatcher responsibilities, unified stream-batch processing model, and comparison with Spark Streaming for technology selection.
Offline Data Warehouse Hive Practice: DWD to DWS Daily/Weekly/Monthly Active Member ADS Metrics Implementation
This article introduces using Hive to build an offline data warehouse for calculating active members (daily/weekly/monthly). Covers the complete flow from DWD...
Hive ODS Layer JSON Parsing: UDF Array Extraction, explode and JsonSerDe
JSON data processing in Hive offline data warehouse, covering three most common needs: 1) Extract array fields from JSON strings and explode-expand in SQL; 2)...
Hive ODS Layer Practice: External Table Partition Loading and JSON Parsing
Engineering implementation of the ODS (Operational Data Store) layer in an offline data warehouse: the minimal closed loop of Hive external table + daily...
Spark Streaming Integration with Kafka: Receiver and Direct Mode Complete Analysis
Detailed explanation of two Spark Streaming integration modes with Kafka: Receiver-based high-level API vs Direct mode architecture differences, offset management, Exactly-Once semantics guarantee, and complete Scala code implementation.
Flume Taildir + Custom Interceptor: Extract JSON Timestamp to Write HDFS Daily Partitions
An engineering implementation of Apache Flume's offline log collection chain: use a Taildir Source to monitor multiple directories and multiple...
Flume 1.9.0 Custom Interceptor: TAILDIR Multi-directory Collection with logtime/logtype HDFS Partitioning
Using TAILDIR Source to monitor multiple directories (start/event), using filegroups headers to mark different sources with logtype; then using custom...
Flume Optimization for Offline Data Warehouse: batchSize, Channel, Compression and OOM Fix
Flume 1.9.0 engineering optimization in offline data warehouse (log collection→HDFS) scenarios, giving actionable value ranges and trade-off principles for key...
Spark DStream Transformation Operators: map, reduceByKey, transform Practice
Systematically review Spark Streaming DStream stateless transformation operators and transform advanced operations, demonstrating three implementation approaches for blacklist filtering: leftOuterJoin, SQL, and broadcast variables.
Spark Streaming Window Operations & State Tracking: updateStateByKey and mapWithState
In-depth explanation of Spark Streaming stateful computing: window operation parameter configuration, reduceByKeyAndWindow hot word statistics, updateStateByKey full-state maintenance and mapWithState incremental optimization, with complete Scala code.
Offline Data Warehouse Member Metrics Practice
Offline data warehouse member metrics practice with Flume, covering new members, active members, and retention metrics.
How to Build an Offline Data Warehouse: Tracking → Metric System → Theme Analysis, Full Chain Guide
Complete guide to building offline data warehouses: data tracking, metric system design, theme analysis, and standardization practices for e-commerce teams.
Offline Data Warehouse Architecture Selection and Cluster Sizing: Apache vs CDH/HDP, Full Component List + Naming Standards
Offline Data Warehouse (Offline DW) overall architecture design and implementation method: Framework selection comparison between Apache community version and...
Offline Data Warehouse Layering: ODS/DWD/DWS/DIM/ADS Architecture Guide
When implementing an Offline Data Warehouse in an enterprise, the two most common problems are: data silos caused by data mart expansion, and duplicate...
Offline Data Warehouse Modeling Practice
Offline data warehouse modeling practice, covering fact tables, dimension tables, fact types, and snowflake/galaxy models.
Spark Streaming Introduction: From DStream to Structured Streaming Evolution
Introduction to Spark's two generations of real-time computing frameworks: DStream micro-batch processing model's architecture and limitations, and how Structured Streaming solves EventTime processing and API consistency issues through unbounded table model and Catalyst optimization.
Spark Streaming Data Sources: File Stream, Socket, RDD Queue Stream
Comprehensive explanation of three Spark Streaming basic data sources: file stream directory monitoring, Socket TCP ingestion, RDD queue stream for testing simulation, with complete Scala code examples.
Grafana 11.3.0 Installation & Startup: YUM Install RPM, systemd Management, Login & Common Pitfalls
For ops/devs still running CentOS/RHEL (or compatible distributions) in 2026, provides a Grafana 11.3.0 (grafana-enterprise-11.3.0-1.x86_64.rpm) direct YUM...
Data Warehouse Introduction: Four Characteristics, OLTP vs OLAP Differences & Enterprise Data Warehouse Architecture
An engineering-practice view (2026) of core data warehouse concepts and implementation concerns: starting from enterprise data silos, explaining the four...
Prometheus 2.53.2 Installation & Configuration Practice: Scrape Targets, Exporter, Alert Chain & Common Troubleshooting
Prometheus 2.53.2 (still common in existing environments in 2025/2026) provides a reusable deployment process: download and extract binary on monitoring...
Prometheus Node Exporter 1.8.2 + Pushgateway 1.10.0: Download, Start, Integration & Pitfalls
A common Prometheus monitoring deployment scenario: install node_exporter-1.8.2 on Rocky Linux (CentOS/RHEL compatible) to expose host metrics, integrate with Prometheus...
sklearn KMeans Key Attributes & Evaluation: cluster_centers_, inertia_, Silhouette Score for K Selection
A 2026 guide to scikit-learn (sklearn) KMeans, explaining its three most commonly used attributes: cluster_centers_ (cluster centers), inertia_ (within-cluster sum of squares),...
KMeans n_clusters Selection: Silhouette Score Practice + init/n_init/random_state Version Pitfalls (scikit-learn 1.4+)
KMeans n_clusters selection method: calculate silhouette_score and silhouette_samples on candidate cluster numbers (e.g., 2/4/6/8), determine optimal k by...
SparkSQL Statements: DataFrame Operations, SQL Queries & Hive Integration
Comprehensive guide to SparkSQL core usage including DataFrame API operations, SQL query syntax, lateral view explode, and Hive integration via enableHiveSupport for metadata and table operations.
SparkSQL Kernel: Five Join Strategies & Catalyst Optimizer Analysis
Deep dive into SparkSQL's five Join execution strategies (BHJ, SHJ, SMJ, Cartesian, BNLJ) selection conditions and use cases, along with the complete processing flow of Catalyst optimizer from SQL parsing to code generation.
Python Hand-Written K-Means Clustering on Iris Dataset: From Distance Function to Iterative Convergence & Pitfalls
Python K-Means clustering implementation: using NumPy broadcasting to compute squared Euclidean distance (distEclud), initializing centroids via uniform...
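The broadcasting trick behind `distEclud` fits in a few lines of NumPy (the function name follows the summary; the sample points and centroids are illustrative):

```python
import numpy as np

def dist_eclud(points, centroids):
    """Squared Euclidean distance from every point to every centroid.

    points: (n, d), centroids: (k, d) -> (n, k) distance matrix.
    Broadcasting: (n, 1, d) - (1, k, d) expands to (n, k, d).
    """
    diff = points[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=2)

points = np.array([[0.0, 0.0], [3.0, 4.0]])
centroids = np.array([[0.0, 0.0], [3.0, 0.0]])
d = dist_eclud(points, centroids)
labels = d.argmin(axis=1)  # assignment step of one K-Means iteration
```

The update step (recomputing each centroid as the mean of its assigned points) then alternates with this assignment until the labels stop changing.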
K-Means Clustering Practice: Self-Implemented Algorithm Verification + sklearn KMeans Parameters/labels_/fit_predict Quick Guide
K-Means clustering provides an engineering workflow that is 'verifiable, reproducible, and debuggable': first use 2D testSet dataset for algorithm verification...
Scikit-Learn Logistic Regression Implementation: max_iter, Classification Method & Multiclass Optimization
When using Logistic Regression in Scikit-Learn, max_iter controls the maximum number of iterations, which affects convergence speed and accuracy. If training doesn't...
K-Means Clustering Guide: From Unsupervised Concepts to Inertia, K Selection & Pitfalls
K-Means clustering algorithm, comparing supervised vs unsupervised learning (whether labels Y are needed), with engineering applications in customer...
Deep Understanding of Logistic Regression & Gradient Descent Optimization Algorithm
Logistic Regression (LR) is an important classification algorithm in machine learning, widely used in binary classification tasks like sentiment analysis,...
How to Implement Logistic Regression in Scikit-Learn and Regularization Detailed (L1 and L2)
As C gradually increases, regularization strength decreases, and model performance on both training and test sets trends upward until around C=0.8, when training...
SparkSQL Core Abstractions: RDD, DataFrame, Dataset & SparkSession
Deep comparison of Spark's three data abstractions RDD, DataFrame, Dataset features and use cases, introduction to SparkSession unified entry, and demonstration of mutual conversion methods between abstractions.
SparkSQL Operators: Transformation & Action Operations
Systematically review SparkSQL Transformation and Action operators, covering select, filter, join, groupBy, union operations, with practical test cases demonstrating usage and performance optimization.
How to Handle Multicollinearity: Common Problems & Solutions in Least Squares Linear Regression
When using scikit-learn for linear regression, how to handle multicollinearity in least squares method. Multicollinearity may cause instability in regression...
Ridge Regression and Lasso Regression: Differences, Applications and Selection Guide
Ridge Regression and Lasso Regression are two commonly used linear regression regularization methods for solving overfitting and multicollinearity in machine...
Linear Regression Machine Learning Perspective: Matrix Representation, SSE Loss & Least Squares
Linear regression's core chain: unify the prediction function as y=Xw in matrix form, treating the parameter vector w as the only unknown; use a loss function to characterize...
NumPy Matrix Multiplication Hand-written Multivariate Linear Regression: Normal Equation, SSE/MSE/RMSE & R²
Hand-written multivariate linear regression using pandas DataFrames and NumPy matrix multiplication. The core idea is to form the normal...
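The normal-equation solution and the SSE/MSE/RMSE and R² metrics named in this summary fit in a short NumPy example (toy noise-free data, illustrative only):

```python
import numpy as np

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 3.0, 4.0]])  # first column of ones = intercept term
y = np.array([6.0, 8.0, 9.0, 19.0])

# Normal equation: w = (X^T X)^{-1} X^T y, solved as a linear system
# rather than via an explicit matrix inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)

pred = X @ w
sse = ((y - pred) ** 2).sum()          # sum of squared errors
mse = sse / len(y)                     # mean squared error
rmse = mse ** 0.5                      # root mean squared error
r2 = 1 - sse / ((y - y.mean()) ** 2).sum()  # coefficient of determination
```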
sklearn Decision Tree Pruning Parameters: max_depth/min_samples_leaf to min_impurity_decrease
Common parameters for decision tree pruning (pre-pruning) in engineering: max_depth, min_samples_leaf, min_samples_split, max_features, min_impurity_decrease...
Confusion Matrix to ROC: Complete Review of Imbalanced Binary Classification Evaluation Metrics
The confusion matrix (TP, FP, FN, TN) establishes a unified vocabulary, then explains the business meaning of Accuracy, Precision, Recall (also called sensitivity), and the F1 Measure: Precision...
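The four metrics reduce to simple ratios of the confusion-matrix counts. A quick sketch showing why accuracy alone misleads on imbalanced data (the counts are illustrative):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives, how many are found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of P and R
    return accuracy, precision, recall, f1

# Imbalanced example: 990 negatives vs 10 positives. Accuracy looks great,
# but precision/recall expose that half the positives are missed.
acc, p, r, f1 = binary_metrics(tp=5, fp=5, fn=5, tn=985)
```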
Spark Standalone Mode: Architecture & Performance Tuning
Comprehensive explanation of Spark Standalone cluster four core components, application submission flow, SparkContext internal architecture, Shuffle evolution history and RDD optimization strategies.
SparkSQL Introduction: SQL & Distributed Computing Fusion
Systematic introduction to SparkSQL evolution history, core abstractions DataFrame/Dataset, Catalyst optimizer principle, and practical usage of multi-data source integration with Hive/HDFS.
Decision Tree from Split to Pruning: Information Gain/Gain Ratio, Continuous Variables & CART Essentials
The complete chain from 'split' to 'pruning': explains why decision trees usually use a greedy algorithm that yields a 'local optimum', and the differences in splitting criteria between...
sklearn Decision Tree Practice: criterion, Graphviz Visualization & Pruning to Prevent Overfitting
Complete flow of DecisionTreeClassifier on load_wine dataset from data splitting, model evaluation to decision tree visualization (2026 version). Focus on...
Decision Tree Model Detailed: Node Structure, Conditional Probability Perspective & Shannon Entropy Calculation
Decision Tree model systematic overview for classification tasks: three types of nodes (root/internal/leaf), recursive split flow from root to leaf, and...
Decision Tree Information Gain Detailed: Information Entropy, ID3 Feature Selection & Python Optimal Split Implementation
Decision tree information gain (Information Gain) explained: first uses information entropy (Entropy) to quantify impurity, then explains why, when splitting...
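Entropy and information gain are short formulas: gain is the parent's entropy minus the size-weighted entropy of the child groups. A minimal Python sketch (the labels are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted

parent = ["yes", "yes", "no", "no"]        # 50/50 mix: entropy = 1 bit
split = [["yes", "yes"], ["no", "no"]]     # a perfectly separating split
gain = information_gain(parent, split)     # gain = 1 bit, the maximum here
```

ID3 picks, at each node, the feature whose split maximizes this quantity.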
K-Fold Cross-Validation Practice: sklearn Look at Mean/Variance, Select More Stable KNN Hyperparameters
Random train/test split causes evaluation metrics to be unstable, and gives engineering solution: K-Fold Cross Validation. Through sklearn's cross_val_score to...
KNN Must Normalize First: Min-Max Correct Method, Data Leakage Pitfall & sklearn Implementation
In scikit-learn machine learning training pipeline, distance-based models like KNN are extremely sensitive to inconsistent feature scales: Euclidean distance...
Spark RDD Fault Tolerance: Checkpoint Principle & Best Practices
Detailed explanation of Spark Checkpoint execution flow, core differences with persist/cache, partitioner strategies, and best practices for iterative algorithms and long dependency chain scenarios.
Spark Broadcast Variables: Efficient Shared Read-Only Data
Detailed explanation of Spark broadcast variable working principle, configuration parameters and best practices, and performance optimization solution using broadcast to implement MapSideJoin instead of shuffle join.
KNN/K-Nearest Neighbors Algorithm Practice: Euclidean Distance + Voting Mechanism Handwritten Implementation, with Visualization & Tuning Points
KNN/K-Nearest Neighbors Algorithm: From Euclidean distance calculation, distance sorting, TopK voting to function encapsulation, giving reproducible Python...
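The distance-sort-vote loop described above fits in a dozen lines of standard-library Python (the function name and sample points are illustrative):

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Squared Euclidean distance is enough for ranking (no sqrt needed).
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_x, train_y)
    )
    top_k = [label for _, label in dists[:k]]   # TopK neighbors
    return Counter(top_k).most_common(1)[0][0]  # majority vote

train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["A", "A", "A", "B", "B", "B"]
label = knn_predict(train_x, train_y, (0.5, 0.5), k=3)  # → "A"
```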
scikit-learn KNN Practice: KNeighborsClassifier, kneighbors & Learning Curve to Select Optimal
From the unified estimator API (fit/predict/transform/score) to kneighbors for finding the K nearest neighbors of test samples, then using learning and parameter curves to select...
Apache Tez Practice: Hive on Tez Installation & Configuration, DAG Principles & Common Pitfalls
Apache Tez (example version Tez 0.9.x) as execution engine alternative to MapReduce on Hadoop2/YARN, providing DAG (Directed Acyclic Graph) execution model for...
Data Mining: From Wine Classification to Machine Learning Overview - Supervised/Unsupervised/Reinforcement Learning, Feature Space & Overfitting
2025's most commonly used machine learning concept framework: supervised learning (classification/regression), unsupervised learning (clustering/dimensionality...
Elasticsearch Cluster Planning & Tuning: Node Roles, Shard/Replica, Write & Search Optimization
Master / Data / Coordinating node responsibilities and production role isolation strategies, capacity planning calculations (JVM Heap 30-32GB limit, hot/cold data with disk/IO constraints, horizontal scaling path), plus shard and replica as core knobs for performance and reliability.
DataX 3.0 Architecture & Practice: Reader/Writer Plugin Model, Job/TaskGroup Scheduling, speed/errorLimit Config
DataX (DataX 3.0) is an offline data synchronization/data integration tool widely used and open-sourced within Alibaba, for enterprise-level heterogeneous data...
Spark Super WordCount: Text Cleaning & MySQL Persistence
Implement a complete production-ready word-frequency pipeline: lowercase conversion, punctuation removal, stop-word filtering, word-frequency counting, and finally efficient writes to MySQL via foreachPartition, comparing row-by-row insert vs partition batch write performance.
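The cleaning and counting stages of that pipeline can be sketched locally in plain Python (the article runs these as Spark transformations and persists to MySQL; the stop-word list and function names here are illustrative assumptions).

```python
# Local sketch of the cleaning + counting steps: lowercase → strip
# punctuation → filter stop words → count word frequencies.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of"}  # example stop-word list

def super_word_count(lines):
    counts = Counter()
    for line in lines:
        # lowercase conversion + punctuation removal
        cleaned = re.sub(r"[^a-z0-9\s]", " ", line.lower())
        # stop-word filtering + word-frequency counting
        counts.update(w for w in cleaned.split() if w not in STOP_WORDS)
    return counts

lines = ["Hello, Spark!", "hello the WORLD; spark and Flink."]
print(super_word_count(lines).most_common(3))
```

In the Spark version each step maps to an RDD operator (map/filter/reduceByKey), and the final Counter is replaced by a foreachPartition batch write so one JDBC connection serves a whole partition.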
Spark Serialization & RDD Execution Principle
Deep dive into Spark Driver-Executor process communication, Java/Kryo serialization selection, closure serialization problem troubleshooting, and RDD dependencies, Stage division and persistence storage levels.
Nginx JSON Logs to ELK: ZK+Kafka+Elasticsearch 7.3.0+Kibana Practice Setup
Configure Nginx log_format json to output structured access_log (containing @timestamp, request_time, status, request_uri, ua and other fields), start...
Filebeat → Kafka → Logstash → Elasticsearch Practice
Filebeat collects Nginx access.log and writes it to Kafka; Logstash consumes from Kafka, parses the JSON embedded in the message, routes by field (app/type) conditions, adds...
Logstash Filter Plugin Practice: grok Parsing Console & Nginx Logs (7.3.0 Config Reusable)
Article explains using grok in Logstash 7.3.0 environment to extract structured fields from console stdin and Nginx access logs (IP, time_local, method, request, status etc), and quickly verify parsing effect through stdout { codec => rubydebug }.
Logstash Output Plugin Practice: stdout/file/Elasticsearch Output Config & Tuning
Logstash Output plugin (Logstash 7.3.0) practical tutorial, covering stdout (rubydebug) for debugging, file output for local archiving, Elasticsearch output...
Logstash 7 Getting Started: stdin/file Collection, sincedb/start_position Mechanism & Troubleshooting
Logstash 7 getting started tutorial, covering stdin/file collection, sincedb mechanism and start_position effect conditions, with error quick reference table
Logstash JDBC vs Syslog Input: Principle, Scenario Comparison & Reusable Config (Based on Logstash 7.3.0)
Logstash Input plugin comparison, breaking down the technical differences between the JDBC Input and Syslog collection pipelines, applicable scenarios and key configs. JDBC...
Spark Scala WordCount Implementation
Implement distributed WordCount using Spark + Scala and Spark + Java, detailed RDD five-step processing flow, Maven project configuration and spark-submit command.
Spark Scala Practice: Pi Estimation & Mutual Friends
Deep dive into Spark RDD programming through two classic cases: Monte Carlo method distributed Pi estimation, and mutual friends analysis in social networks with two approaches, comparing Cartesian product vs data transformation performance differences.
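The Monte Carlo idea behind the Pi case is independent of Spark and can be sketched in a few lines of stdlib Python (the article distributes the sampling loop across RDD partitions; the function name and sample count below are illustrative).

```python
# Monte Carlo Pi estimation: sample points in the unit square and
# count those landing inside the quarter circle of radius 1.
import random

def estimate_pi(n_samples, seed=42):
    rng = random.Random(seed)  # fixed seed for reproducibility
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    # area ratio: quarter circle / unit square = pi / 4
    return 4.0 * inside / n_samples

print(estimate_pi(100_000))  # ≈ 3.14 (stochastic; accuracy grows with n)
```

The Spark version parallelizes the range, maps each element to a 0/1 hit, and sums with reduce, which is why the case is a good first exercise in distributing an embarrassingly parallel loop.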
Elasticsearch Concurrency Conflicts & Optimistic Lock, Distributed Data Consistency Analysis
Elasticsearch concurrency conflicts (inventory deduction read-modify-write): breaks down the cause of write overwrites and gives an engineering solution using ES optimistic...
Elasticsearch Doc Values Mechanism Detailed: Columnar Storage Supporting Sort/Aggregation/Script
A disk-based columnar data structure generated at indexing time, optimized for sorting, aggregation and script values; enabled by default for most supported types. Text fields don't provide doc values by default, so aggregation/sorting requires a keyword subfield or enabling fielddata.
Elasticsearch Segment Merge & Disk Directory Breakdown: Merge Policy, Force Merge, Shard File Structure
Explains why refresh causes an increase in small segments, how segment merge combines small segments into larger ones in the background and cleans up deleted documents, why too...
Elasticsearch Inverted Index Underlying Breakdown: Terms Dictionary, FST, SkipList & Lucene Index Files
The article details the core data structures of the Elasticsearch inverted index: Terms Dictionary, Posting List, FST (Finite State Transducer) and SkipList, and how they accelerate...
Elasticsearch Inverted Index & Read/Write Process Full Analysis: From Lucene Principles to Query/Fetch Practice
Article analyzes Elasticsearch inverted index principle based on Lucene, compares forward index vs inverted index differences, covering core concepts like...
Elasticsearch Near Real-time Search: Segment, Refresh, Flush, Translog Full Process Analysis
Article details core mechanism of Elasticsearch near real-time search, including Lucene Segment, Memory Buffer, File System Cache, Refresh, Flush and Translog...
Spark Action Operations Overview
Comprehensive introduction to Spark RDD Action operations, covering data collection, statistical aggregation, element retrieval, storage output categories, and detailed explanation of Key-Value RDD core operators like groupByKey, reduceByKey, join.
Elasticsearch Aggregation Practice: Metrics Aggregations & Bucket Aggregations Complete Usage & DSL Analysis
Covers complete practice of Metrics Aggregations and Bucket Aggregations, applicable to common Elasticsearch 7.x / 8.x versions in 2025. Article starts with...
Elasticsearch 7.3 Java Practice: Index & Document CRUD Full Process Examples
elasticsearch-rest-high-level-client implements index and document CRUD, including: creating an index in two ways (JSON and XContentBuilder), configuring shards and replicas, deleting an index, inserting a single document, querying a document by ID, and using match_all to query all data.
Elasticsearch Term Exact Query & Bool Combination Practice: range/regexp/fuzzy Full Examples
This article demonstrates Elasticsearch term-level queries including term, terms, range, exists, prefix, regexp, fuzzy, ids queries, and bool compound queries. Covers creating book index, inserting sample data, various query DSL examples and execution results.
Elasticsearch Filter DSL Full Practice: Filter Query, Sort Pagination, Highlight & Batch Operations
This article introduces the difference between Filter DSL and queries: Filter DSL doesn't calculate relevance scores and is specifically optimized for filter-scenario execution...
Elasticsearch Mapping & Document CRUD Practice (Based on 7.x/8.x)
This article details Elasticsearch 7.x/8.x mapping config and document CRUD operations, including index/field mapping creation, mapping properties (type, index, store, analyzer), document create, query, full/partial update, delete by ID or condition.
Elasticsearch Query DSL Practice: match/match_phrase/query_string/multi_match Full Analysis
In-depth explanation of core Query DSL usage in Elasticsearch 7.3, focusing on differences and pitfalls of match, match_phrase, query_string, multi_match and other full-text search statements in real business scenarios.
Spark Cluster Architecture & Deployment Modes
Deep dive into Spark cluster core components Driver, Cluster Manager, Executor responsibilities, comparison of Standalone, YARN, Kubernetes deployment modes, and static vs dynamic resource allocation strategies.
Elasticsearch-Head & Kibana 7.3.0 Practice: Installation Points, Connectivity & Common Pitfalls
Introduction to Elasticsearch-Head plugin and Kibana 7.3.0 installation and connectivity points, covering Chrome extension quick access, ES cluster health and...
Elasticsearch Index Operations & IK Analyzer Practice: 7.3/8.15 Full Process Quick Reference
Elasticsearch index create, existence check (single/multi/all), open/close/delete and health troubleshooting, as well as IK analyzer installation, ik_max_word/ik_smart analysis and Nginx hosting scheme for remote extended dictionary/stop words.
Elasticsearch Getting Started: Index/Document CRUD & Search Minimum Examples
Elasticsearch (ES 7.x/8.x) minimum examples: create an index, insert a document, query by ID, update, and the _search flow, with return samples and screenshots, helping readers complete an 'index/document CRUD' run-through in 3-10 minutes.
Elasticsearch 7.3.0 Three-Node Cluster Practice: Directory/Parameter/Startup to Online
Elasticsearch 7.3.0 three-node cluster deployment practice tutorial, covering directory creation and permission settings, system parameter config...
ELK Elastic Stack (ELK) Practice: Architecture Key Points, Indexing & Troubleshooting Checklist
Article introduces core capabilities and common practices of Elasticsearch 8.x, Logstash 8.x, Kibana 8.x, covering key aspects of centralized logging system: collection, transmission, indexing, shard/replica, query DSL, aggregation and ILM lifecycle management.
Elasticsearch Single Machine Cloud Server Deployment & Operation Detailed Process
Elasticsearch is a distributed full-text search engine that supports both single-node and cluster deployment. For many small-scale business scenarios, Single-Node Mode is sufficient.
Apache Kylin Cube7 Practice: Aggregation Group/RowKey/Encoding & Size Precision Comparison
Covers usage trade-offs of Aggregation Group, Mandatory Dimension, Hierarchy Dimension and Joint Dimension, and explains the impact of dictionary encoding, RowKey order and ShardBy sharding on build and query performance, with CubeStatsReader precision/sparsity readings and the RowKey/HBase storage model.
Apache Kylin 1.6 Streaming Cubing Practice: Kafka to Minute-level OLAP
Kafka→Kylin real-time OLAP pipeline, providing minute-level aggregation queries for common 2025 business scenarios (e-commerce transactions, user behavior, IoT monitoring).
Spark RDD Deep Dive: Five Key Features
Comprehensive analysis of Spark core data abstraction RDD's five key features (partitions, compute function, dependencies, partitioner, preferred locations), lazy evaluation, fault tolerance, and narrow/wide dependency principles.
Spark RDD Creation & Transformation Operations
Detailed explanation of three RDD creation methods (parallelize, textFile, transform from existing RDD), and usage of common Transformation operators like map, filter, flatMap, groupBy, sortBy with lazy evaluation principles.
Apache Kylin Segment Merge Practice: Manual/Auto Merge, Retention Policy & JDBC Examples
Apache Kylin Segment merge practice tutorial, covering manual MERGE Job flow, continuous Segment requirements, Auto Merge multi-level threshold strategy, Retention Threshold cleanup logic, deletion flow (Disable→Delete) and JDBC connection query examples.
Apache Kylin Cuboid Pruning Practice: Derived Dimensions & Expansion Rate Control
Cuboid pruning optimization: When there are many dimensions, Cuboid count grows exponentially, causing long build time and storage expansion. Engineering...
Apache Kylin Cube Practice: Complete Guide for Modeling, Build and Query Acceleration
Apache Kylin 4.0 Cube modeling and query acceleration method: Complete star modeling with fact tables and dimension tables, design dimensions and measures, use...
Apache Kylin Incremental Cube & Segment Practice: Daily Partition Incremental Build Guide
Using date field of Hive partitioned table as Partition Date Column, split Cube into multiple Segments, incrementally build by range to avoid repeated computation of historical data; also compare full build vs incremental build differences in query paths.
Apache Kylin Cube Practice: Hive Load & Pre-computation Acceleration (With Cuboid/Real-time OLAP, Kylin 4.x)
OLAP example: Generate dimension and fact data via Python, after Hive (wzk_kylin) load, design Cube in Kylin (dimensions/measures/Cuboids), and provide...
Apache Kylin Cube Practice: From Modeling to Build and Query (With Pitfalls & Optimization)
Apache Kylin (3.x/4.x) Cube setup and optimization: complete flow from DataSource → Model → Cube, covering dimension modeling, measure design, Cuboid...
From MapReduce to Spark: Big Data Computing Evolution
Systematic overview of big data processing engine evolution from MapReduce to Spark to Flink, analyzing Spark in-memory computing model, unified ecosystem and core components.
Apache Kylin Comprehensive Guide: MOLAP Architecture, Hive/Kafka Practice & Real-time OLAP
Background, evolution and engineering practice of Apache Kylin, focusing on MOLAP solution implementation path for massive data analysis. Core keywords: Apache...
Apache Kylin 3.1.1 Deployment on Hadoop 2.9/Hive 2.3/HBase 1.3 (With Pitfalls & Fixes)
Complete deployment record of Apache Kylin 3.1.1 on Hadoop 2.9.2, Hive 2.3.9, HBase 1.3.1, Spark 2.4.5 (without-hadoop, Scala 2.12) and three-node...
Kafka Storage Mechanism: Log Segmentation & Retention
Deep analysis of Kafka log storage architecture, including LogSegment design, sparse offset index and timestamp index principles, message lookup flow, and log retention and cleanup strategy configuration.
Kafka High Performance: Zero-Copy, mmap & Sequential Write
Deep dive into Kafka's three I/O technologies achieving high throughput: sendfile zero-copy, mmap memory mapping and page cache sequential write, revealing kernel-level optimization behind million messages per second.
Kafka Replica Mechanism: ISR & Leader Election
Deep dive into Kafka replica mechanism, including ISR sync node set maintenance, Leader election process, and unclean election trade-offs between consistency and availability.
Kafka Exactly-Once: Idempotence & Transactions
Systematic explanation of how Kafka achieves Exactly-Once semantics through idempotent producers and transactions, covering PID/sequence number principle, cross-partition transaction configuration and end-to-end EOS implementation.
Apache Druid Storage & Query Architecture: Segment/Chunk/Roll-up/Bitmap Explained
Apache Druid data storage and high-performance query path: from DataSource/Chunk/Segment layering, to columnar storage, Roll-up pre-aggregation, Bitmap...
Apache Druid + Kafka Real-time Analysis: JSON Flattening Ingestion & SQL Metrics Full Process
A Scala Kafka Producer writes order/click data to a Kafka Topic (example topic: druid2), with continuous ingestion into Druid through the Kafka Indexing Service. Since...
Apache Druid Real-time Kafka Ingestion: Complete Practice from Ingestion to Query
Complete practice of Apache Druid real-time Kafka ingestion, using network traffic JSON as example, completing data ingestion through Druid console's Streaming/Kafka wizard, parsing time column, setting dimensions and metrics, and verifying results with SQL.
Apache Druid Architecture & Component Responsibilities: Coordinator/Overlord/Historical Deep Dive
Apache Druid component responsibilities and deployment points from 0.13.0 to current (2025): Coordinator manages Historical node Segment...
Apache Druid Cluster Deployment [Part 1]: MySQL Metadata + HDFS Deep Storage & Low-Config Tuning
Apache Druid 30.0.0 deployable solution covering MySQL metadata storage (mysql-connector-java 8.0.19), HDFS deep storage and HDFS indexing-logs, plus Kafka...
Apache Druid Cluster Mode [Part 2]: Low-Memory Cluster Practice: JVM/DirectMemory & Startup Scripts
Low-memory cluster practice for Apache Druid 30.0.0 on three nodes: provides JVM parameters and runtime.properties key items for Broker/Historical/Router, and explains the relationship between off-heap memory and the processing buffer ratio.
Kafka Topic, Partition & Consumer: Rebalance Optimization
Deep dive into Kafka Topic, Partition, Consumer Group core mechanisms, covering custom deserialization, offset management and rebalance optimization configuration.
Kafka Topic Management: Commands & Java API
Comprehensive introduction to Kafka Topic operations, including kafka-topics.sh commands, replica assignment strategy principles, and KafkaAdminClient Java API core usage.
Apache Druid Real-time OLAP Architecture & Selection Points
Apache Druid real-time OLAP practice: suitable for event detail with time as primary key, sub-second aggregation and high-concurrency self-service analysis.
Apache Druid Single-Machine Deployment: Architecture Overview, Startup Checklist & Quick Troubleshooting
Apache Druid 30.0.0 for single-machine quick verification and engineering implementation, systematically reviewing Druid architecture (Coordinator, Historical,...
Flink Write to Kudu Practice: Custom Sink Full Process (Flink 1.11/Kudu 1.17/Java 11)
Complete runnable example for Kudu, based on Flink 1.11.1 (Scala 2.12)/Java 11 and kudu-client 1.17.0 (2025 test). Through RichSinkFunction custom sink,...
Kafka Producer Interceptor & Interceptor Chain
Introduction to Kafka 0.10 Producer interceptor mechanism, covering onSend and onAcknowledgement interception points, interceptor chain execution order and error isolation, with complete custom interceptor implementation.
Kafka Consumer: Consumption Flow, Heartbeat & Parameter Tuning
Detailed explanation of Kafka Consumer Group consumption model, partition assignment strategy, heartbeat keep-alive mechanism, and tuning practices for key parameters like session.timeout.ms, heartbeat.interval.ms, max.poll.interval.ms.
Apache Kudu Docker Quick Deployment: 3 Master/5 TServer Practice & Pitfalls Quick Reference
Apache Kudu Docker Compose quick deployment solution on Ubuntu 22.04 cloud host, covering Kudu Master and Tablet Server components,...
Java Access Apache Kudu: Table Creation to CRUD (Including KuduSession Flush Mode & Multi-Master Config)
Java client (kudu-client 1.4.0) connects to Apache Kudu with multiple Masters (example ports 7051/7151/7251), completes full process of table creation, insert,...
Apache Kudu: Real-time Write + OLAP Architecture, Performance & Integration
Apache Kudu in 2025 version and ecosystem integration: Latest Kudu 1.18.0 (2025/07) released, bringing segmented LRU Block Cache and RocksDB-based metadata...
Apache Kudu Architecture & Practice: RowSet, Partition & Raft Deep Dive
Apache Kudu's Master/TabletServer architecture, RowSet (MemRowSet/DiskRowSet) write/read path, MVCC, and Raft consensus role in replica and failover; provides...
ClickHouse MergeTree Partition/TTL, Materialized View, ALTER & system.parts Full Process Example
ClickHouse beginner and operations practice, based on real cluster (h121/h122/h123) demonstrating complete process from connection to database/table creation,...
Kafka Producer Message Sending Flow & Core Parameters
Deep analysis of Kafka Producer initialization, message interception, serialization, partition routing, buffer batch sending, ACK confirmation and complete sending chain, with key parameter tuning suggestions.
Kafka Serialization & Partitioning: Custom Implementation
Deep dive into Kafka message serialization and partition routing, including complete code for custom Serializer and Partitioner, mastering precise message routing and efficient transmission.
ClickHouse Replica Deep Dive: ReplicatedMergeTree + ZooKeeper from 0-1
ClickHouse replica full chain: ZK/Keeper preparation, macros configuration, ON CLUSTER consistent table creation, write deduplication & replication mechanism,...
ClickHouse Sharding × Replica × Distributed: ReplicatedMergeTree, Keeper, insert_quorum & Load Balancing
ClickHouse sharding × replica × Distributed architecture: Based on ReplicatedMergeTree + Distributed, using ON CLUSTER one-click table creation on 3-shard ×...
ClickHouse MergeTree Best Practices: Replacing Deduplication, Summing Aggregation, Partition Design & Materialized View Alternatives
ClickHouse two light aggregation engines ReplacingMergeTree and SummingMergeTree, combined with minimum runnable examples (MRE) and comparative queries,...
ClickHouse CollapsingMergeTree & External Data Sources: HDFS/MySQL/Kafka Integration
ClickHouse external data source engine minimum feasible solution: DDL templates, key parameters and read/write paths for ENGINE=HDFS, ENGINE=MySQL, ENGINE=Kafka.
ClickHouse MergeTree Practical Guide: Partition, Sparse Index & Merge Mechanism
ClickHouse MergeTree key mechanisms: batch writes form parts, background merge (Compact/Wide part formats), ORDER BY defines the sparse primary index,...
ClickHouse MergeTree Deep Dive: Partition Pruning × Sparse Primary Index × Marks × Compression
ClickHouse MergeTree storage and query path: column files (*.bin), sparse primary index (primary.idx), marker files (.mrk/.mrk2) and index_granularity...
Kafka Operations: Shell Commands & Java Client Examples
Covers Kafka daily operations: daemon startup, Shell topic management commands, and Java client programming (complete Producer/Consumer code) with key configuration parameters and ConsumerRebalanceListener usage.
Spring Boot Integration with Kafka
Detailed guide on integrating Kafka in Spring Boot projects, including dependency configuration, KafkaTemplate sync/async message sending, and complete @KafkaListener consumption practice.
Spark Distributed Environment Setup
Step-by-step Apache Spark distributed computing environment setup, covering download and extract, environment variable configuration, slaves/spark-env.sh core config adjustments, and complete multi-node cluster distribution and startup.
ClickHouse Cluster Connectivity Self-Check & Data Types Guide | Run ON CLUSTER in 10 Minutes
Using three-node cluster (h121/122/123) as example, first complete cluster connectivity self-check: system.clusters validation → ON CLUSTER create...
ClickHouse Table Engines: TinyLog/Log/StripeLog/Memory/Merge Selection Guide
Walks through ClickHouse table engines: TinyLog, Log, StripeLog, Memory and Merge principles, applicable scenarios and pitfalls, providing a reproducible minimum...
Kafka Components: Producer, Broker, Consumer Full Flow
Deep dive into Kafka's three core components: Producer partitioning strategy and ACK mechanism, Broker Leader/Follower architecture, Consumer Group partition assignment and offset management.
Kafka Installation: From ZooKeeper to KRaft Evolution
Introduction to Kafka 2.x vs 3.x core differences, detailed cluster installation steps, ZooKeeper configuration, Broker parameter settings, and how KRaft mode replaces ZooKeeper dependency.
ClickHouse Concepts & Basics | Why Fast? Columnar + Vectorized + MergeTree Comparison
For high-concurrency, low-latency OLAP scenarios, this article explains ClickHouse's underlying advantages (columnar+compression+vectorized, MergeTree family),...
ClickHouse Single Machine + Cluster Node Deployment Guide | Installation Configuration | systemd Management / config.d
Official recommended keyring + signed-by installation of ClickHouse on Ubuntu, start with systemd and self-check; provides single machine minimum example...
Flink CEP Practice: 24 Hours ≥5 Transactions & 10 Minutes Unpaid Detection Cases
Flink CEP (Complex Event Processing) complex event processing mechanism, combined with actual cases to deeply explain its application principles and practical...
Flink SQL Quick Start | Table API + SQL in 3 Minutes with toChangelogStream
Engineering perspective to quickly run Flink SQL: Provides modern dependencies (no longer using blink planner), minimum runnable example (MRE), Table API and...
Flink CEP Deep Dive: Complex Event Processing Complete Guide
Flink CEP is the core component for real-time analysis of complex event streams in Flink, providing a complete pattern matching framework, supporting...
Flink CEP Timeout Event Extraction: Complete Guide with Malicious Login Detection Case
Flink CEP timeout event extraction is a key step in stream processing, used to capture partial matches that exceed the window time (within) during pattern...
Redis High Availability: Master-Slave Replication & Sentinel
Deep dive into Redis high availability: master-slave replication, Sentinel automatic failover, and distributed lock design with Docker deployment examples.
Kafka Architecture: High-Throughput Distributed Messaging
Systematic introduction to Kafka core architecture: Topic/Partition/Replica model, ISR mechanism, zero-copy optimization, message format and typical use cases.
Flink StateBackend Deep Dive: Memory, Fs, RocksDB & OperatorState Management
ManagedOperatorState manages non-keyed state, achieving state consistency when operators recover from failures or rescale. Developers use it by implementing the CheckpointedFunction interface; it supports two data structures: ListState and BroadcastState.
Flink Parallelism Deep Dive: From Concepts to Best Practices
In Flink, Parallelism is the core parameter measuring task concurrent processing capability, determining the number of tasks that can run simultaneously for...
Flink Broadcast State: Dynamic Logic Updates in Real-time Stream Computing
Broadcast State is an important mechanism in Apache Flink that supports dynamic logic updates in streaming applications, widely used in real-time risk control,...
Flink State Backend: Memory, Fs, RocksDB & Performance Differences
State Storage (State Backend) is the core mechanism for implementing stateful stream computing in Flink, determining data reliability, performance and fault...
Flink Parallelism Setting Priority: Principles, Configuration & Best Practices
A Flink program consists of multiple Operators (Source, Transformation, Sink). An Operator is executed by multiple parallel Tasks (threads), and the number of...
Flink State: Keyed State, Operator State & KeyGroups Working Principles
Based on whether intermediate state is needed, Flink computation can be divided into stateful and stateless: Stateless computation like Map, Filter, FlatMap...
Redis Cache Problems: Penetration, Breakdown, Avalanche, Hot Key and Big Key
Systematic overview of the five most common Redis cache problems in high-concurrency scenarios: cache penetration, cache breakdown, cache avalanche, hot key, and big key. Analyzes the root cause of each problem and provides actionable solutions.
Redis Distributed Lock: Optimistic Lock, WATCH and SETNX with Lua and Java
Redis optimistic lock in practice: WATCH/MULTI/EXEC mechanism explained, Lua scripts for atomic operations, SETNX+EXPIRE distributed lock from basics to Redisson, with complete Java code examples.
Flink Time Semantics: EventTime, ProcessingTime, IngestionTime & Watermark Mechanism
Watermark is a special marker used to tell Flink the progress of events in the data stream. Simply put, Watermark is the 'current time' estimated by Flink in...
Flink Watermark Complete Guide: Event Time Window, Out-of-Order & Late Data
Flink's Watermark mechanism is one of the most core concepts in event time window computation, used for handling out-of-order events and ensuring accurate...
Flink Window Complete Guide: Tumbling, Sliding, Session
Flink's Window mechanism is the core bridge between stream processing and unified batch processing architecture. Flink treats batch as a special case of stream processing, using time windows (Tumbling, Sliding, Session) and count windows to split infinite streams into finite datasets.
Flink Sliding Window Deep Dive: Principles, Use Cases & Implementation Examples
Sliding Window is one of the core mechanisms in Apache Flink stream processing, more flexible than fixed windows, widely used in real-time monitoring, anomaly...
Flink JDBC Sink Deep Dive: MySQL Real-time Write, Batch Optimization & Best Practices
JDBC Sink is one of the most commonly used data output components, often used to write stream and batch processing results to relational databases like MySQL,...
Flink Batch Processing DataSet API: Use Cases, Code Examples & Optimization Mechanisms
Flink's DataSet API is the core programming interface for batch processing, designed for processing static, bounded datasets, supporting TB to PB scale big...
Redis Memory Management: Key Expiration and Eviction Policies
Comprehensive analysis of Redis memory control mechanisms, including maxmemory configuration, three key expiration deletion strategies (lazy/active/scheduled), and 8 memory eviction policies with applicable scenarios and selection guidance.
Redis Communication Internals: RESP Protocol and Reactor Event-Driven Model
Deep dive into Redis communication internals: RESP serialization protocol five data types, Pipeline batch processing mode, and how the epoll-based Reactor single-threaded event-driven architecture supports Redis high-concurrency processing capability.
Flink DataStream Transformation: Map, FlatMap, Filter to Window Complete Guide
Flink provides rich operators for DataStream to support flexible data stream processing in different scenarios. Common operators include Map, FlatMap and...
Flink Sink Usage Guide: Types, Fault Tolerance Semantics & Use Cases
Flink's Sink is the final output endpoint for data stream processing, used to write processed results to external systems or storage media. It is the endpoint of streaming applications, determining how data is saved, transmitted or consumed.
Flink Source Operator Deep Dive: Non-Parallel Source Principles & Use Cases
Non-Parallel Source is a source operation in Flink with parallelism fixed at 1. It runs as a single instance regardless of cluster size, ensuring elements are processed sequentially.
Flink SourceFunction to RichSourceFunction: Enhanced Source Functions & Practical Examples
RichSourceFunction and RichParallelSourceFunction are enhanced source functions suitable for scenarios requiring complex logic and resource management.
Flink on YARN Deployment: Environment Variables, Configuration & Resource Application
Deploying Flink in YARN mode requires completing a series of environment configuration and cluster management operations. First, configure environment...
Flink DataStream API: DataSource, Transformation & Sink Complete Guide
Covers the three pillars of the DataStream API: DataSource, Transformation and Sink. DataSource provides diverse data input methods including file systems, message queues, databases and custom data sources.
Redis Persistence: RDB vs AOF Comparison and Production Strategy
Systematic comparison of Redis two persistence solutions: RDB snapshot and AOF log — configuration methods, trigger mechanisms, pros and cons, AOF rewrite mechanism, and recommended strategies for production environments.
Redis RDB Persistence: Snapshot Principles, Configuration and Trade-offs
In-depth analysis of Redis RDB persistence mechanism, covering trigger methods, BGSAVE execution flow, configuration parameters, file structure, and comparison with AOF, helping you make informed persistence decisions in production environments.
Flink Architecture Deep Dive: JobManager, TaskManager & Core Roles Overview
Flink's runtime architecture adopts typical Master/Slave pattern with clear division of responsibilities among core components. JobManager as Master is...
Flink Installation & Deployment Guide: Local, Standalone & YARN Modes
Flink provides multiple installation modes to suit different scenarios. Local mode is suitable for personal learning and small-scale debugging with simple...
Apache Flink Deep Dive: From Origin to Technical Features
Apache Flink is an open-source big data stream processing framework, supporting efficient computation of unbounded stream and bounded batch data. With 'unified...
Flink Stream-Batch Integration Introduction: Concept Analysis & WordCount Code Practice
Apache Flink supports both stream processing and batch processing. Stream processing is suitable for real-time data like sensors, logs or trading streams,...
Redis Lua Scripts: EVAL, redis.call and Atomic Operations in Practice
Systematic explanation of Redis Lua script EVAL command syntax, differences between redis.call and redis.pcall, and four typical practical cases: atomic counter, CAS (Compare-And-Swap), batch operations, and distributed lock implementation using Lua scripts.
Redis Slow Query Log and Performance Tuning in Production
Detailed explanation of Redis slow query log configuration parameters (slowlog-log-slower-than, slowlog-max-len), core commands, and production-grade performance tuning strategies including data structure optimization, Pipeline usage, and monitoring system setup.
Spark Streaming Kafka Consumption: Offset Acquisition, Storage & Recovery Details
When Spark Streaming integrates with Kafka, Offset management is key to ensuring data processing continuity and consistency. Offset marks message position in...
Spark Streaming Integration with Kafka: Offset Management Mechanism Details & Best Practices
Offset is used to mark message position in Kafka partition. Proper management can achieve at-least-once or even exactly-once data processing semantics. By persisting Offset, application can resume consumption from last processed position during fault recovery, avoiding message loss or duplication.
Spark Streaming Stateful Transformations: Window Operations & State Tracking with Cases
Window operations integrate data from multiple batches over a longer time range by setting window length and slide duration. Cases demonstrate reduceByWindow...
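The window arithmetic summarized above (window length and slide duration over batches) can be shown without Spark; the sketch below is plain Java, not the DStream `reduceByWindow` API — a window of 3 batches sliding by 1 batch, maintained incrementally by subtracting the value that leaves the window:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sliding-window sum: window length = 3 batches, slide = 1 batch.
public class SlidingWindowSum {
    private final Deque<Integer> window = new ArrayDeque<>();
    private final int windowLength;
    private int sum = 0;

    public SlidingWindowSum(int windowLength) { this.windowLength = windowLength; }

    // Feed one batch's value; returns the current windowed sum.
    public int addBatch(int batchValue) {
        window.addLast(batchValue);
        sum += batchValue;
        if (window.size() > windowLength) {
            sum -= window.removeFirst(); // oldest batch slides out of the window
        }
        return sum;
    }

    public static void main(String[] args) {
        SlidingWindowSum w = new SlidingWindowSum(3);
        System.out.println(w.addBatch(1)); // 1
        System.out.println(w.addBatch(2)); // 3
        System.out.println(w.addBatch(3)); // 6
        System.out.println(w.addBatch(4)); // 9 (2+3+4: batch 1 slid out)
    }
}
```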
Spark Streaming Integration with Kafka: Receiver and Direct Approaches with Code Cases
This article introduces two Spark Streaming integration methods with Kafka: Receiver Approach and Direct Approach. Receiver uses Executor-based Receiver to...
Redis Advanced Data Types: Bitmap, Geo and Stream
Deep dive into Redis three advanced data types: Bitmap, Geo (GeoHash, Z-order curve, Base32 encoding), and Stream message stream, with common commands and practical examples.
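The Bitmap type mentioned above stores flags as individual bits under one key; `java.util.BitSet` is the closest in-JVM analogue. The sketch below mirrors SETBIT/GETBIT/BITCOUNT semantics (the sign-in key layout and user ids are illustrative, not from the article):

```java
import java.util.BitSet;

// Daily sign-in tracking, Bitmap-style: bit index = user id.
public class BitmapDemo {
    public static void main(String[] args) {
        BitSet signIns = new BitSet();
        signIns.set(1001);   // like SETBIT signin:2024-01-01 1001 1
        signIns.set(1005);
        System.out.println(signIns.get(1001));     // true  (GETBIT ... 1001)
        System.out.println(signIns.get(1002));     // false (user 1002 did not sign in)
        System.out.println(signIns.cardinality()); // 2     (BITCOUNT: users signed in)
    }
}
```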
Redis Pub/Sub: Mechanism, Weak Transaction and Risks
Detailed explanation of Redis Pub/Sub working mechanism, its three "weak transaction" reliability flaws (no persistence, no acknowledgment, no retry), and alternative solutions in production.
Redis Single Node and Cluster Installation
Install Redis 6.2.9 from source on Ubuntu, configure redis.conf for daemon mode, start redis-server and verify connection via redis-cli.
Redis Five Data Types: Complete Command Reference and Practical Scenarios
Comprehensive explanation of Redis five data types: String, List, Set, Sorted Set, and Hash. Includes common commands, underlying implementation details, and typical usage scenarios with complete command examples.
HBase Java API: Complete CRUD Code with Table Creation, Insert, Delete and Scan
Using HBase Java Client API to implement table creation, insert, delete, Get query, full table scan, and range scan. Includes complete Maven dependencies and runnable code examples covering all common HBase operations.
Redis Introduction: Features and Architecture
Introduction to Redis: in-memory data structure store, key-value database, with comparison to traditional databases and typical use cases.
HBase Cluster Deployment and High Availability Configuration
Complete HBase distributed cluster deployment: configure RegionServer on multiple nodes, HMaster high availability, integrate with ZooKeeper for coordination, with start/stop scripts and verification steps.
HBase Shell CRUD Operations and Data Model
HBase Shell commands: create table, Put/Get/Scan/Delete operations, explain HBase data model with practical examples.
HBase Overall Architecture: HMaster, HRegionServer and Data Model
Comprehensive analysis of HBase distributed database overall architecture, including ZooKeeper coordination, HMaster management node, HRegionServer data node, Region storage unit, and four-dimensional data model, suitable for big data architecture selection reference.
HBase Single Node Configuration: hbase-env and hbase-site.xml Details
Step-by-step configure HBase single node environment, explain hbase-env.sh, hbase-site.xml key parameters, complete integration with Hadoop HDFS and ZooKeeper cluster.
ZooKeeper Leader Election and ZAB Protocol Principles
Deep dive into ZooKeeper's Leader election mechanism and ZAB (ZooKeeper Atomic Broadcast) protocol, covering initial election process, message broadcast three phases, fault recovery strategy, and production deployment suggestions.
ZooKeeper Distributed Lock Java Implementation Details
Implement a distributed lock based on ZooKeeper ephemeral sequential nodes, with complete Java code covering lock competition, predecessor-node monitoring, CountDownLatch synchronization, and the full recursive-retry flow.
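The ordering logic behind the ephemeral-sequential-node lock can be sketched without a real ZooKeeper client. The in-memory model below (node names like lock-0000000001 are what ZooKeeper would generate; here a plain sequence counter stands in) shows the two rules the article's implementation relies on: the smallest sequence number holds the lock, and each waiter watches only its immediate predecessor:

```java
import java.util.TreeMap;

// In-memory sketch of ZooKeeper's sequential-node lock ordering
// (illustrative model, not the ZooKeeper API).
public class ZkLockOrder {
    private final TreeMap<Integer, String> nodes = new TreeMap<>(); // seq -> client
    private int nextSeq = 0;

    public int createSequentialNode(String client) {
        nodes.put(nextSeq, client);
        return nextSeq++;
    }

    // The smallest sequence number owns the lock.
    public boolean holdsLock(int seq) {
        return nodes.firstKey() == seq;
    }

    // The node a waiter should watch: the largest sequence below its own,
    // avoiding a "herd effect" where every waiter wakes on each release.
    public Integer predecessorOf(int seq) {
        return nodes.lowerKey(seq);
    }

    public void release(int seq) {
        nodes.remove(seq); // deleting the node fires the watch set on it
    }

    public static void main(String[] args) {
        ZkLockOrder zk = new ZkLockOrder();
        int a = zk.createSequentialNode("A");
        int b = zk.createSequentialNode("B");
        System.out.println(zk.holdsLock(a));     // true: A has the smallest sequence
        System.out.println(zk.predecessorOf(b)); // 0: B watches A's node only
        zk.release(a);
        System.out.println(zk.holdsLock(b));     // true: B acquires after A releases
    }
}
```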
ZooKeeper Watcher Principle and Command Line Practice Guide
Complete analysis of the Watcher registration-trigger-notification flow from the client and its WatchManager to the ZooKeeper server, with zkCli command-line practice demonstrating node CRUD and monitoring.
ZooKeeper Java API Practice: Node CRUD and Monitoring
Use ZkClient library to operate ZooKeeper via Java code, complete practical examples of session establishment, persistent node CRUD, child node change monitoring, and data change monitoring.
ZooKeeper Cluster Configuration Details and Startup Verification
Deep dive into zoo.cfg core parameter meanings, explain myid file configuration specifications, demonstrate 3-node cluster startup process and Leader election result verification.
ZooKeeper ZNode Data Structure and Watcher Mechanism Details
Deep dive into ZooKeeper's four ZNode node types, ZXID transaction ID structure, and one-time trigger Watcher monitoring mechanism principles and practice.
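The ZXID structure mentioned above is a 64-bit transaction id: the high 32 bits are the Leader epoch and the low 32 bits are a counter within that epoch. A small bit-arithmetic sketch (helper names are illustrative) makes the consequence concrete — any transaction from a newer epoch compares greater than every transaction from an older one:

```java
// Splitting and comparing ZXIDs: high 32 bits = epoch, low 32 bits = counter.
public class ZxidDemo {
    static long makeZxid(int epoch, int counter) {
        return ((long) epoch << 32) | (counter & 0xFFFFFFFFL);
    }
    static int epochOf(long zxid)   { return (int) (zxid >>> 32); }
    static int counterOf(long zxid) { return (int) zxid; }

    public static void main(String[] args) {
        long zxid = makeZxid(3, 17);         // epoch 3, 17th transaction
        System.out.println(epochOf(zxid));   // 3
        System.out.println(counterOf(zxid)); // 17
        // A newer epoch always wins, regardless of the old epoch's counter:
        System.out.println(makeZxid(4, 0) > makeZxid(3, Integer.MAX_VALUE)); // true
    }
}
```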
Sqoop Incremental Import and CDC Change Data Capture Principles
Introduce Sqoop's --incremental append incremental import mechanism, and deeply explain CDC (Change Data Capture) core concepts, capture method comparisons, and modern solutions like Flink CDC, Debezium.
ZooKeeper Distributed Coordination Framework Introduction and Cluster Deployment
Introduction to ZooKeeper core concepts, Leader/Follower/Observer role division, ZAB protocol principles, and demonstration of 3-node cluster installation and configuration process.
Sqoop Partial Import: --query, --columns, --where Three Filtering Methods
Detailed explanation of three ways Sqoop imports partial data from MySQL to HDFS by condition: custom query, specify columns, WHERE condition filtering, with applicable scenarios and precautions.
Sqoop and Hive Integration: MySQL ↔ Hive Bidirectional Data Migration
Demonstrates Sqoop importing MySQL data directly into a Hive table and exporting Hive data back to MySQL, covering the use of key parameters such as --hive-import and --create-hive-table.
Sqoop Data Migration ETL Tool Introduction and Installation
Introduction to Apache Sqoop core principles, use cases, and installation configuration steps on Hadoop cluster, helping quickly get started with batch data migration between MySQL and HDFS/Hive.
Sqoop Practice: MySQL Full Data Import to HDFS
Complete example demonstrating Sqoop importing MySQL table data to HDFS, covering core parameter explanations, MapReduce parallel mechanism, and execution result verification.
Flume Collect Hive Logs to HDFS
Use a Flume exec source to tail Hive log files in real time, buffer events in a memory channel, and configure an HDFS sink with time-based partitioning so log data lands in HDFS automatically.
Flume Dual Sink: Write Logs to Both HDFS and Local File
Using Flume's replicating channel selector and a three-agent cascade architecture, write the same log data to both HDFS and a local file, serving offline analysis and real-time backup needs at once.
Apache Flume Architecture and Core Concepts
Introduction to Apache Flume positioning, core components (Source, Channel, Sink), event model and common data flow topologies, and installation configuration methods.
Flume Hello World: NetCat Source + Memory Channel + Logger Sink
Flume's simplest Hello World case: a netcat source listening on a port, a memory channel for buffering, and a logger sink printing to the console, demonstrating the complete Source→Channel→Sink data flow.
Hive Metastore Three Modes and Remote Deployment
Detailed explanation of Hive Metastore's embedded, local, and remote deployment modes, and complete steps to configure high-availability remote Metastore on three-node cluster.
HiveServer2 Configuration and Beeline Remote Connection
Introduction to HiveServer2 architecture and role, configure Hadoop proxy user and WebHDFS, implement cross-node JDBC remote access to Hive via Beeline client.
Hive DDL and DML Operations
Systematic explanation of Hive DDL (database/table creation, internal and external tables) and DML (data loading, insertion, query) operations, with complete HiveQL examples and configuration optimization.
Hive HQL Advanced: Data Import/Export and Query Practice
Deep dive into Hive's multiple data import methods (LOAD/INSERT/External Table/Sqoop), data export methods, and practical usage of HQL query operations like aggregation, filtering, and sorting.
MapReduce JOIN Four Implementation Strategies
Deep dive into the principles and Java implementations of four MapReduce JOIN strategies: Reduce-Side Join, Map-Side Join, Semi-Join, and Bloom Join, with analysis of their applicable scenarios and performance characteristics.
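The Bloom Join strategy named above rests on one data structure: a Bloom filter built from the small table's join keys and shipped to every mapper, so rows from the large table whose key is definitely absent can be dropped before the shuffle. A minimal sketch (the two hash functions and sizes are illustrative, not the article's implementation):

```java
import java.util.BitSet;

// Minimal Bloom filter for the Bloom Join idea.
public class BloomJoinFilter {
    private static final int SIZE = 1 << 16;
    private final BitSet bits = new BitSet(SIZE);

    private int h1(String k) { return Math.floorMod(k.hashCode(), SIZE); }
    private int h2(String k) { return Math.floorMod(k.hashCode() * 31 + 7, SIZE); }

    public void add(String key)           { bits.set(h1(key)); bits.set(h2(key)); }
    public boolean mightContain(String k) { return bits.get(h1(k)) && bits.get(h2(k)); }

    public static void main(String[] args) {
        BloomJoinFilter f = new BloomJoinFilter();
        f.add("user_42"); // build phase: keys from the small table
        System.out.println(f.mightContain("user_42")); // true: keep row for the join
        // false means "definitely absent" -> the mapper skips the row entirely;
        // true may be a false positive, so the reducer still verifies the key.
    }
}
```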
Hive Introduction: Architecture and Cluster Installation
Introduction to Hive data warehouse core concepts, architecture components and pros/cons, with detailed steps to install and configure Hive 2.3.9 on three-node Hadoop cluster.
HDFS Java Client Practice: Upload/Download Files, Directory Scan
Using the Hadoop HDFS Java client API for file operations: Maven dependency configuration, the core FileSystem/Path/Configuration classes, and implementations of file upload, download, delete, directory listing, and progress-bar display.
Java Implementation MapReduce WordCount Complete Code
Implement Hadoop MapReduce WordCount from scratch: a detailed explanation of Hadoop's serialization mechanism, writing the Mapper, Reducer, and Driver components, Maven project configuration, and complete code for local and cluster runs.
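The Mapper/Reducer logic of WordCount can be compressed into plain Java without Hadoop types, as a sketch of what the three components compute (this mirrors the map-emit and reduce-sum phases, not the Driver setup or Hadoop serialization):

```java
import java.util.Map;
import java.util.TreeMap;

// WordCount in plain Java: map splits lines into (word, 1) pairs,
// grouping and summing play the role of shuffle + reduce.
public class WordCountSketch {
    static Map<String, Integer> wordCount(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys, like shuffle's ordering
        for (String line : lines) {                    // map phase: one (word, 1) per token
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // reduce phase: sum per key
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = { "hello world", "hello hadoop" };
        System.out.println(wordCount(lines)); // {hadoop=1, hello=2, world=1}
    }
}
```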
HDFS Distributed File System Read/Write Principle
Deep dive into HDFS architecture: NameNode, DataNode, Client roles, Block storage mechanism, file read/write process (Pipeline write and nearest read), and HDFS basic commands.
HDFS CLI Practice Complete Command Guide
Complete HDFS CLI practice: hadoop fs common commands including directory operations, file upload/download, permission management, with three-node cluster live demo.
Hadoop Cluster WordCount Distributed Computing Practice
Complete WordCount execution on Hadoop cluster: upload files to HDFS, submit MapReduce job, view running status through YARN UI, verify true distributed computing.
Hadoop JobHistoryServer Configuration and Log Aggregation
Configure Hadoop JobHistoryServer to record MapReduce job execution history, enable YARN log aggregation, view job details and logs via Web UI.
Hadoop Cluster SSH Passwordless Login Configuration and Distribution Script
Complete guide for Hadoop three-node cluster SSH passwordless login: generate RSA keys, distribute public keys, write rsync cluster distribution script, including pitfall notes and /etc/hosts configuration points.
Hadoop Cluster Startup and Web UI Verification
Complete startup process for Hadoop three-node cluster: format NameNode, start HDFS and YARN, verify cluster status via Web UI, including start-dfs.sh and start-yarn.sh usage.
Basic Environment Setup: Hadoop Cluster
Detailed tutorial on setting up Hadoop cluster environment on 3 cloud servers (2C4G configuration), including HDFS, MapReduce, YARN components introduction, Java and Hadoop environment configuration steps.
Hadoop Cluster XML Configuration Details
Detailed explanation of Hadoop cluster three-node XML configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, including NameNode, DataNode, ResourceManager configuration instructions.