Tag: Big Data
271 articles
AI Investigation #54: Big Data Industry Applications and Technology Selection Trends
Big data has achieved deep integration in finance, e-commerce, internet, communications, manufacturing, healthcare, education and other industries, becoming the core engi...
AI Investigation #53: Big Data Talent Landscape - Experience Distribution, Growth Paths and Industry Trends
The talent structure in the big data industry shows characteristics of youth and rapid growth. The 25-30 age group is the main force, while 30-35 year-olds are gradually...
AI Investigation #52: Big Data Technology Landscape - Lakehouse, Data Mesh and Serverless
Big data technology is undergoing a new wave of transformation. Lakehouse architecture combines the advantages of data lakes and data warehouses.
AI Investigation #51: Big Data Technology Evolution - Obsolete Frameworks, Architectures and the Reasons Behind Them
Big data technology evolution: MapReduce replaced by Spark, Storm replaced by Flink, Pig/Hive gradually phased out.
AI Investigation #50: Big Data Evolution - Two Decades of Architectural Change from Hadoop to Flink
Two decades of big data evolution: from 2006 MapReduce batch processing to 2013 Spark in-memory computing, to 2019 Flink real-time computing.
AI Research 49 - Big Data Survey Report: Development History from 1997 to 2025
Big data development began in 1997 when NASA proposed the concept; from 2003 to 2006 Google published the three major papers on GFS, MapReduce, and Bigtable, leading distributed computing re...
Big Data 278 - Spark MLlib GBDT Case Study: Residuals, Regression Trees & Iterative Training
GBDT practical case study walking through the complete process from residual calculation to regression tree construction and iterative training.
Spark MLlib: Bagging vs Boosting Differences and GBDT Gradient Boosting
Introduces the differences between Bagging and Boosting in machine learning, and the GBDT (Gradient Boosting Decision Tree) algorithm principles.
Spark MLlib GBDT Algorithm: Gradient Boosting Principles and Applications
This article introduces the principles and applications of gradient boosting tree (GBDT) algorithm.
Spark MLlib Ensemble Learning: Random Forest, Bagging and Boosting Methods
This article systematically introduces ensemble learning methods in machine learning.
Spark MLlib Decision Tree Pruning: Pre-pruning, Post-pruning Principles and Practice
This article systematically introduces decision tree pre-pruning and post-pruning principles, compares core differences between three mainstream algorithms.
Spark MLlib Decision Tree: Classification Principles, Gini/Entropy and Practice
This article introduces the basic concepts and classification principles of decision trees.
Big Data 272 - Spark MLlib Logistic Regression: Basics, Input Function, Sigmoid & Loss
This article introduces the basic principles, application scenarios, and implementation in Spark MLlib of logistic regression.
Big Data 271 - Spark MLlib Linear Regression: Scenarios, Loss Function & Optimization
Linear regression uses regression equations to model relationships between independent and dependent variables.
Big Data 271 - Spark MLlib Logistic Regression: Sigmoid, Loss Function & Diabetes Prediction Case
Logistic Regression is a classification model in machine learning. Despite having "regression" in its name, it is a classification algorithm.
Spark MLlib Linear Regression: Scenarios, Loss Function and Optimization
Linear Regression is an analytical method that uses regression equations to model the relationship between one or more independent variables and a dependent variable.
Big Data 265 - Canal Deployment
Canal is an open-source data synchronization tool from Alibaba for MySQL database incremental log parsing and synchronization.
Big Data 263 - Canal Working Principle: Workflow and MySQL Binlog Basics
Canal is an open-source tool for MySQL database binlog incremental subscription and consumption.
MySQL Binlog Deep Dive: Storage Directory, Change Records and Format
MySQL's Binary Log (binlog) is a log file that records all change operations performed on the database (excluding SELECT and SHOW queries).
Canal Data Sync: Introduction, Background, Principles and Architecture
Alibaba B2B's cross-region business between domestic sellers and overseas buyers drove the need for data synchronization between Hangzhou and US data centers.
Big Data 260 - Real-Time Data Warehouse: Background, Architecture and Requirements
Real-time data processing capability has become a key competitive factor for enterprises.
Big Data 259 - Griffin Configuration
Apache Griffin is an open-source data quality management framework designed to help organizations monitor and improve data quality in big data environments.
Big Data 258 - Griffin with Livy: Architecture, Installation, and Usage
Livy is a REST interface for Apache Spark designed to simplify Spark job submission and management, especially in big data processing scenarios.
Big Data 257 - Data Quality Monitoring
Apache Griffin is an open-source big data quality solution that supports both batch and streaming data quality detection.
Flink CEP: Complex Event Processing & Pattern Matching
Flink CEP detailed explanation: pattern sequence, individual patterns, combined patterns, matching skip strategies and practical cases.
Big Data 256 - Atlas Installation
Metadata (MetaData) in the narrow sense refers to data that describes data.
Big Data 255 - Atlas Data Warehouse Metadata Management
Metadata, in its narrowest sense, refers to data that describes other data.
Flink Memory Management: Network Buffer, State Backend & Memory Model
Flink memory model detailed explanation: Network Buffer Pool, Task Heap, State Backend memory allocation, GC tuning and backpressure handling.
Big Data 99 - Flink Parallelism: Operator Chaining, Slot and Resource Scheduling
Flink parallelism detailed explanation: Operator Chaining, Slot allocation strategy, parallelism settings and resource scheduling principle.
Airflow Core Transaction Task Scheduling Integration for Offline Data Warehouse
Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks.
Big Data 253 - Airflow Core Concepts
Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks.
Big Data 252 - Airflow Crontab Scheduling
Linux systems use the cron (crond) service for scheduled tasks, which is enabled by default.
Offline Data Warehouse ADS Layer and Airflow Task Scheduling
Apache Airflow is an open-source task scheduling and workflow management platform, primarily used for developing, debugging, and monitoring data pipelines.
Big Data 251 - Airflow Installation
Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks.
Big Data 96 - Flink Broadcast State: BroadcastState Practice and Rule Updates
Flink Broadcast State explanation: BroadcastState principle, dynamic rule updates, state partitioning and memory management, demonstrating broadcast stream and non-broadc...
Big Data 97 - Flink State Backend: State Storage and Performance Optimization
Flink State Backend detailed explanation: HashMapStateBackend, EmbeddedRocksDBStateBackend selection, memory configuration and performance tuning.
Big Data 247 - Offline Data Warehouse - Hive Order Zipper Table: Incremental Refresh Implementation
This article continues the zipper table practice, focusing on order history state incremental refresh.
Big Data 248 - Offline Data Warehouse: Dimension Tables
First determine which tables are fact tables and which are dimension tables: green indicates fact tables, gray indicates dimension tables.
Big Data 249 - Offline Data Warehouse DWD and DWS Layer
The order table is a periodic fact table; zipper tables can be used to retain order status history. Order detail tables are regular fact tables.
Big Data 247 - Offline Data Warehouse - Hive Zipper Table Practice: Order History State Incremental Refresh
This article provides a practical guide to Hive zipper tables for preserving order history states at low cost in offline data warehouses, supporting daily backtracking an...
Big Data 246 - Offline Data Warehouse - Hive Zipper Table Practice: Initialization, Incremental Update, Rollback Script
userinfo (partitioned table) => userid, mobile, regdate => daily changed data (modified + new) / historical data (first day); userhis (zipper table) => two additional fiel...
Big Data 245 - Offline Data Warehouse - Hive Zipper Table Getting Started: SCD Types, Table Creation and Loading
Slowly Changing Dimensions (SCD) refer to dimension attributes that change slowly over time in the real world (slow is relative to fact tables, where data changes faster...
Big Data 244 - Offline Data Warehouse: Hive ODS Layer
Sync MySQL data to specified HDFS directory via DataX, then create ODS external tables in Hive with unified dt string partitioning.
Big Data 243 - Offline Data Warehouse: E-commerce Core Transaction Incremental Import
Scenario: three core e-commerce transaction tables are loaded incrementally each day into the offline data warehouse ODS layer, partitioned by dt. Conclusion: DataX uses MySQLReader + HDFSWriter.
Big Data 241 - Offline Data Warehouse Practice: E-commerce Core Transaction Data Model & MySQL Source Table Design
Focusing on three main metrics (order count, product count, payment amount), with breakdown dimensions by sales region and product type (3-level category).
Big Data 95 - Flink State and Checkpoint: State Management, Fault Tolerance and Savepoints
Flink stateful computation explanation: Keyed State, Operator State, Checkpoint configuration, Savepoint backup and recovery, production environment practices.
Offline Data Warehouse Advertising Business: Flume Import HDFS + ODS/DWD
Using Flume Agent to collect event logs and write to HDFS, then use Hive scripts to complete ODS and DWD layer data loading by date.
Offline Data Warehouse Advertising Business Hive Analysis: CTR/CVR/Top100
action: user behavior (0 impression; 1 click after impression; 2 purchase); duration: stay duration; shopid: merchant id; eventtype: "ad"; adtype: format type (1 JPG; 2 PNG)
Big Data 93 - Flink Streaming Introduction: DataStream API and Program Structure
This is article 93 in the Big Data series, introducing Flink DataStream API core concepts and program structure.
Flink Window and Watermark: Time Windows, Tumbling/Sliding/Session
Comprehensive analysis of Flink Window mechanism: tumbling windows, sliding windows, session windows, Watermark principle and generation strategies, late data processing...
Big Data 237 - Offline Data Warehouse Hive Advertising Practice: ODS to DWD Event Parsing
This article introduces completing parsing, cleaning, and detail modeling from ODS to DWD for offline data warehouse based on advertising events in tracking logs.
Big Data 235 - Offline Data Warehouse Practice: Flume, HDFS and Hive for ODS, DWD, DWS and ADS
This article demonstrates a complete offline data warehouse pipeline from log collection to member metric analysis.
Big Data 91 - Flink Installation & Deployment: Local, Standalone and YARN Modes
Apache Flink is a distributed stream processing framework widely used for real-time data computing scenarios.
Flink on YARN Deployment: Environment Preparation, Resource Manager
Detailed explanation of three Flink deployment modes on YARN cluster: Session, Application, Per-Job modes, Hadoop dependency configuration, YARN resource application and...
Big Data 233 - Offline Data Warehouse Retention Rate: DWS Modeling & ADS Hive Aggregation
Implementation of 'Member Retention' in the offline data warehouse: the DWS layer uses the dwsmemberretention_day table to join the new member and startup detail tables.
Big Data 232 - Hive New Member & Retention
The offline data warehouse calculates 'new members' daily, providing a consistently defined data foundation for subsequent 'member retention'.
Big Data 90 - Apache Flink Introduction: Unified Stream-Batch Real-Time Computing
Systematic introduction to Apache Flink's origin, core features, and architecture components: JobManager, TaskManager, Dispatcher responsibilities, unified stream-batch p...
Offline Data Warehouse Hive Practice: DWD to DWS Daily/Weekly/Monthly Active Member
This article introduces using Hive to build an offline data warehouse for calculating active members (daily/weekly/monthly).
Hive ODS Layer JSON Parsing: UDF Array Extraction, explode/json_tuple
JSON data processing in Hive offline data warehouse, covering three most common needs: 1) Extract array fields from JSON strings and explode-expand in SQL
Hive ODS Layer Practice: External Table Partition Loading and JSON Parsing
Engineering implementation of the ODS (Operational Data Store) layer in an offline data warehouse: the minimal closed loop of Hive external tables plus daily partition loading.
Big Data 228 - Flume Taildir + Custom Interceptor: Extract JSON Timestamps, Mark Headers & Partition HDFS by Event Time
Apache Flume offline log collection implementation using Taildir Source and a custom Interceptor to extract JSON timestamps, mark headers, and route HDFS partitions by ev...
Flume 1.9.0 Custom Interceptor: TAILDIR Multi-directory Collection, Filter by Logtime
Using TAILDIR Source to monitor multiple directories (start/event), using filegroups headers to mark different sources with logtype
Big Data 226 - Flume Optimization for Offline Data Warehouse: batchSize, Channels, Compression, Interceptors & OOM Tuning
Flume 1.9.0 tuning guide for offline data warehouse log collection to HDFS, covering batch parameters, channel capacity and transaction sizing, JVM heap tuning...
Big Data 87 - Spark DStream Transformation Operators: map, reduceByKey and transform
Systematically review Spark Streaming DStream stateless transformation operators and transform advanced operations, demonstrating three implementation approaches for blac...
Spark Streaming Window Operations & State Tracking: updateStateByKey & mapWithState
In-depth explanation of Spark Streaming stateful computing: window operation parameter configuration, reduceByKeyAndWindow hot word statistics, updateStateByKey full-stat...
Offline Data Warehouse Member Metrics Practice
Scenario: Use startup logs/event logs in offline data warehouse to count new, active (DAU/WAU/MAU), retention.
Big Data 223 - How to Build an Offline Data Warehouse: Tracking, Metrics and Thematic Analysis
Complete guide to building offline data warehouses: data tracking, metric system design, theme analysis, and standardization practices for e-commerce teams.
Offline Data Warehouse Architecture Selection and Cluster Design
Offline Data Warehouse (Offline DW) overall architecture design and implementation method, including framework selection comparison for the Apache community version.
Big Data 221 - Offline Data Warehouse Layering: ODS, DWD, DWS, DIM and ADS Architecture
Scenario: the more data marts departments build, the more inconsistent definitions and disconnected interfaces become, forming data silos and driving up data query costs.
Offline Data Warehouse Modeling Practice
In data warehouse architecture, Fact Table is the core table structure that stores business process metric values or facts.
Spark Streaming Introduction: From DStream to Structured Streaming
This is article 85 in the Big Data series, introducing the architecture and evolution background of Spark's two generations of streaming frameworks.
Big Data 219 - Grafana 11.3.0 Installation & Startup: YUM, systemd and Login Setup
For ops/devs still using CentOS/RHEL (or compatible distributions) in 2026, this provides installation of Grafana 11.3.0 (grafana-enterprise-11.3.0-1.x86_64).
Big Data 220 - Data Warehouse Introduction
In 1988, IBM first introduced the concept of "Information Warehouse" when facing increasingly scattered enterprise information systems and growing data silo problems.
Big Data 217 - Prometheus 2.53.2 Installation and Configuration Practice
Scenario: Single-machine deployment of Prometheus 2.53.2, pull node_exporter metrics from multiple hosts and verify Targets status.
Big Data 218 - Prometheus Node Exporter 1.8.2 and Pushgateway 1.10.0
Common Prometheus monitoring deployment: Install node_exporter-1.8.2 on Rocky Linux to expose host metrics, integrate with Prometheus scrape config, and visualize in Graf...
sklearn KMeans Key Attributes & Evaluation: cluster_centers_, inertia_, metrics
Scenario: Using sklearn for KMeans clustering, want to explain centroids/loss and use metrics for K selection.
Big Data 216 - KMeans n_clusters Selection
KMeans n_clusters selection method: calculate silhouette_score and silhouette_samples on candidate cluster numbers.
Big Data 213 - Python Hand-Written K-Means Clustering
Scenario: Hand-write K-Means using NumPy/Pandas, perform 3-class clustering on Iris.txt and output centroids with clustering results.
Big Data 214 - K-Means Clustering Practice: Self-Implemented Algorithm vs sklearn
K-Means clustering provides an engineering workflow that is 'verifiable, reproducible, and debuggable': first use 2D testSet dataset for algorithm verification.
Big Data 211 - Scikit-Learn Logistic Regression Implementation
When using Logistic Regression in Scikit-Learn, max_iter controls maximum iterations affecting model convergence speed and accuracy.
Big Data 212 - K-Means Clustering Guide
K-Means clustering algorithm, comparing supervised vs unsupervised learning (whether labels Y are needed).
Big Data 209 - Deep Understanding of Logistic Regression
Logistic Regression (LR) is an important classification algorithm in machine learning.
Big Data 210 - How to Implement Logistic Regression in Scikit-Learn and Regularization Detailed (L1 and L2)
As C gradually increases, regularization strength decreases; model performance on both training and test sets trends upward until around C=0.8.
SparkSQL Core Abstractions: RDD, DataFrame, Dataset & SparkSession
This is article 81 in the Big Data series, comprehensively introducing the features, use cases, and mutual conversions of Spark's three core data abstractions.
SparkSQL Operators: Transformation & Action Operations
This is article 82 in the Big Data series, systematically introducing SparkSQL Transformation and Action operators with complete test cases.
Big Data 207 - How to Handle Multicollinearity
When using scikit-learn for linear regression, how to handle multicollinearity in the least squares method.
Big Data 208 - Ridge Regression and Lasso Regression
Ridge Regression and Lasso Regression are two commonly used linear regression regularization methods for solving overfitting and multicollinearity in machine learning.
Big Data 205 - Linear Regression Machine Learning Perspective
Linear Regression core chain: unify the prediction function y=Xw in matrix form and treat the parameter vector w as the only unknown.
Big Data 206 - NumPy Matrix Multiplication Hand-written Multivariate Linear Regression
pandas DataFrame and NumPy matrix multiplication hand-written multivariate linear regression (linear regression implementation).
Big Data 203 - sklearn Decision Tree Pruning Parameters
Common parameters for decision tree pruning (pre-pruning) in engineering: max_depth, min_samples_leaf, min_samples_split, max_features, min_impurity_decrease.
Big Data 204 - Confusion Matrix to ROC: Imbalanced Binary Classification Metrics in sklearn
Confusion matrix (TP, FP, FN, TN) with unified metrics: Accuracy, Precision, Recall (Sensitivity), F1 Measure, ROC curve, AUC value, and practical business interpretation...
Spark Standalone Mode: Architecture & Performance Tuning
Comprehensive explanation of Spark Standalone cluster four core components, application submission flow, SparkContext internal architecture, Shuffle evolution history and...
SparkSQL Introduction: SQL & Distributed Computing Fusion
Systematic introduction to SparkSQL evolution history, core abstractions DataFrame/Dataset, Catalyst optimizer principle, and practical usage of multi-data source integra...
Big Data 201 - Decision Tree from Split to Pruning
A decision tree is a tree-structured supervised learning model, commonly used for classification and regression tasks.
Big Data 202 - sklearn Decision Tree Practice: criterion, Graphviz Visualization & Pruning
Complete flow of DecisionTreeClassifier on load_wine dataset from data splitting, model evaluation to decision tree visualization (2026 version).
Big Data 199 - Decision Tree Model Explained: Node Structure, Conditional Probability & Shannon Entropy
Tree models are a widely used algorithm type in supervised learning and can be applied to both classification and regression problems.
Big Data 200 - Decision Tree Information Gain Detailed
Scenario: Use information entropy/information gain to explain why decision tree selects certain column for splitting, and use Python to reproduce "best split column".
Big Data 197 - K-Fold Cross-Validation Practice
A random train/test split makes evaluation metrics unstable; the article gives the engineering solution: K-Fold Cross-Validation.
Big Data 198 - KNN Must Normalize First: Min-Max Scaling, Data Leakage Pitfalls & sklearn Practice
In scikit-learn pipelines, distance-based models like KNN are highly sensitive to inconsistent feature scales. Split first, fit MinMaxScaler only on the training set...
Spark RDD Fault Tolerance: Checkpoint Principle & Best Practices
Detailed explanation of Spark Checkpoint execution flow, core differences with persist/cache, partitioner strategies, and best practices for iterative algorithms and long...
Spark Broadcast Variables: Efficient Shared Read-Only Data
Detailed explanation of Spark broadcast variable working principle, configuration parameters and best practices.
Big Data 195 - KNN/K-Nearest Neighbors Algorithm Practice
KNN/K-Nearest Neighbors Algorithm: from Euclidean distance calculation, distance sorting, and TopK voting to function encapsulation, giving a reproducible Python implementation.
Big Data 196 - scikit-learn KNN Practice: KNeighborsClassifier, kneighbors & Learning Curves
Since being initiated in 2007 by David Cournapeau, scikit-learn (sklearn) has become one of the most important machine learning libraries in the Python ecosystem.
Big Data 193 - Apache Tez Practice: Hive on Tez Installation, DAG Principles & Common Pitfalls
Tez (pronounced "tez") is an efficient data processing framework running in the Hadoop ecosystem, designed to optimize batch processing and interactive queries.
Big Data 194 - Data Mining Overview: From Wine Classification to Supervised, Unsupervised & Reinforcement Learning
In a bar, there are ten almost identical glasses of wine on the counter. The owner proposes a game: win and drink for free, lose and pay three times the price of the wine.
Big Data 191 - Elasticsearch Cluster Planning & Tuning: Node Roles, Shards, Replicas, Write and Search Checklist
Master / Data / Coordinating node responsibilities and production role isolation strategies, capacity planning calculations.
Big Data 192 - DataX 3.0 Architecture & Practice
Scenario: Offline sync MySQL/HDFS/Hive/OTS/ODPS and other heterogeneous data sources, batch migration and data warehouse ETL.
Spark Super WordCount: Text Cleaning & MySQL Persistence
This is article 75 in the Big Data series, on top of basic WordCount add text preprocessing and database persistence, build a near-production word frequency pipeline.
Spark Serialization & RDD Execution Principle
This is article 76 in the Big Data series, systematically reviewing Spark process communication mechanism, serialization strategy and RDD execution principle.
Big Data 189 - Nginx JSON Logs to ELK: ZK + Kafka + Elasticsearch 7.3.0 + Kibana 7.3.0
Configure Nginx log_format json to output a structured access_log (containing @timestamp, request_time, status, request_uri, ua and other fields).
Filebeat → Kafka → Logstash → Elasticsearch Practice
Filebeat collects Nginx access.log to Kafka, and Logstash consumes, parses embedded JSON by field conditions, enriches metadata, and writes structured logs to Elasticsear...
Big Data 187 - Logstash Filter Plugin Practice
Filter is responsible for parsing, transforming, filtering events. Multiple Filters execute in configured order.
Big Data 188 - Logstash Output Plugin Practice
Output is the final stage of Logstash pipeline, responsible for outputting processed data to target system.
Big Data 185 - Logstash 7 Getting Started: stdin/file Collection, sincedb, start_position & Error Quick Reference
Logstash 7 getting started tutorial, covering stdin/file collection, sincedb mechanism and start_position effect conditions, with error quick reference table
Big Data 186 - Logstash JDBC vs Syslog Input: Principles, Scenarios & Reusable Configurations
Logstash Input plugin comparison: breaking down the technical differences between the JDBC Input and Syslog collection pipelines, applicable scenarios, and key configs.
Spark Scala WordCount Implementation
Implement distributed WordCount using Spark + Scala and Spark + Java, detailed RDD five-step processing flow, Maven project configuration and spark-submit command.
Spark Scala Practice: Pi Estimation & Mutual Friends
Deep dive into Spark RDD programming through two classic cases: Monte Carlo method distributed Pi estimation, and mutual friends analysis in social networks with two appr...
Big Data 183 - Elasticsearch Concurrency Conflicts & Optimistic Lock
Elasticsearch concurrency conflicts (read-modify-write inventory deduction): breaks down the cause of write overwrites and gives an engineering solution using ES optimistic locking.
Big Data 184 - Elasticsearch Doc Values Mechanism Detailed
A columnar on-disk data structure generated at indexing time, optimized for sorting, aggregations, and script field values.
Big Data 181 - Elasticsearch Segment Merge & Disk Directory Breakdown
Explains why refresh increases the number of small segments, and how segment merge combines small segments into larger ones in the background and cleans up deleted documents.
Big Data 182 - Elasticsearch Inverted Index Underlying Breakdown
This article details the core data structures of the Elasticsearch inverted index: Term Dictionary, Posting List, FST (Finite State Transducer), and SkipList, and how they accelerate lookups.
Big Data 179 - Elasticsearch Inverted Index and Read/Write Process
This article deeply analyzes Elasticsearch's inverted index principle based on Lucene, and document read/write flow.
Big Data 180 - Elasticsearch Near Real-Time Search: Segment, Refresh and Flush
Article details core mechanism of Elasticsearch near real-time search, including Lucene Segment, Memory Buffer, File System Cache, Refresh, Flush and Translog.
Spark Action Operations Overview
This is article 72 in the Big Data series, systematically reviewing Spark RDD Action operators.
Elasticsearch Aggregation Practice: Metrics Aggregations & Bucket Aggregations
Covers complete practice of Metrics Aggregations and Bucket Aggregations, applicable to common Elasticsearch 7.x / 8.x versions in 2025.
Big Data 178 - Elasticsearch 7.3 Java Practice: Index and Document CRUD
This article details the complete flow for index and document CRUD operations using Elasticsearch 7.3.0 and RestHighLevelClient.
Big Data 175 - Elasticsearch Term Queries and Bool Combination Practice
This article demonstrates Elasticsearch term-level queries including term, terms, range, exists, prefix, regexp, fuzzy, ids queries, and bool compound queries.
Big Data 176 - Elasticsearch Filter DSL Practice: Filter Queries, Pagination and Highlighting
This article details practical usage of Elasticsearch Filter DSL, covering filter query, sort pagination, highlight display and batch operations.
Big Data 173 - Elasticsearch Mapping and Document CRUD Practice
After creating an index, you need to define field constraints, known as field mapping (mapping).
Elasticsearch Query DSL Practice: match/match_phrase/query_string/multi_match
In-depth explanation of core Query DSL usage in Elasticsearch 7.3, focusing on the differences and pitfalls of match, match_phrase, and query_string.
Spark Cluster Architecture & Deployment Modes
This is article 71 in the Big Data series, introducing Spark cluster core architecture, deployment mode comparisons, and static/dynamic resource management strategies.
Big Data 171 - Elasticsearch-Head and Kibana 7.3.0 Practice
Introduction to Elasticsearch-Head plugin and Kibana 7.3.0 installation and connectivity points, covering Chrome extension quick access.
Elasticsearch Index Operations & IK Analyzer Practice: 7.3/8.x
This article explains Elasticsearch index CRUD operations and IK analyzer config, covering versions 7.3.0 and 8.15.0.
Big Data 169 - Elasticsearch Getting Started: Index/Document CRUD & Minimum Search Examples
Elasticsearch (ES 7.x/8.x) minimum examples for index creation, document CRUD, query by ID, and _search, with response samples and screenshots to quickly run through the...
Big Data 170 - Elasticsearch 7.3.0 Three-Node Cluster Practice
Elasticsearch 7.3.0 three-node cluster deployment practice tutorial, covering directory creation and permission settings.
Big Data 167 - ELK Elastic Stack Practice: Architecture, Indexing and Troubleshooting
The article introduces core capabilities and common practices of Elasticsearch 8.x, Logstash 8.x, and Kibana 8.x.
Elasticsearch Single Machine Cloud Server Deployment & Operations
Elasticsearch is a distributed full-text search engine, supports single-node mode and cluster mode deployment. Generally, small companies can use Single-Node Mode for the...
Big Data 165 - Apache Kylin Cube7 Practice: Aggregation Group, RowKey and Encoding
Covers Aggregation Group, Mandatory Dimension, Hierarchy Dimension, and Joint Dimension usage trade-offs, and explains the impact of dictionary encoding and RowKey ordering.
Apache Kylin 1.6 Streaming Cubing Practice: Kafka to Minute-level OLAP
Kafka→Kylin real-time OLAP pipeline, providing minute-level aggregation queries for common 2025 business scenarios (e-commerce transactions, user behavior...
Spark RDD Deep Dive: Five Key Features
This is article 69 in the Big Data series, deeply analyzing RDD, Spark's core data abstraction, its five key features and design principles.
Spark RDD Creation & Transformation Operations
This is article 70 in the Big Data series, comprehensively explaining Spark RDD's three creation methods and practical usage of common Transformation operators.
Apache Kylin Segment Merge Practice: Manual/Auto Merge, Retention Threshold
Apache Kylin Segment merge practice tutorial, covering manual MERGE Job flow, continuous Segment requirements, Auto Merge multi-level threshold strategy...
Big Data 164 - Apache Kylin Cuboid Pruning Practice: Derived Dimensions & Expansion Control
Cuboid pruning optimization: When there are many dimensions, Cuboid count grows exponentially, causing long build time and storage expansion.
Big Data 161 - Apache Kylin Cube Practice: Modeling, Building and Query Acceleration
Apache Kylin 4.0 Cube modeling and query acceleration method: Complete star modeling with fact tables and dimension tables, design dimensions and measures.
Apache Kylin Incremental Cube & Segment Practice: Daily Partition Column
Using date field of Hive partitioned table as Partition Date Column, split Cube into multiple Segments, incrementally build by range to avoid repeated computation of hist...
Apache Kylin Cube Practice: Hive Load & Pre-computation Acceleration
Apache Kylin is an open-source distributed analysis engine, focused on providing real-time OLAP (Online Analytical Processing) capabilities for big data.
Apache Kylin Cube Practice: From Modeling to Build and Query
Scenario: Using e-commerce sales fact table, pre-compute aggregation queries accelerated by "date" dimension on Kylin.
From MapReduce to Spark: Big Data Computing Evolution
Systematic overview of big data processing engine evolution from MapReduce to Spark to Flink, analyzing Spark in-memory computing model, unified ecosystem and core compon...
Apache Kylin Comprehensive Guide: MOLAP Architecture, Hive Integration
Background, evolution and engineering practice of Apache Kylin, focusing on MOLAP solution implementation path for massive data analysis.
Big Data 158 - Apache Kylin 3.1.1 Deployment on Hadoop, Hive and HBase
Complete deployment record of Apache Kylin 3.1.1 on Hadoop 2.9.2, Hive 2.3.9, HBase 1.3.1, Spark 2.4.5 (without-hadoop.
Kafka Storage Mechanism: Log Segmentation & Retention
This is article 65 in the Big Data series, deeply analyzing Kafka's log storage mechanism.
Kafka High Performance: Zero-Copy, mmap & Sequential Write
This is article 66 in the Big Data series, deeply analyzing Kafka's underlying I/O optimization technologies achieving extremely high throughput.
Kafka Replica Mechanism: ISR & Leader Election
Deep dive into Kafka replica mechanism, including ISR sync node set maintenance, Leader election process, and unclean election trade-offs between consistency and availabi...
Kafka Exactly-Once: Idempotence & Transactions
Systematic explanation of how Kafka achieves Exactly-Once semantics through idempotent producers and transactions, covering PID/sequence number principle...
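The PID/sequence-number mechanism above can be sketched as broker-side logic: each (producerId, partition) pair tracks the last accepted sequence number and rejects anything that is not exactly the next one. A minimal, self-contained toy model (class and method names are made up for illustration, this is not Kafka's implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of broker-side idempotence: accept a batch only if its
// sequence number is exactly lastSeq + 1 for that (PID, partition).
class SequenceDedup {
    private final Map<String, Integer> lastSeq = new HashMap<>();

    /** Returns true if the batch is accepted, false if duplicate or out of order. */
    boolean tryAppend(long producerId, int partition, int sequence) {
        String key = producerId + "-" + partition;
        int last = lastSeq.getOrDefault(key, -1);
        if (sequence == last + 1) {   // in-order, new batch
            lastSeq.put(key, sequence);
            return true;
        }
        return false;                 // duplicate (<= last) or gap (> last + 1)
    }
}
```

A retried batch carries the same sequence number as the original send, so the broker drops it as a duplicate — that is why producer retries do not create duplicate records.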
Big Data 155 - Apache Druid Storage & Query Architecture: Segment, Chunk, Roll-up & Bitmap Indexes
Apache Druid data storage and high-performance query path: from DataSource/Chunk/Segment layering, to columnar storage, Roll-up pre-aggregation, Bitmap indexes.
Big Data 156 - Apache Druid + Kafka Real-time Analysis: JSON Flattening, Ingestion & SQL Metrics
A Scala Kafka Producer writes order/click data to a Kafka Topic (example topic: druid2), with continuous ingestion into Druid through the Kafka Indexing Service.
Big Data 153 - Apache Druid Real-time Kafka Ingestion: Complete Practice from Ingestion to Query
Complete practice of Apache Druid real-time Kafka ingestion, using network traffic JSON as example, completing data ingestion through Druid console's Streaming/Kafka wiza...
Apache Druid Architecture & Component Responsibilities: Coordinator/Overlord
Apache Druid component responsibilities and deployment points from 0.13.0 to current (2025): the Coordinator manages Segment distribution across Historical nodes.
Apache Druid Cluster Deployment [Part 1]: MySQL Metadata Store
Scenario: three-node mixed deployment on 2C4G/2C2G machines, Druid 30.0.0, working with Kafka/HDFS/MySQL. Conclusion: it can run on low specs, but the crux is sizing DirectMemory and processing threads.
Apache Druid Cluster Mode [Part 2]: Low-Memory Cluster Practice
Low-memory cluster practice for Apache Druid 30.0.0 on three nodes: provides JVM parameters and runtime.
Kafka Topic, Partition & Consumer: Rebalance Optimization
Deep dive into Kafka Topic, Partition, Consumer Group core mechanisms, covering custom deserialization, offset management and rebalance optimization configuration.
Kafka Topic Management: Commands & Java API
Comprehensive introduction to Kafka Topic operations, including kafka-topics.sh commands, replica assignment strategy principles, and KafkaAdminClient Java API core usage.
Apache Druid Real-time OLAP Architecture & Selection Points
Apache Druid real-time OLAP practice: suited to time-keyed event detail data, sub-second aggregations and high-concurrency self-service analysis.
Big Data 150 - Apache Druid Single-Machine Deployment: Architecture Overview and Startup
Scenario: Quickly experience Apache Druid 30.0.0 locally/single-machine, verify real-time and historical queries and console access.
Big Data 148 - Flink Write to Kudu: Custom Sink Full Practice
Complete runnable example for Kudu, based on Flink 1.11.1 (Scala 2.12)/Java 11 and kudu-client 1.17.0 (2025 test).
Kafka Producer Interceptor & Interceptor Chain
Introduction to Kafka 0.10 Producer interceptor mechanism, covering onSend and onAcknowledgement interception points, interceptor chain execution order and error isolatio...
Big Data 60 - Kafka Consumer: Consumption Flow, Heartbeat and Parameter Tuning
Detailed explanation of Kafka Consumer Group consumption model, partition assignment strategy, heartbeat keep-alive mechanism, and tuning practices for key parameters lik...
Apache Kudu Docker Quick Deployment: 3 Master/5 TServer Pattern
Apache Kudu Docker Compose quick deployment solution on Ubuntu 22.04 cloud host, covering Kudu Master and Tablet Server components.
Big Data 147 - Java Access Apache Kudu: From Table Creation to CRUD
Java client (kudu-client 1.4.0) connects to Apache Kudu with multiple Masters (example ports 7051/7151/7251), completes full process of table creation.
Apache Kudu: Real-time Write + OLAP Architecture, Performance
Apache Kudu is an open-source storage engine developed by Cloudera and contributed to Apache Software Foundation.
Apache Kudu Architecture & Practice: RowSet, Partition & Raft Consensus
Apache Kudu's Master/TabletServer architecture, RowSet (MemRowSet/DiskRowSet) write/read path, MVCC, and Raft consensus role in replica and failover
ClickHouse MergeTree Partition/TTL, Materialized View, ALTER
ClickHouse is a columnar database for OLAP (Online Analytical Processing), favored in big data analysis for its high-speed data processing.
Kafka Producer Message Sending Flow & Core Parameters
Deep analysis of Kafka Producer initialization, message interception, serialization, partition routing, buffer batch sending, ACK confirmation and complete sending chain.
Kafka Serialization & Partitioning: Custom Implementation
Deep dive into Kafka message serialization and partition routing, including complete code for custom Serializer and Partitioner, mastering precise message routing and eff...
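Key-based partition routing boils down to "hash the key, take it modulo the partition count, so the same key always lands on the same partition". Kafka's DefaultPartitioner murmur2-hashes the key bytes; the sketch below substitutes String.hashCode purely for illustration and is not Kafka client code:

```java
// Simplified key-based partition routing: identical keys always map to
// the same partition, preserving per-key ordering. Kafka's real default
// uses murmur2 over the serialized key bytes.
class SimplePartitioner {
    static int partitionFor(String key, int numPartitions) {
        if (key == null) {
            // real Kafka round-robins / sticky-batches null keys; we just pick 0
            return 0;
        }
        return (key.hashCode() & 0x7fffffff) % numPartitions;  // mask keeps it non-negative
    }
}
```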
Big Data 141 - ClickHouse Replicas: ReplicatedMergeTree and ZooKeeper
ReplicatedMergeTree coordinates its replicas through ZooKeeper, which handles communication between the multiple instances.
ClickHouse Sharding × Replica × Distributed: ReplicatedMergeTree
Replica refers to storing the same data on different physical nodes in a distributed system. Its core idea is to improve system reliability through data redundancy.
Big Data 139 - ClickHouse MergeTree Best Practices: Replacing Deduplication, Summing Aggregation, Partition Design & Materialized View Alternatives
Scenario: Solve two common "quasi-real-time detail table" requirements: deduplication/update and key-based summing.
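The deduplication/update case is typically handled with ReplacingMergeTree keyed by the sort key plus a version column; a DDL sketch under assumed table and column names (illustrative only):

```sql
-- Keep only the latest row per order_id; `ver` decides which duplicate wins.
CREATE TABLE orders_latest
(
    order_id UInt64,
    amount   Decimal(18, 2),
    ver      UInt64            -- e.g. an update timestamp
)
ENGINE = ReplacingMergeTree(ver)
ORDER BY order_id;

-- Background merges deduplicate eventually; FINAL forces dedup at query time (slower).
SELECT * FROM orders_latest FINAL WHERE order_id = 42;
```

Note that deduplication is eventual — it happens at merge time — so exact reads need FINAL or GROUP BY, which is the trade-off this article's scenario works through.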
Big Data 140 - ClickHouse CollapsingMergeTree & External Data Sources
ClickHouse external data source engine guide: DDL templates, key parameters and read/write pipelines for ENGINE=HDFS, ENGINE=MySQL, ENGINE=Kafka, and distributed table co...
Big Data 137 - ClickHouse MergeTree Practical Guide
ClickHouse MergeTree key mechanisms: batch writes form data parts, merged in the background (two part formats: Compact and Wide).
Big Data 138 - ClickHouse MergeTree Deep Dive: Partition Pruning × Sparse Primary Index × Marks × Compression
ClickHouse MergeTree storage and query path: column files (*.bin), sparse primary index (primary.idx), marker files (.mrk/.
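The sparse primary index (primary.idx) stores one entry per index granule — every index_granularity rows, 8192 by default — rather than per row; a lookup binary-searches the marks, then scans only the matching granule. A self-contained toy version of that idea (names are illustrative, not ClickHouse internals):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sparse index: keep the first key of every `granularity` rows of a
// sorted column. Lookup returns the granule (row range) that may contain
// the key, mirroring how primary.idx narrows reads to a few marks.
class SparseIndex {
    private final List<Long> marks = new ArrayList<>(); // first key per granule
    private final int granularity;

    SparseIndex(long[] sortedKeys, int granularity) {
        this.granularity = granularity;
        for (int i = 0; i < sortedKeys.length; i += granularity) {
            marks.add(sortedKeys[i]);
        }
    }

    /** Index of the granule whose key range may contain `key`. */
    int granuleFor(long key) {
        int lo = 0, hi = marks.size() - 1, ans = 0;
        while (lo <= hi) {                       // binary search over the marks
            int mid = (lo + hi) >>> 1;
            if (marks.get(mid) <= key) { ans = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return ans;
    }

    int firstRowOfGranule(int g) { return g * granularity; }
}
```

With 8192-row granules, a billion-row part needs only ~122k index entries — small enough to keep in memory, which is the point of the sparse design.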
Kafka Operations: Shell Commands & Java Client Examples
Covers Kafka daily operations: daemon startup, Shell topic management commands, and Java client programming (complete Producer/Consumer code) with key configuration param...
Spring Boot Integration with Kafka
Detailed guide on integrating Kafka in Spring Boot projects, including dependency configuration, KafkaTemplate sync/async message sending, and complete @KafkaListener con...
Spark Distributed Environment Setup
Step-by-step Apache Spark distributed computing environment setup, covering download and extract, environment variable configuration, slaves/spark-env.
Big Data 135 - ClickHouse Cluster Connectivity Self-Check & Data Types Guide | Run ON CLUSTER in 10 Minutes
Using three-node cluster (h121/122/123) as example, first complete cluster connectivity self-check: system.
Big Data 136 - ClickHouse Table Engines: TinyLog/Log/StripeLog/Memory/Merge Selection Guide
Scenario: trading off among small datasets, temporary tables, log persistence and multi-table combined reads, where defaulting to MergeTree is often overkill.
Kafka Components: Producer, Broker, Consumer Full Flow
Deep dive into Kafka's three core components: Producer partitioning strategy and ACK mechanism, Broker Leader/Follower architecture, Consumer Group partition assignment a...
Kafka Installation: From ZooKeeper to KRaft Evolution
Introduction to Kafka 2.x vs 3.x core differences, detailed cluster installation steps, ZooKeeper configuration, Broker parameter settings, and how KRaft mode replaces Zo...
Big Data 133 - ClickHouse Concepts & Basics | Why Fast? Columnar + Vectorized + MergeTree Comparison
Scenario: you want high-concurrency, low-latency OLAP without deploying an entire Hadoop/lakehouse stack.
Big Data 134 - ClickHouse Single Machine + Cluster Node Deployment Guide | Installation Configuration | systemd Management / config.d
Install ClickHouse on Ubuntu via the officially recommended keyring + signed-by method, start it with systemd, and run a self-check.
Big Data 131 - Flink CEP Practice: 24 Hours ≥5 Transactions & 10 Minutes Unpaid Detection Cases
Flink CEP (Complex Event Processing) is an extension library provided by Apache Flink for real-time complex event processing.
Big Data 132 - Flink SQL Quick Start | Table API + SQL in 3 Minutes with toChangelogStream New Syntax
Engineering perspective to quickly run Flink SQL: Provides modern dependencies (no longer using blink planner), minimum runnable example (MRE).
Flink CEP Deep Dive: Complex Event Processing Complete Guide
Flink CEP (Complex Event Processing) is a core component of Apache Flink, specifically designed for processing complex event streams.
Flink CEP Timeout Event Extraction: Complete Guide with Matched and Timed-out Events
Flink CEP timeout event extraction is a key step in stream processing, used to capture partial matching events that exceed the window time (within) during pattern matchin...
Redis High Availability: Master-Slave Replication & Sentinel
This is article 51 in the Big Data series, covering Redis high availability architecture: master-slave replication, Sentinel mode, and distributed lock design.
Kafka Architecture: High-Throughput Distributed Messaging
Systematic introduction to Kafka core architecture: Topic/Partition/Replica model, ISR mechanism, zero-copy optimization, message format and typical use cases.
Flink StateBackend Deep Dive: Memory, Fs, RocksDB and Operator State
ManagedOperatorState is used to manage non-keyed state, achieving state consistency when operators recover from faults or scale.
Flink Parallelism Deep Dive: From Concepts to Best Practices
Basic Concept of Parallelism In Apache Flink, Parallelism refers to the number of parallel tasks that can run simultaneously for each operator during execution.
Big Data 125 - Flink Broadcast State: Dynamic Logic Updates in Real-Time Streaming
Broadcast State is an important mechanism in Apache Flink that supports dynamic logic updates in streaming applications.
Big Data 126 - Flink State Backend: Memory, Fs, RocksDB and Performance Differences
State Storage Methods: MemoryStateBackend: Stores state in TaskManager's Java memory. Fast but limited (5MB per state default, 10MB per task).
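As a reference point, switching backends is a flink-conf.yaml change; the keys below are the Flink 1.13-era names and the checkpoint path is a placeholder:

```yaml
# Spill operator state to RocksDB on local disk; checkpoints go to a durable FS.
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoints   # placeholder path
state.backend.incremental: true                    # upload only changed SST files
```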
Flink Parallelism Setting Priority: Principles, Configuration and Tuning
A Flink program consists of multiple Operators (Source, Transformation, Sink).
Big Data 124 - Flink State: Keyed State, Operator State and KeyGroups
Based on whether intermediate state is needed, Flink computation can be divided into stateful and stateless: Stateless computation like Map, Filter, FlatMap.
Redis Cache Problems: Penetration, Breakdown, Avalanche and Solutions
Systematic overview of the five most common Redis cache problems in high-concurrency scenarios: cache penetration, cache breakdown, cache avalanche, hot key, and big key.
Big Data 50 - Redis Distributed Lock: Optimistic Lock, WATCH and SETNX
Redis optimistic lock in practice: WATCH/MULTI/EXEC mechanism explained, Lua scripts for atomic operations, SETNX+EXPIRE distributed lock from basics to Redisson...
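The SETNX+EXPIRE pattern reduces to "create the key only if absent, with a TTL, and let only the token holder delete it" (in real Redis the value and TTL must be set atomically, e.g. SET key token NX PX 30000, and the compare-and-delete must be a Lua script). A self-contained in-memory analogue of those semantics — not Redis client code, and not itself race-free:

```java
import java.util.concurrent.ConcurrentHashMap;

// In-memory analogue of SETNX + EXPIRE + check-token-before-DEL.
// `now` is passed explicitly so expiry behavior is easy to follow.
class ToyLock {
    private static class Entry {
        final String token; final long expiresAt;
        Entry(String t, long e) { token = t; expiresAt = e; }
    }

    private final ConcurrentHashMap<String, Entry> store = new ConcurrentHashMap<>();

    /** Like SET key token NX PX ttl — succeeds only if absent or expired. */
    boolean tryLock(String key, String token, long ttlMillis, long now) {
        Entry cur = store.get(key);
        if (cur != null && cur.expiresAt > now) return false;  // held, not expired
        store.put(key, new Entry(token, now + ttlMillis));
        return true;
    }

    /** DEL only if the token matches — never release someone else's lock. */
    boolean unlock(String key, String token) {
        Entry cur = store.get(key);
        if (cur == null || !cur.token.equals(token)) return false;
        store.remove(key);
        return true;
    }
}
```

The token check on unlock is the detail people miss: without it, a client whose lock expired can delete the lock a different client now holds.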
Big Data 121 - Flink Time Semantics: EventTime, ProcessingTime, IngestionTime & Watermarks
A Watermark is a special marker that tells Flink how far event time has progressed in the data stream.
Big Data 122 - Flink Watermark Guide: Event Time, Out-of-Order Data and Late Events
When using event-time based windows, Flink relies on Watermark to decide when to trigger window computation.
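The common bounded-out-of-orderness strategy tracks the maximum event timestamp seen and emits watermark = maxTimestamp − maxOutOfOrderness − 1; an event-time window [start, end) fires once the watermark reaches end − 1. A minimal sketch of that rule (it mirrors the idea behind Flink's BoundedOutOfOrdernessWatermarks, but is not Flink API code):

```java
// Watermark = highest event time seen so far minus the allowed lateness,
// minus 1 ms so that "watermark >= t" means "no more events with time <= t".
class BoundedOutOfOrderness {
    private long maxTimestamp = Long.MIN_VALUE;
    private final long maxOutOfOrderness;

    BoundedOutOfOrderness(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrderness = maxOutOfOrdernessMillis;
    }

    void onEvent(long eventTimestamp) {
        // late (smaller) timestamps do not move the watermark backwards
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    long currentWatermark() {
        return maxTimestamp - maxOutOfOrderness - 1;
    }

    /** An event-time window [start, end) fires when watermark >= end - 1. */
    boolean windowFires(long windowEnd) {
        return currentWatermark() >= windowEnd - 1;
    }
}
```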
Flink Window Complete Guide: Tumbling, Sliding, Session
Flink's Window mechanism is the core bridge between stream processing and unified batch processing architecture.
Flink Sliding Window Deep Dive: Principles, Use Cases and Implementation
Sliding window is a generalization of the fixed (tumbling) window, achieving dynamic window movement by introducing a slide interval. It is defined by two key parameters: window size and slide interval.
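Those two parameters determine that every event belongs to size/slide overlapping windows; the assignment rule below mirrors the standard calculation (a sketch, not Flink's SlidingEventTimeWindows source):

```java
import java.util.ArrayList;
import java.util.List;

// For sliding windows of length `size` that advance by `slide`, an event
// at timestamp ts belongs to every window [start, start + size) whose
// start lies on the slide grid and satisfies start <= ts < start + size.
class SlidingWindows {
    static List<long[]> assign(long ts, long size, long slide) {
        List<long[]> windows = new ArrayList<>();
        long lastStart = ts - ((ts % slide) + slide) % slide; // floor ts to the slide grid
        for (long start = lastStart; start > ts - size; start -= slide) {
            windows.add(new long[] { start, start + size });
        }
        return windows;
    }
}
```

For example, with size 10 and slide 5 each event falls into 10/5 = 2 windows, which is exactly what the loop produces.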
Flink JDBC Sink Deep Dive: MySQL Real-time Write, Batch Output and Retry
In Apache Flink, JDBC Sink is an important data output component that allows writing stream or batch processed data to relational databases through JDBC connections.
Flink Batch Processing DataSet API: Use Cases, Code Examples and Core Operators
Apache Flink's DataSet API is the core programming interface for Flink batch processing, specifically designed for processing static, bounded datasets.
Redis Memory Management: Key Expiration and Eviction Policies
Comprehensive analysis of Redis memory control mechanisms, including maxmemory configuration, three key expiration deletion strategies (lazy/active/scheduled).
Big Data 48 - Redis Communication Internals: RESP Protocol and Reactor Model
This is article 48 in the Big Data series. This article provides an in-depth analysis of Redis communication protocol RESP and Reactor-based event-driven architecture.
Big Data 115 - Flink DataStream Transformation: Map, FlatMap and Filter
Flink provides rich operators for DataStream to support flexible data stream processing in different scenarios.
Big Data 116 - Flink Sink Usage Guide: Types, Fault Tolerance Semantics & Scenarios
Flink's Sink is the final output endpoint for data stream processing, used to write processed results to external systems or storage media.
Flink Source Operator Deep Dive: Non-Parallel Source Principles
Non-Parallel Source is a source operation in Flink with fixed parallelism of 1. It can only run in a single instance regardless of cluster scale, ensuring tasks are proce...
Flink SourceFunction to RichSourceFunction: Enhanced Source Lifecycle and Resource Management
RichSourceFunction and RichParallelSourceFunction are enhanced source functions suitable for scenarios requiring complex logic and resource management.
Big Data 111 - Flink on YARN Deployment: Environment Variables, Configuration & Resource Requests
Deploying Flink in YARN mode requires completing a series of environment configuration and cluster management operations.
Flink DataStream API: DataSource, Transformation and Sink Components
The DataStream API is built from three components: DataSource, Transformation and Sink. DataSource provides diverse data input methods including file systems, message queues, databases and custom data sources.
Redis Persistence: RDB vs AOF Comparison and Production Settings
Systematic comparison of Redis two persistence solutions: RDB snapshot and AOF log — configuration methods, trigger mechanisms, pros and cons, AOF rewrite mechanism.
Big Data 46 - Redis RDB Persistence: Snapshot Principles, Configuration and Tradeoffs
In-depth analysis of Redis RDB persistence mechanism, covering trigger methods, BGSAVE execution flow, configuration parameters, file structure, and comparison with AOF.
Flink Architecture Deep Dive: JobManager, TaskManager and Client
Flink's runtime architecture adopts typical Master/Slave pattern with clear division of responsibilities among core components.
Big Data 110 - Flink Installation and Deployment Guide: Local, Standalone and YARN
Flink provides multiple installation modes to suit different scenarios.
Apache Flink Deep Dive: From Origin to Technical Features
Apache Flink is an open-source big data stream processing framework, supporting efficient computation of unbounded stream and bounded batch data.
Big Data 108 - Flink Stream-Batch Integration: Concepts & WordCount Practice
Definition: Stream processing means real-time processing of continuously flowing data streams.
Redis Lua Scripts: EVAL, redis.call and Atomic Operations
Systematic explanation of Redis Lua script EVAL command syntax, differences between redis.call and redis.
Redis Slow Query Log and Performance Tuning in Production
Detailed explanation of Redis slow query log configuration parameters (slowlog-log-slower-than, slowlog-max-len), core commands.
Spark Streaming Kafka Consumption: Offset Acquisition, Storage and Management
When Spark Streaming integrates with Kafka, Offset management is key to ensuring data processing continuity and consistency.
Big Data 104 - Spark Streaming with Kafka: Offset Management Mechanisms & Best Practices
Offset is used to mark message position in Kafka partition. Proper management can achieve at-least-once or even exactly-once data processing semantics.
Spark Streaming Stateful Transformations: Window Operations and State Tracking
Window operations integrate data from multiple batches over a longer time range by setting window length and slide duration.
Big Data 102 - Spark Streaming with Kafka: Receiver and Direct Approaches
This article introduces two Spark Streaming integration methods with Kafka: Receiver Approach and Direct Approach.
Redis Advanced Data Types: Bitmap, Geo and Stream
Deep dive into Redis three advanced data types: Bitmap, Geo (GeoHash, Z-order curve, Base32 encoding), and Stream message stream, with common commands and practical examp...
Redis Pub/Sub: Mechanism, Weak Transaction and Risks
Detailed explanation of Redis Pub/Sub working mechanism, three weak transaction flaws (no persistence, no acknowledgment, no retry), and alternative solutions in producti...
Redis Single Node and Cluster Installation
Install Redis 6.2.9 from source on Ubuntu, configure redis.conf for daemon mode, start redis-server and verify connection via redis-cli.
Big Data 40 - Redis Five Data Types: Command Reference and Practice
Comprehensive explanation of Redis five data types: String, List, Set, Sorted Set, and Hash. Includes common commands, underlying characteristics, and typical usage scena...
Big Data 37 - HBase Java API: Complete CRUD Code with Table Creation
Using HBase Java Client API to implement table creation, insert, delete, Get query, full table scan, and range scan.
Redis Introduction: Features and Architecture
Introduction to Redis: in-memory data structure store, key-value database, with comparison to traditional databases and typical use cases.
HBase Cluster Deployment and High Availability Configuration
This is article 35 in the Big Data series. Complete HBase distributed cluster deployment on three-node Hadoop + ZooKeeper cluster.
HBase Shell CRUD Operations and Data Model
HBase Shell commands: create table, Put/Get/Scan/Delete operations, explain HBase data model with practical examples.
Big Data 33 - HBase Overall Architecture: HMaster, HRegionServer and Data Model
Comprehensive analysis of HBase distributed database overall architecture, including ZooKeeper coordination, HMaster management node, HRegionServer data node...
HBase Single Node Configuration: hbase-env and hbase-site.xml
Step-by-step configure HBase single node environment, explain hbase-env.sh, hbase-site.xml key parameters, complete integration with Hadoop HDFS and ZooKeeper cluster.
ZooKeeper Leader Election and ZAB Protocol Principles
This is article 31 in the Big Data series. Deep analysis of ZooKeeper Leader election mechanism and ZAB (ZooKeeper Atomic Broadcast) protocol implementation principles.
ZooKeeper Distributed Lock Java Implementation Details
This is article 32 in the Big Data series. Demonstrates how to implement fair distributed lock using ZooKeeper ephemeral sequential nodes, with complete Java code.
ZooKeeper Watcher Principle and Command Line Practice Guide
Complete analysis of Watcher registration-trigger-notification flow from client, WatchManager to ZooKeeper server, and zkCli command line practice demonstrating node CRUD...
ZooKeeper Java API Practice: Node CRUD and Monitoring
Use ZkClient library to operate ZooKeeper via Java code, complete practical examples of session establishment, persistent node CRUD, child node change monitoring...
ZooKeeper Cluster Configuration Details and Startup Verification
Deep dive into zoo.cfg core parameter meanings, explain myid file configuration specifications, demonstrate 3-node cluster startup process and Leader election result veri...
ZooKeeper ZNode Data Structure and Watcher Mechanism Details
Deep dive into ZooKeeper's four ZNode node types, ZXID transaction ID structure, and one-time trigger Watcher monitoring mechanism principles and practice.
Sqoop Incremental Import and CDC Change Data Capture Principles
Introduce Sqoop's --incremental append incremental import mechanism, and deeply explain CDC (Change Data Capture) core concepts, capture method comparisons...
ZooKeeper Distributed Coordination Framework Introduction and ZAB Protocol
Introduction to ZooKeeper core concepts, Leader/Follower/Observer role division, ZAB protocol principles, and demonstration of 3-node cluster installation and configurati...
Big Data 23 - Sqoop Partial Import: --query, --columns and --where
Detailed explanation of three ways Sqoop imports partial data from MySQL to HDFS by condition: custom query, specify columns, WHERE condition filtering, with applicable s...
Sqoop and Hive Integration: MySQL ↔ Hive Bidirectional Data Transfer
Demonstrates Sqoop importing MySQL data directly to Hive table, and exporting Hive data back to MySQL, covering key parameters like --hive-import, --create-hive-table usa...
Sqoop Data Migration ETL Tool Introduction and Installation
Introduction to Apache Sqoop core principles, use cases, and installation configuration steps on Hadoop cluster, helping quickly get started with batch data migration bet...
Sqoop Practice: MySQL Full Data Import to HDFS
Complete example demonstrating Sqoop importing MySQL table data to HDFS, covering core parameter explanations, MapReduce parallel mechanism, and execution result verifica...
Flume Collect Hive Logs to HDFS
Use Flume exec source to real-time track Hive log files, buffer via memory channel, configure HDFS sink to write with time-based partitioning, implement automatic log dat...
Flume Dual Sink: Write Logs to Both HDFS and Local File
This is article 20 in the Big Data series. Demonstrates Flume replication mode with dual Sink architecture—same data written to both HDFS and local filesystem.
Apache Flume Architecture and Core Concepts
Introduction to Apache Flume positioning, core components (Source, Channel, Sink), event model and common data flow topologies, and installation configuration methods.
Flume Hello World: NetCat Source + Memory Channel + Logger Sink
Through Flume's simplest Hello World case, use netcat source to monitor port, memory channel for buffering, logger sink for console output, demonstrating complete Source→...
Hive Metastore Three Modes and Remote Deployment
Detailed explanation of Hive Metastore's embedded, local, and remote deployment modes, and complete steps to configure high-availability remote Metastore on three-node cl...
HiveServer2 Configuration and Beeline Remote Connection
Introduction to HiveServer2 architecture and role, configure Hadoop proxy user and WebHDFS, implement cross-node JDBC remote access to Hive via Beeline client.
Hive DDL and DML Operations
Systematic explanation of Hive DDL (database/table creation, internal and external tables) and DML (data loading, insertion, query) operations.
Hive HQL Advanced: Data Import/Export and Query Practice
Deep dive into Hive's multiple data import methods (LOAD/INSERT/External Table/Sqoop), data export methods, and practical usage of HQL query operations like aggregation...
MapReduce JOIN Four Implementation Strategies
This is article 11 in the Big Data series. Introduces four classic strategies for implementing multi-table JOIN in MapReduce framework and their Java implementations.
Hive Introduction: Architecture and Cluster Installation
Introduction to Hive data warehouse core concepts, architecture components and pros/cons, with detailed steps to install and configure Hive 2.3.9 on three-node Hadoop clu...
HDFS Java Client Practice: Upload/Download Files, Directory Operations and API Usage
This is article 9 in the Big Data series. Learn to operate HDFS through Java code, master Hadoop's Java Client API.
Java Implementation MapReduce WordCount Complete Code
Implement Hadoop MapReduce WordCount from scratch: Hadoop serialization mechanism detailed explanation, writing Mapper, Reducer, Driver three components...
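WordCount's data flow can be previewed without a cluster: the map phase emits (word, 1) pairs, the framework shuffles them by key, and the reduce phase sums each group. A plain-Java simulation of those steps (the article itself uses Hadoop's Mapper/Reducer classes; this sketch only shows the data flow):

```java
import java.util.Map;
import java.util.TreeMap;

// Simulates MapReduce WordCount: map -> group by key -> reduce (sum).
class WordCountSim {
    static Map<String, Integer> wordCount(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {                         // map phase: one (word, 1) per token
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) continue;
                // shuffle + reduce collapsed: the framework would route all
                // (word, 1) pairs for a key to one reducer, which sums them
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```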
HDFS Distributed File System Read/Write Principle
Deep dive into HDFS architecture: NameNode, DataNode, Client roles, Block storage mechanism, file read/write process (Pipeline write and nearest read), and HDFS basic com...
HDFS CLI Practice Complete Command Guide
Complete HDFS CLI practice: hadoop fs common commands including directory operations, file upload/download, permission management, with three-node cluster live demo.
Hadoop Cluster WordCount Distributed Computing Practice
Complete WordCount execution on Hadoop cluster: upload files to HDFS, submit MapReduce job, view running status through YARN UI, verify true distributed computing.
Hadoop JobHistoryServer Configuration and Log Aggregation
Configure Hadoop JobHistoryServer to record MapReduce job execution history, enable YARN log aggregation, view job details and logs via Web UI.
Hadoop Cluster SSH Passwordless Login Configuration and Distribution Script
Complete guide for Hadoop three-node cluster SSH passwordless login: generate RSA keys, distribute public keys, write rsync cluster distribution script.
Hadoop Cluster Startup and Web UI Verification
Complete startup process for Hadoop three-node cluster: format NameNode, start HDFS and YARN, verify cluster status via Web UI, including start-dfs.sh and start-yarn.
Basic Environment Setup: Hadoop Cluster
This article is migrated from Juejin. Original link: Big Data 01 - Basic Environment Setup
Hadoop Cluster XML Configuration Details
Detailed explanation of Hadoop cluster three-node XML configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.