Blog
Technical exploration and thoughts · 655 articles
Elasticsearch Cluster Planning & Tuning: Node Roles, Shar...
Master / Data / Coordinating node responsibilities and production role isolation strategies, capacity planning calculations (JVM Heap 30-32GB limit, hot/cold data with disk/IO constraints, horizont...
DataX 3.0 Architecture & Practice: Reader/Writer Plugin M...
DataX (DataX 3.0) is an offline data synchronization and integration tool, widely used within Alibaba and open-sourced, for enterprise-level heterogeneous data...
Spark Super WordCount: Text Cleaning & MySQL Persistence
Implement a complete, production-ready word frequency pipeline: lowercase conversion, punctuation removal, stop word filtering, and word frequency counting, then write efficiently to MySQL via foreachP...
Spark Serialization & RDD Execution Principle
A deep dive into Spark Driver-Executor process communication, choosing between Java and Kryo serialization, troubleshooting closure serialization problems, and RDD dependencies, Stage division and persistence st...
Nginx JSON Logs to ELK: ZK+Kafka+Elasticsearch 7.3.0+Kiba...
Configure Nginx log_format json to output structured access_log (containing @timestamp, request_time, status, request_uri, ua and other fields), start...
Filebeat → Kafka → Logstash → Elasticsearch Practice
Filebeat collects the Nginx access.log and writes it to Kafka; Logstash consumes from Kafka, parses the JSON embedded in message according to field (app/type) conditions, adds...
Logstash Filter Plugin Practice: grok Parsing Console & N...
Explains how to use grok in a Logstash 7.3.0 environment to extract structured fields (IP, time_local, method, request, status, etc.) from console stdin and Nginx access logs, and quickly verify par...
Logstash Output Plugin Practice: stdout/file/Elasticsearc...
A practical tutorial on Logstash Output plugins (Logstash 7.3.0), covering stdout (rubydebug) for debugging, file output for local archiving, Elasticsearch output...
Logstash 7 Getting Started: stdin/file Collection, sinced...
A Logstash 7 getting-started tutorial covering stdin/file collection, the sincedb mechanism, and the conditions under which start_position takes effect, with a quick-reference table of errors.
Logstash JDBC vs Syslog Input: Principle, Scenario Compar...
A comparison of Logstash Input plugins, breaking down the technical differences between the JDBC Input and Syslog collection pipelines, their applicable scenarios, and key configs. JDBC...
Spark Scala WordCount Implementation
Implement a distributed WordCount using Spark + Scala and Spark + Java, with a detailed walkthrough of the five-step RDD processing flow, Maven project configuration, and the spark-submit command.
Spark Scala Practice: Pi Estimation & Mutual Friends
A deep dive into Spark RDD programming through two classic cases: distributed Pi estimation with the Monte Carlo method, and mutual friends analysis in social networks using two approaches, comparing Cartesian...
Elasticsearch Concurrency Conflicts & Optimistic Lock, Di...
Breaks down the cause of write overwrites in Elasticsearch concurrency conflicts (read-modify-write during inventory deduction), and gives an engineering solution using ES optimistic...
Elasticsearch Doc Values Mechanism Detailed: Columnar Sto...
An on-disk columnar data structure generated at indexing time, optimized for sorting, aggregations, and access to field values in scripts; enabled by default for most supported types, while text fields don't provide doc values by defau...
Elasticsearch Segment Merge & Disk Directory Breakdown: M...
Explains why refresh increases the number of small segments, how segment merge combines small segments into larger ones in the background and purges deleted documents, why too...
Elasticsearch Inverted Index Underlying Breakdown: Terms ...
Details the core data structures of the Elasticsearch inverted index, namely Terms Dictionary, Posting List, FST (Finite State Transducer), and SkipList, and how they accelerate...
Elasticsearch Inverted Index & Read/Write Process Full An...
Analyzes the principles of the Lucene-based Elasticsearch inverted index and compares forward and inverted indexes, covering core concepts like...
Elasticsearch Near Real-time Search: Segment, Refresh, Fl...
Details the core mechanisms of Elasticsearch near real-time search, including Lucene Segment, Memory Buffer, File System Cache, Refresh, Flush and Translog...
Spark Action Operations Overview
A comprehensive introduction to Spark RDD Action operations, covering the data collection, statistical aggregation, element retrieval, and storage output categories, with a detailed explanation of Key-Value RDD...
Elasticsearch Aggregation Practice: Metrics Aggregations ...
Covers Metrics Aggregations and Bucket Aggregations in full, applicable to the Elasticsearch 7.x / 8.x versions in common use in 2025. The article starts with...