Basic Environment Setup: Hadoop Cluster
Detailed tutorial on setting up Hadoop cluster environment on 3 cloud servers (2C4G configuration), including HDFS, MapReduce, YARN components introduction, Java and Hadoop environment configuration steps.
Hadoop / Hive / Kafka / Spark / Flink full-stack big data engineering in practice, from environment setup to production deployment.
277 articles
Detailed explanation of Hadoop cluster three-node XML configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, including NameNode, DataNode, ResourceManager configuration instructions.
Complete guide for Hadoop three-node cluster SSH passwordless login: generate RSA keys, distribute public keys, write rsync cluster distribution script, including pitfall notes and /etc/hosts configuration points.
Complete startup process for Hadoop three-node cluster: format NameNode, start HDFS and YARN, verify cluster status via Web UI, including start-dfs.sh and start-yarn.sh usage.
Complete WordCount execution on Hadoop cluster: upload files to HDFS, submit MapReduce job, view running status through YARN UI, verify true distributed computing.
Configure Hadoop JobHistoryServer to record MapReduce job execution history, enable YARN log aggregation, view job details and logs via Web UI.
Deep dive into HDFS architecture: NameNode, DataNode, Client roles, Block storage mechanism, file read/write process (Pipeline write and nearest read), and HDFS basic commands.
Complete HDFS CLI practice: hadoop fs common commands including directory operations, file upload/download, permission management, with three-node cluster live demo.
Using Hadoop HDFS Java Client API for file operations: Maven dependency configuration, FileSystem/Path/Configuration core classes, implement file upload, download, delete, list scan and progress bar display.
Implement Hadoop MapReduce WordCount from scratch: Hadoop serialization mechanism detailed explanation, writing Mapper, Reducer, Driver three components, Maven project configuration, local and cluster run complete code.
Deep dive into four JOIN strategies in MapReduce: Reduce-Side Join, Map-Side Join, Semi-Join, and Bloom Join principles and Java implementations, with analysis of applicable scenarios and performance characteristics.
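As a minimal illustration of the reduce-side join idea above, the shuffle can be simulated in plain Python (table names and records here are hypothetical): tag each record with its source table, group by join key, then combine the groups in the "reducer".

```python
from collections import defaultdict

def reduce_side_join(orders, users):
    """Simulate a reduce-side join: tag records by source, group by key,
    then cross the two tagged groups per key (the 'reduce' phase)."""
    grouped = defaultdict(lambda: {"order": [], "user": []})
    for uid, amount in orders:                 # map phase: tag with source table
        grouped[uid]["order"].append(amount)
    for uid, name in users:
        grouped[uid]["user"].append(name)
    joined = []                                # reduce phase: join within each key group
    for uid, g in grouped.items():
        for name in g["user"]:
            for amount in g["order"]:
                joined.append((uid, name, amount))
    return sorted(joined)

print(reduce_side_join([(1, 30), (2, 50)], [(1, "ann"), (2, "bob")]))
```

In real MapReduce the grouping happens in the shuffle; this sketch only shows why all records sharing a key must meet at one reducer.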
Introduction to Hive data warehouse core concepts, architecture components and pros/cons, with detailed steps to install and configure Hive 2.3.9 on three-node Hadoop cluster.
Systematic explanation of Hive DDL (database/table creation, internal and external tables) and DML (data loading, insertion, query) operations, with complete HiveQL examples and configuration optimization.
Deep dive into Hive's multiple data import methods (LOAD/INSERT/External Table/Sqoop), data export methods, and practical usage of HQL query operations like aggregation, filtering, and sorting.
Detailed explanation of Hive Metastore's embedded, local, and remote deployment modes, and complete steps to configure high-availability remote Metastore on three-node cluster.
Introduction to HiveServer2 architecture and role, configure Hadoop proxy user and WebHDFS, implement cross-node JDBC remote access to Hive via Beeline client.
Introduction to Apache Flume positioning, core components (Source, Channel, Sink), event model and common data flow topologies, and installation configuration methods.
Through Flume's simplest Hello World case, use netcat source to monitor port, memory channel for buffering, logger sink for console output, demonstrating complete Source→Channel→Sink data flow.
Use Flume exec source to real-time track Hive log files, buffer via memory channel, configure HDFS sink to write with time-based partitioning, implement automatic log data landing to HDFS.
Through Flume replication mode (Replicating Channel Selector) and three-Agent cascade architecture, implement same log data written to both HDFS and local file, meeting both offline analysis and real-time backup needs.
Introduction to Apache Sqoop core principles, use cases, and installation configuration steps on Hadoop cluster, helping quickly get started with batch data migration between MySQL and HDFS/Hive.
Complete example demonstrating Sqoop importing MySQL table data to HDFS, covering core parameter explanations, MapReduce parallel mechanism, and execution result verification.
Detailed explanation of three ways Sqoop imports partial data from MySQL to HDFS by condition: custom query, specify columns, WHERE condition filtering, with applicable scenarios and precautions.
Demonstrates Sqoop importing MySQL data directly to Hive table, and exporting Hive data back to MySQL, covering key parameters like --hive-import, --create-hive-table usage.
Introduce Sqoop's --incremental append incremental import mechanism, and deeply explain CDC (Change Data Capture) core concepts, capture method comparisons, and modern solutions like Flink CDC, Debezium.
Introduction to ZooKeeper core concepts, Leader/Follower/Observer role division, ZAB protocol principles, and demonstration of 3-node cluster installation and configuration process.
Deep dive into zoo.cfg core parameter meanings, explain myid file configuration specifications, demonstrate 3-node cluster startup process and Leader election result verification.
Deep dive into ZooKeeper's four ZNode node types, ZXID transaction ID structure, and one-time trigger Watcher monitoring mechanism principles and practice.
Complete analysis of Watcher registration-trigger-notification flow from client, WatchManager to ZooKeeper server, and zkCli command line practice demonstrating node CRUD and monitoring.
Use ZkClient library to operate ZooKeeper via Java code, complete practical examples of session establishment, persistent node CRUD, child node change monitoring, and data change monitoring.
Deep dive into ZooKeeper's Leader election mechanism and ZAB (ZooKeeper Atomic Broadcast) protocol, covering initial election process, message broadcast three phases, fault recovery strategy, and production deployment suggestions.
Implement distributed lock based on ZooKeeper ephemeral sequential nodes, with complete Java code, covering lock competition, predecessor node monitoring, CountDownLatch synchronization, and recursive retry complete flow.
Comprehensive analysis of HBase distributed database overall architecture, including ZooKeeper coordination, HMaster management node, HRegionServer data node, Region storage unit, and four-dimensional data model, suitable for big data architecture selection reference.
Step-by-step configure HBase single node environment, explain hbase-env.sh, hbase-site.xml key parameters, complete integration with Hadoop HDFS and ZooKeeper cluster.
Complete HBase distributed cluster deployment: configure RegionServer on multiple nodes, HMaster high availability, integrate with ZooKeeper for coordination, with start/stop scripts and verification steps.
HBase Shell commands: create table, Put/Get/Scan/Delete operations, explain HBase data model with practical examples.
Using HBase Java Client API to implement table creation, insert, delete, Get query, full table scan, and range scan. Includes complete Maven dependencies and runnable code examples covering all common HBase operations.
Introduction to Redis: in-memory data structure store, key-value database, with comparison to traditional databases and typical use cases.
Install Redis 6.2.9 from source on Ubuntu, configure redis.conf for daemon mode, start redis-server and verify connection via redis-cli.
Comprehensive explanation of Redis five data types: String, List, Set, Sorted Set, and Hash. Includes common commands, underlying implementation characteristics, and typical usage scenarios with complete command examples.
Deep dive into Redis three advanced data types: Bitmap, Geo (GeoHash, Z-order curve, Base32 encoding), and Stream message stream, with common commands and practical examples.
Detailed explanation of Redis Pub/Sub working mechanism, three weak transaction flaws (no persistence, no acknowledgment, no retry), and alternative solutions in production.
Window operations integrate data from multiple batches over a longer time range by setting window length and slide duration. Cases demonstrate reduceByWindow...
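The window semantics described (window length plus slide duration) can be sketched in plain Python, independent of Spark's API; the function name is illustrative, not Spark's:

```python
def sliding_window_sums(batches, window_len, slide):
    """Aggregate per-batch values over sliding windows spanning `window_len`
    batches, advancing by `slide` batches (mirrors reduceByWindow semantics)."""
    sums = []
    for end in range(window_len, len(batches) + 1, slide):
        sums.append(sum(batches[end - window_len:end]))
    return sums

# a window of 3 batches sliding by 1 batch: consecutive windows overlap
print(sliding_window_sums([1, 2, 3, 4, 5], 3, 1))
```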
This article introduces two Spark Streaming integration methods with Kafka: Receiver Approach and Direct Approach. Receiver uses Executor-based Receiver to...
When Spark Streaming integrates with Kafka, Offset management is key to ensuring data processing continuity and consistency. Offset marks message position in...
Offset is used to mark message position in Kafka partition. Proper management can achieve at-least-once or even exactly-once data processing semantics. By persisting Offset, application can resume consumption from last processed position during fault recovery, avoiding message loss or duplication.
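A toy sketch of the offset-checkpointing idea described above, with in-memory stand-ins for the Kafka log and the offset store (all names hypothetical). Committing the offset only after processing succeeds is what gives at-least-once semantics:

```python
def consume(log, offset_store, process):
    """Resume from the last committed offset; commit only after each record
    is processed, so a crash replays at most the in-flight record."""
    pos = offset_store.get("offset", 0)
    while pos < len(log):
        process(log[pos])
        pos += 1
        offset_store["offset"] = pos   # persist progress after processing

store, seen = {}, []
consume(["a", "b", "c"], store, seen.append)
# a restart resumes from the committed offset; only the new record is processed
consume(["a", "b", "c", "d"], store, seen.append)
print(seen, store["offset"])
```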
Systematic explanation of Redis Lua script EVAL command syntax, differences between redis.call and redis.pcall, and four typical practical cases: atomic counter, CAS (Compare-And-Swap), batch operations, and distributed lock implementation using Lua scripts.
Detailed explanation of Redis slow query log configuration parameters (slowlog-log-slower-than, slowlog-max-len), core commands, and production-grade performance tuning strategies including data structure optimization, Pipeline usage, and monitoring system setup.
Apache Flink is an open-source big data stream processing framework, supporting efficient computation of unbounded stream and bounded batch data. With 'unified...
Apache Flink supports both stream processing and batch processing. Stream processing is suitable for real-time data like sensors, logs or trading streams,...
Flink's runtime architecture adopts typical Master/Slave pattern with clear division of responsibilities among core components. JobManager as Master is...
Flink provides multiple installation modes to suit different scenarios. Local mode is suitable for personal learning and small-scale debugging with simple...
Deploying Flink in YARN mode requires completing a series of environment configuration and cluster management operations. First, configure environment...
A Flink program is built from DataSource, Transformation and Sink. DataSource provides diverse data input methods including file systems, message queues, databases and custom data sources.
Systematic comparison of Redis two persistence solutions: RDB snapshot and AOF log — configuration methods, trigger mechanisms, pros and cons, AOF rewrite mechanism, and recommended strategies for production environments.
In-depth analysis of Redis RDB persistence mechanism, covering trigger methods, BGSAVE execution flow, configuration parameters, file structure, and comparison with AOF, helping you make informed persistence decisions in production environments.
Non-Parallel Source is a source operation in Flink with fixed parallelism of 1. It can only run in a single instance regardless of cluster scale, ensuring tasks are processed sequentially.
RichSourceFunction and RichParallelSourceFunction are enhanced source functions suitable for scenarios requiring complex logic and resource management.
Flink provides rich operators for DataStream to support flexible data stream processing in different scenarios. Common operators include Map, FlatMap and...
Flink's Sink is the final output endpoint for data stream processing, used to write processed results to external systems or storage media. It is the endpoint of streaming applications, determining how data is saved, transmitted or consumed.
JDBC Sink is one of the most commonly used data output components, often used to write stream and batch processing results to relational databases like MySQL,...
Flink's DataSet API is the core programming interface for batch processing, designed for processing static, bounded datasets, supporting TB to PB scale big...
Comprehensive analysis of Redis memory control mechanisms, including maxmemory configuration, three key expiration deletion strategies (lazy/active/scheduled), and 8 memory eviction policies with applicable scenarios and selection guidance.
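Among the eviction policies mentioned, LRU variants are the most commonly chosen; here is a minimal sketch of the idea using Python's OrderedDict (note Redis actually approximates LRU by sampling keys, so this is the exact textbook form, not Redis's implementation):

```python
from collections import OrderedDict

class LRUCache:
    """Evict the least-recently-used key once capacity (a stand-in for
    maxmemory, counted in entries) is exceeded."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # touch: mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least-recently-used entry

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")                             # "a" is now more recent than "b"
cache.put("c", 3)                          # evicts "b"
print(list(cache.data))
```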
Deep dive into Redis communication internals: RESP serialization protocol five data types, Pipeline batch processing mode, and how the epoll-based Reactor single-threaded event-driven architecture supports Redis high-concurrency processing capability.
Flink's Window mechanism is the core bridge between stream processing and unified batch processing architecture. Flink treats batch as a special case of stream processing, using time windows (Tumbling, Sliding, Session) and count windows to split infinite streams into finite datasets.
Sliding Window is one of the core mechanisms in Apache Flink stream processing, more flexible than fixed windows, widely used in real-time monitoring, anomaly...
Watermark is a special marker used to tell Flink the progress of events in the data stream. Simply put, Watermark is the 'current time' estimated by Flink in...
Flink's Watermark mechanism is one of the most core concepts in event time window computation, used for handling out-of-order events and ensuring accurate...
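The watermark idea in the two articles above reduces to a simple rule: watermark = highest event time seen so far minus the allowed out-of-orderness. A plain-Python sketch of that rule (function and variable names are illustrative, not Flink's API):

```python
def watermarks(event_times, max_out_of_orderness):
    """After each event, emit a watermark: the max event time observed so far
    minus the allowed lateness. Windows ending at or before it may fire."""
    marks, max_seen = [], float("-inf")
    for t in event_times:
        max_seen = max(max_seen, t)
        marks.append(max_seen - max_out_of_orderness)
    return marks

# an out-of-order stream with 2 time units of allowed lateness:
# late events (2 after 3, 5 after 7) never move the watermark backwards
print(watermarks([1, 3, 2, 7, 5], 2))
```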
A Flink program consists of multiple Operators (Source, Transformation, Sink). An Operator is executed by multiple parallel Tasks (threads), and the number of...
Based on whether intermediate state is needed, Flink computation can be divided into stateful and stateless: Stateless computation like Map, Filter, FlatMap...
Systematic overview of the five most common Redis cache problems in high-concurrency scenarios: cache penetration, cache breakdown, cache avalanche, hot key, and big key. Analyzes the root cause of each problem and provides actionable solutions.
Redis optimistic lock in practice: WATCH/MULTI/EXEC mechanism explained, Lua scripts for atomic operations, SETNX+EXPIRE distributed lock from basics to Redisson, with complete Java code examples.
Broadcast State is an important mechanism in Apache Flink that supports dynamic logic updates in streaming applications, widely used in real-time risk control,...
State Storage (State Backend) is the core mechanism for implementing stateful stream computing in Flink, determining data reliability, performance and fault...
ManagedOperatorState is used to manage non-keyed state, achieving state consistency when operators recover from faults or scale. Developers can use ManagedOperatorState by implementing CheckpointedFunction interface, supporting ListState and BroadcastState two data structures.
In Flink, Parallelism is the core parameter measuring task concurrent processing capability, determining the number of tasks that can run simultaneously for...
Flink CEP is the core component for real-time analysis of complex event streams in Flink, providing a complete pattern matching framework, supporting...
Flink CEP timeout event extraction is a key stage in stream processing, used to capture partially matched events that exceed the window time (within) during pattern...
Deep dive into Redis high availability: master-slave replication, Sentinel automatic failover, and distributed lock design with Docker deployment examples.
Systematic introduction to Kafka core architecture: Topic/Partition/Replica model, ISR mechanism, zero-copy optimization, message format and typical use cases.
Flink CEP (Complex Event Processing) complex event processing mechanism, combined with actual cases to deeply explain its application principles and practical...
Engineering perspective to quickly run Flink SQL: Provides modern dependencies (no longer using blink planner), minimum runnable example (MRE), Table API and...
For high-concurrency, low-latency OLAP scenarios, this article explains ClickHouse's underlying advantages (columnar+compression+vectorized, MergeTree family),...
Official recommended keyring + signed-by installation of ClickHouse on Ubuntu, start with systemd and self-check; provides single machine minimum example...
Using three-node cluster (h121/122/123) as example, first complete cluster connectivity self-check: system.clusters validation → ON CLUSTER create...
Sort through ClickHouse table engines: TinyLog, Log, StripeLog, Memory, Merge principles, applicable scenarios and pitfalls, provide reproducible minimum...
Deep dive into Kafka's three core components: Producer partitioning strategy and ACK mechanism, Broker Leader/Follower architecture, Consumer Group partition assignment and offset management.
Introduction to Kafka 2.x vs 3.x core differences, detailed cluster installation steps, ZooKeeper configuration, Broker parameter settings, and how KRaft mode replaces ZooKeeper dependency.
ClickHouse MergeTree key mechanisms: batch writes form parts, background merge (Compact/Wide two part forms), ORDER BY is sparse primary index,...
ClickHouse MergeTree storage and query path: column files (*.bin), sparse primary index (primary.idx), marker files (.mrk/.mrk2) and index_granularity...
Covers Kafka daily operations: daemon startup, Shell topic management commands, and Java client programming (complete Producer/Consumer code) with key configuration parameters and ConsumerRebalanceListener usage.
Detailed guide on integrating Kafka in Spring Boot projects, including dependency configuration, KafkaTemplate sync/async message sending, and complete @KafkaListener consumption practice.
Step-by-step Apache Spark distributed computing environment setup, covering download and extract, environment variable configuration, slaves/spark-env.sh core config adjustments, and complete multi-node cluster distribution and startup.
ClickHouse two light aggregation engines ReplacingMergeTree and SummingMergeTree, combined with minimum runnable examples (MRE) and comparative queries,...
ClickHouse external data source engine minimum feasible solution: DDL templates, key parameters and read/write paths for ENGINE=HDFS, ENGINE=MySQL, ENGINE=Kafka.
ClickHouse replica full chain: ZK/Keeper preparation, macros configuration, ON CLUSTER consistent table creation, write deduplication & replication mechanism,...
ClickHouse sharding × replica × Distributed architecture: Based on ReplicatedMergeTree + Distributed, using ON CLUSTER one-click table creation on 3-shard ×...
ClickHouse beginner and operations practice, based on real cluster (h121/h122/h123) demonstrating complete process from connection to database/table creation,...
Deep analysis of Kafka Producer initialization, message interception, serialization, partition routing, buffer batch sending, ACK confirmation and complete sending chain, with key parameter tuning suggestions.
Deep dive into Kafka message serialization and partition routing, including complete code for custom Serializer and Partitioner, mastering precise message routing and efficient transmission.
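The partition-routing contract described above (hash of the key modulo partition count for keyed messages, rotation for null keys) can be sketched in Python. Note the assumption: crc32 stands in for Kafka's murmur2 hash, so this illustrates the contract, not Kafka's exact placement.

```python
import zlib

def choose_partition(key, num_partitions, _counter=[0]):
    """Route by key hash when a key is present; rotate for null keys.
    Kafka's default partitioner uses murmur2; crc32 is a stand-in here."""
    if key is None:
        _counter[0] += 1
        return (_counter[0] - 1) % num_partitions
    return zlib.crc32(key.encode()) % num_partitions

# the same key always lands on the same partition, preserving per-key ordering
assert choose_partition("user-42", 6) == choose_partition("user-42", 6)
print([choose_partition(None, 3) for _ in range(4)])  # rotates 0, 1, 2, 0
```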
Apache Kudu in 2025: version status and ecosystem integration. The latest Kudu 1.18.0 (released 2025/07) brings a segmented LRU Block Cache and RocksDB-based metadata...
Apache Kudu's Master/TabletServer architecture, RowSet (MemRowSet/DiskRowSet) write/read path, MVCC, and Raft consensus role in replica and failover; provides...
Apache Kudu Docker Compose quick deployment solution on Ubuntu 22.04 cloud host, covering Kudu Master and Tablet Server components,...
Java client (kudu-client 1.4.0) connects to Apache Kudu with multiple Masters (example ports 7051/7151/7251), completes full process of table creation, insert,...
Complete runnable example for Kudu, based on Flink 1.11.1 (Scala 2.12)/Java 11 and kudu-client 1.17.0 (2025 test). Through RichSinkFunction custom sink,...
Introduction to Kafka 0.10 Producer interceptor mechanism, covering onSend and onAcknowledgement interception points, interceptor chain execution order and error isolation, with complete custom interceptor implementation.
Detailed explanation of Kafka Consumer Group consumption model, partition assignment strategy, heartbeat keep-alive mechanism, and tuning practices for key parameters like session.timeout.ms, heartbeat.interval.ms, max.poll.interval.ms.
Apache Druid real-time OLAP practice: suitable for event detail with time as primary key, sub-second aggregation and high-concurrency self-service analysis.
Apache Druid 30.0.0 for single-machine quick verification and engineering implementation, systematically reviewing Druid architecture (Coordinator, Historical,...
Apache Druid 30.0.0 deployable solution covering MySQL metadata storage (mysql-connector-java 8.0.19), HDFS deep storage and HDFS indexing-logs, plus Kafka...
Low-memory cluster practice for Apache Druid 30.0.0 on three nodes: provides JVM parameters and runtime.properties key items for Broker/Historical/Router, explains off-heap memory and processing buffer ratio relationship.
Deep dive into Kafka Topic, Partition, Consumer Group core mechanisms, covering custom deserialization, offset management and rebalance optimization configuration.
Comprehensive introduction to Kafka Topic operations, including kafka-topics.sh commands, replica assignment strategy principles, and KafkaAdminClient Java API core usage.
Complete practice of Apache Druid real-time Kafka ingestion, using network traffic JSON as example, completing data ingestion through Druid console's Streaming/Kafka wizard, parsing time column, setting dimensions and metrics, and verifying results with SQL.
Apache Druid component responsibilities and deployment points from 0.13.0 to current (2025): Coordinator manages Historical node Segment...
Apache Druid data storage and high-performance query path: from DataSource/Chunk/Segment layering, to columnar storage, Roll-up pre-aggregation, Bitmap...
Scala Kafka Producer writes order/click data to Kafka Topic (example topic: druid2), continuous ingestion in Druid through Kafka Indexing Service. Since...
Deep dive into Kafka replica mechanism, including ISR sync node set maintenance, Leader election process, and unclean election trade-offs between consistency and availability.
Systematic explanation of how Kafka achieves Exactly-Once semantics through idempotent producers and transactions, covering PID/sequence number principle, cross-partition transaction configuration and end-to-end EOS implementation.
Deep analysis of Kafka log storage architecture, including LogSegment design, sparse offset index and timestamp index principles, message lookup flow, and log retention and cleanup strategy configuration.
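The sparse-index lookup flow described above (find the last index entry at or before the target offset, then scan forward from that file position) is a classic bisect pattern; a minimal sketch with made-up index data:

```python
import bisect

def locate(sparse_index, target_offset):
    """sparse_index: sorted (offset, file_position) pairs, one per indexed
    interval. Return the file position to start the forward scan from."""
    offsets = [off for off, _ in sparse_index]
    i = bisect.bisect_right(offsets, target_offset) - 1  # last entry <= target
    return sparse_index[i][1]

# hypothetical index: one entry roughly every 100 messages / 4 KiB
index = [(0, 0), (100, 4096), (200, 8192)]
print(locate(index, 150))   # falls back to the entry for offset 100
```

The sparsity is the trade-off: fewer index entries mean a smaller in-memory index at the cost of a short sequential scan after the binary search.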
Deep dive into Kafka's three I/O technologies achieving high throughput: sendfile zero-copy, mmap memory mapping and page cache sequential write, revealing kernel-level optimization behind million messages per second.
Background, evolution and engineering practice of Apache Kylin, focusing on MOLAP solution implementation path for massive data analysis. Core keywords: Apache...
Complete deployment record of Apache Kylin 3.1.1 on Hadoop 2.9.2, Hive 2.3.9, HBase 1.3.1, Spark 2.4.5 (without-hadoop, Scala 2.12) and three-node...
OLAP example: generate dimension and fact data via Python, load it into Hive (database wzk_kylin), design the Cube in Kylin (dimensions/measures/Cuboids), and provide...
Apache Kylin (3.x/4.x) Cube setup and optimization: complete flow from DataSource → Model → Cube, covering dimension modeling, measure design, Cuboid...
Systematic overview of big data processing engine evolution from MapReduce to Spark to Flink, analyzing Spark in-memory computing model, unified ecosystem and core components.
Apache Kylin 4.0 Cube modeling and query acceleration method: Complete star modeling with fact tables and dimension tables, design dimensions and measures, use...
Using date field of Hive partitioned table as Partition Date Column, split Cube into multiple Segments, incrementally build by range to avoid repeated computation of historical data; also compare full build vs incremental build differences in query paths.
Apache Kylin Segment merge practice tutorial, covering manual MERGE Job flow, continuous Segment requirements, Auto Merge multi-level threshold strategy, Retention Threshold cleanup logic, deletion flow (Disable→Delete) and JDBC connection query examples.
Cuboid pruning optimization: When there are many dimensions, Cuboid count grows exponentially, causing long build time and storage expansion. Engineering...
Covers Aggregation Group, Mandatory Dimension, Hierarchy Dimension, Joint Dimension usage trade-offs, and explains impact of dictionary encoding, RowKey order, ShardBy sharding on build and query performance with CubeStatsReader precision/sparsity readings and RowKey/HBase storage model.
Kafka→Kylin real-time OLAP pipeline, providing minute-level aggregation queries for common 2025 business scenarios (e-commerce transactions, user behavior, IoT monitoring).
Comprehensive analysis of Spark core data abstraction RDD's five key features (partitions, compute function, dependencies, partitioner, preferred locations), lazy evaluation, fault tolerance, and narrow/wide dependency principles.
Detailed explanation of three RDD creation methods (parallelize, textFile, transform from existing RDD), and usage of common Transformation operators like map, filter, flatMap, groupBy, sortBy with lazy evaluation principles.
Article introduces core capabilities and common practices of Elasticsearch 8.x, Logstash 8.x, Kibana 8.x, covering key aspects of centralized logging system: collection, transmission, indexing, shard/replica, query DSL, aggregation and ILM lifecycle management.
Elasticsearch is a distributed full-text search engine that supports both single-node and cluster deployment. For small-scale business scenarios, single-node mode is generally sufficient.
Elasticsearch (ES 7.x/8.x) minimum examples: Create index, insert document, query by ID, update and _search search flow, with return samples and screenshots, help readers complete 'index/document CRUD' run-through in 3-10 minutes.
Elasticsearch 7.3.0 three-node cluster deployment practice tutorial, covering directory creation and permission settings, system parameter config...
Introduction to Elasticsearch-Head plugin and Kibana 7.3.0 installation and connectivity points, covering Chrome extension quick access, ES cluster health and...
Elasticsearch index create, existence check (single/multi/all), open/close/delete and health troubleshooting, as well as IK analyzer installation, ik_max_word/ik_smart analysis and Nginx hosting scheme for remote extended dictionary/stop words.
This article details Elasticsearch 7.x/8.x mapping config and document CRUD operations, including index/field mapping creation, mapping properties (type, index, store, analyzer), document create, query, full/partial update, delete by ID or condition.
In-depth explanation of core Query DSL usage in Elasticsearch 7.3, focusing on differences and pitfalls of match, match_phrase, query_string, multi_match and other full-text search statements in real business scenarios.
Deep dive into Spark cluster core components Driver, Cluster Manager, Executor responsibilities, comparison of Standalone, YARN, Kubernetes deployment modes, and static vs dynamic resource allocation strategies.
This article demonstrates Elasticsearch term-level queries including term, terms, range, exists, prefix, regexp, fuzzy, ids queries, and bool compound queries. Covers creating book index, inserting sample data, various query DSL examples and execution results.
This article introduces Filter DSL vs query difference: Filter DSL doesn't calculate relevance score, specifically optimized for filter scenario execution...
Covers complete practice of Metrics Aggregations and Bucket Aggregations, applicable to common Elasticsearch 7.x / 8.x versions in 2025. Article starts with...
elasticsearch-rest-high-level-client implements index and document CRUD, including: create index via JSON and XContentBuilder two ways, config shards and replicas, delete index, insert single document, query document by ID and use match_all to query all data.
Comprehensive introduction to Spark RDD Action operations, covering data collection, statistical aggregation, element retrieval, storage output categories, and detailed explanation of Key-Value RDD core operators like groupByKey, reduceByKey, join.
Article analyzes Elasticsearch inverted index principle based on Lucene, compares forward index vs inverted index differences, covering core concepts like...
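A forward index maps document → terms; the inverted index flips that to term → documents. A few lines of Python make the contrast concrete (the tokenizer and documents are toy stand-ins for Lucene's analyzers):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted doc_id list} — the
    posting lists an engine like Lucene stores per term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # toy analyzer: lowercase + whitespace
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "big data search", 2: "search engine", 3: "big search engine"}
idx = build_inverted_index(docs)
print(idx["search"])   # the posting list for the term "search"
```

Querying a term is then a dictionary lookup instead of scanning every document, which is the whole point of inverting.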
Article details core mechanism of Elasticsearch near real-time search, including Lucene Segment, Memory Buffer, File System Cache, Refresh, Flush and Translog...
Explains why refresh causes small segment increase, how segment merge merges small segments into large ones in background and cleans deleted documents, why too...
Article details the core data structures of the Elasticsearch inverted index: Terms Dictionary, Posting List, FST (Finite State Transducer) and SkipList, and how they accelerate...
Breaks down how write overwrites arise from Elasticsearch concurrency conflicts (read-modify-write inventory deduction), and gives an engineering solution using ES optimistic...
Doc values: an on-disk columnar data structure generated at indexing time, optimized for sorting, aggregation and script values; most supported field types enable doc values by default, but text fields do not, so aggregation/sorting on text requires a keyword subfield or enabling fielddata.
Logstash 7 getting-started tutorial, covering stdin/file collection, the sincedb mechanism and the conditions under which start_position takes effect, with a quick-reference table of common errors.
Logstash Input plugin comparison, breaking down the technical differences, applicable scenarios and key configurations of the JDBC Input and Syslog collection pipelines. JDBC...
Implement distributed WordCount using Spark + Scala and Spark + Java, detailed RDD five-step processing flow, Maven project configuration and spark-submit command.
Deep dive into Spark RDD programming through two classic cases: Monte Carlo method distributed Pi estimation, and mutual friends analysis in social networks with two approaches, comparing Cartesian product vs data transformation performance differences.
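The Monte Carlo estimate in the first case needs only the hit ratio of random points inside the unit quarter-circle; a single-machine sketch of the per-partition logic (Spark would simply sum the hit counts across partitions). The function name and seed are illustrative:

```python
import random

def estimate_pi(samples, seed=7):
    """Throw `samples` random points at the unit square; the fraction landing
    inside the quarter-circle approximates pi/4."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / samples

print(estimate_pi(100_000))   # converges toward pi as samples grow
```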
Article explains using grok in Logstash 7.3.0 environment to extract structured fields from console stdin and Nginx access logs (IP, time_local, method, request, status etc), and quickly verify parsing effect through stdout { codec => rubydebug }.
Logstash Output plugin (Logstash 7.3.0) practical tutorial, covering stdout (rubydebug) for debugging, file output for local archiving, Elasticsearch output...
Configure Nginx log_format json to output structured access_log (containing @timestamp, request_time, status, request_uri, ua and other fields), start...
Filebeat collects Nginx access.log and writes to Kafka, Logstash consumes from Kafka and parses message embedded JSON by field (app/type) conditions, adds...
Master / Data / Coordinating node responsibilities and production role isolation strategies, capacity planning calculations (JVM Heap 30-32GB limit, hot/cold data with disk/IO constraints, horizontal scaling path), plus shard and replica as core knobs for performance and reliability.
DataX (DataX 3.0) is an offline data synchronization/integration tool open-sourced by Alibaba and widely used internally, for enterprise-level heterogeneous data...
Implement complete production-ready word frequency pipeline: lowercase conversion, punctuation removal, stop word filtering, word frequency counting, finally efficiently write to MySQL via foreachPartition, comparing row-by-row insert vs partition batch write performance.
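The cleaning steps named above (lowercasing, punctuation stripping, stop-word filtering, counting) are easy to sketch in plain Python before distributing them with Spark; the stop-word list and text here are illustrative:

```python
import string
from collections import Counter

STOP_WORDS = {"the", "a", "and", "of"}   # illustrative stop-word list

def word_frequencies(text):
    """Lowercase, strip punctuation, drop stop words, count what remains."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    return Counter(words)

freq = word_frequencies("The quick fox and the lazy fox.")
print(freq.most_common(2))
```

In the distributed version each of these steps maps onto an RDD transformation, and only the final counts are written out per partition.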
Deep dive into Spark Driver-Executor process communication, Java/Kryo serialization selection, closure serialization problem troubleshooting, and RDD dependencies, Stage division and persistence storage levels.
Apache Tez (example version Tez 0.9.x) as execution engine alternative to MapReduce on Hadoop2/YARN, providing DAG (Directed Acyclic Graph) execution model for...
2025's most commonly used machine learning concept framework: supervised learning (classification/regression), unsupervised learning (clustering/dimensionality...
KNN/K-Nearest Neighbors Algorithm: From Euclidean distance calculation, distance sorting, TopK voting to function encapsulation, giving reproducible Python...
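The three KNN steps named there — distance calculation, distance sorting, TopK voting — fit in a short stdlib-only sketch (toy data and function names are illustrative, not the article's code):

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # 1) distance from the query to every training sample
    dists = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # 2) sort by distance and keep the K nearest
    dists.sort(key=lambda t: t[0])
    top_k = [label for _, label in dists[:k]]
    # 3) majority vote among the K neighbours
    return Counter(top_k).most_common(1)[0][0]

train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, (1.5, 1.5), k=3))  # → A
```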
From unified API (fit/predict/transform/score) to kneighbors to find K nearest neighbors of test samples, then using learning curve/parameter curve to select...
Random train/test split causes evaluation metrics to be unstable, and gives engineering solution: K-Fold Cross Validation. Through sklearn's cross_val_score to...
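The cross_val_score pattern mentioned above looks roughly like this minimal sketch (dataset and model choices are assumptions for the demo, not the article's):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: every sample is used for testing exactly once,
# so the mean score is far more stable than a single random split
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores.mean())
```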
In scikit-learn machine learning training pipeline, distance-based models like KNN are extremely sensitive to inconsistent feature scales: Euclidean distance...
Detailed explanation of Spark Checkpoint execution flow, core differences with persist/cache, partitioner strategies, and best practices for iterative algorithms and long dependency chain scenarios.
Detailed explanation of Spark broadcast variable working principle, configuration parameters and best practices, and performance optimization solution using broadcast to implement MapSideJoin instead of shuffle join.
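The map-side join idea can be shown without Spark: the small dimension table is shipped to every task (broadcast), and each fact row is joined by a local lookup instead of a shuffle. A plain-Python sketch with hypothetical data:

```python
# Small dimension table (fits in memory); in Spark this is what gets
# broadcast to every executor so the join happens locally, shuffle-free.
dim = {1: "Electronics", 2: "Books"}                   # hypothetical category table

facts = [(1, 100.0), (2, 30.0), (1, 55.0), (3, 9.9)]   # (category_id, amount)

def map_side_join(rows, small_table):
    # each fact row is enriched by a local dict lookup — no data movement
    return [(k, small_table.get(k, "UNKNOWN"), v) for k, v in rows]

joined = map_side_join(facts, dim)
print(joined[0])  # → (1, 'Electronics', 100.0)
```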
Decision Tree model systematic overview for classification tasks: three types of nodes (root/internal/leaf), recursive split flow from root to leaf, and...
Decision tree information gain (Information Gain) explained, first using information entropy (Entropy) to explain impurity, then explaining why when splitting...
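Entropy and information gain reduce to two short formulas; the sketch below uses the classic 9-positive / 5-negative example (the split is an illustrative assumption):

```python
import math
from collections import Counter

def entropy(labels):
    # impurity of a label set: -Σ p·log2(p)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    # gain = parent entropy minus the size-weighted entropy of the children
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["yes"] * 9 + ["no"] * 5          # the classic 9/5 example set
left   = ["yes"] * 6 + ["no"] * 2          # one candidate split of the parent
right  = ["yes"] * 3 + ["no"] * 3
print(round(information_gain(parent, [left, right]), 3))  # → 0.048
```

The split is chosen to maximize this gain: the bigger the drop in impurity, the better the feature separates the classes.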
Complete chain from 'splitting' to 'pruning', explaining why a greedy algorithm is usually used (yielding a local optimum), and the differences in splitting criteria between...
Complete flow of DecisionTreeClassifier on load_wine dataset from data splitting, model evaluation to decision tree visualization (2026 version). Focus on...
Common parameters for decision tree pruning (pre-pruning) in engineering: max_depth, min_samples_leaf, min_samples_split, max_features, min_impurity_decrease...
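A minimal sklearn sketch of those pre-pruning knobs (dataset and parameter values are demo assumptions, matching the load_wine setting of the neighbouring article):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: cap depth and require a minimum sample count per leaf/split,
# so the tree stops growing before it memorizes the training set.
tree = DecisionTreeClassifier(max_depth=3,
                              min_samples_leaf=5,
                              min_samples_split=10,
                              random_state=0).fit(X_tr, y_tr)
print(tree.get_depth(), tree.score(X_te, y_te))
```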
The confusion matrix (TP, FP, FN, TN) establishes a unified frame of reference and explains the business meaning of Accuracy, Precision, Recall (sensitivity), and the F1 Measure: Precision...
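The four metrics follow directly from the confusion-matrix counts; a stdlib sketch with made-up counts:

```python
def classification_metrics(tp, fp, fn, tn):
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted positives, how many are right
    recall    = tp / (tp + fn)   # of actual positives, how many we found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

acc, p, r, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(acc, p, r)  # → 0.7 0.8 0.666...
```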
Comprehensive explanation of Spark Standalone cluster four core components, application submission flow, SparkContext internal architecture, Shuffle evolution history and RDD optimization strategies.
Systematic introduction to SparkSQL evolution history, core abstractions DataFrame/Dataset, Catalyst optimizer principle, and practical usage of multi-data source integration with Hive/HDFS.
Linear Regression core chain: unify prediction function y=Xw in matrix form, treat parameter vector w as only unknown; use loss function to characterize...
Hand-written multivariate linear regression using pandas DataFrame and NumPy matrix multiplication. The core idea is to form the normal...
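The normal-equation solution, w = (XᵀX)⁻¹Xᵀy, is a few lines of NumPy; the data below is a noise-free toy set invented so the true weights [1, 2, 3] are recovered exactly:

```python
import numpy as np

# Design matrix with a leading bias column; true weights are [1, 2, 3].
X = np.array([[1, 1, 1],
              [1, 2, 1],
              [1, 1, 2],
              [1, 3, 2],
              [1, 2, 3]], dtype=float)
y = X @ np.array([1.0, 2.0, 3.0])      # exact, noise-free targets

# Normal equation: w = (XᵀX)⁻¹ Xᵀ y
w = np.linalg.inv(X.T @ X) @ X.T @ y   # recovers [1, 2, 3]
```

In practice `np.linalg.lstsq` (or a pseudo-inverse) is preferred over an explicit inverse, since XᵀX can be ill-conditioned.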
When using scikit-learn for linear regression, how to handle multicollinearity in least squares method. Multicollinearity may cause instability in regression...
Ridge Regression and Lasso Regression are two commonly used linear regression regularization methods for solving overfitting and multicollinearity in machine...
Logistic Regression (LR) is an important classification algorithm in machine learning, widely used in binary classification tasks like sentiment analysis,...
As C increases, regularization strength decreases; model performance on both the training and test sets trends upward until around C=0.8, where training...
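A C-sweep of that kind can be sketched as follows (dataset, scaling step, and C grid are demo assumptions, not the article's setup):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Standardize features so the regularized solver converges cleanly
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

scores = {}
for C in (0.01, 0.1, 1.0):
    # smaller C = stronger L2 regularization
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    scores[C] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
```

Plotting train vs test score over the C grid reveals where the model shifts from under- to overfitting.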
Deep comparison of Spark's three data abstractions RDD, DataFrame, Dataset features and use cases, introduction to SparkSession unified entry, and demonstration of mutual conversion methods between abstractions.
Systematically review SparkSQL Transformation and Action operators, covering select, filter, join, groupBy, union operations, with practical test cases demonstrating usage and performance optimization.
When using Logistic Regression in Scikit-Learn, max_iter controls maximum iterations affecting model convergence speed and accuracy. If training doesn't...
K-Means clustering algorithm, comparing supervised vs unsupervised learning (whether labels Y are needed), with engineering applications in customer...
Python K-Means clustering implementation: using NumPy broadcasting to compute squared Euclidean distance (distEclud), initializing centroids via uniform...
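The broadcasting trick behind a distEclud-style distance function: expand points to shape (n, 1, d) and centroids to (1, k, d), so subtraction yields all n×k difference vectors at once (a sketch assuming the function name; the toy data is invented):

```python
import numpy as np

def dist_eclud(points, centroids):
    # (n,1,d) - (1,k,d) broadcasts to (n,k,d); summing over d gives
    # the squared Euclidean distance from every point to every centroid
    diff = points[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=2)

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
cents  = np.array([[0.0, 0.0], [10.0, 10.0]])

d2 = dist_eclud(points, cents)       # shape (3, 2)
labels = d2.argmin(axis=1)           # assignment step of K-Means
print(labels)  # → [0 0 1]
```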
K-Means clustering provides an engineering workflow that is 'verifiable, reproducible, and debuggable': first use 2D testSet dataset for algorithm verification...
scikit-learn (sklearn) KMeans (2026) explains three most commonly used objects: cluster_centers_ (cluster centers), inertia_ (Within-Cluster Sum of Squares),...
KMeans n_clusters selection method: calculate silhouette_score and silhouette_samples on candidate cluster numbers (e.g., 2/4/6/8), determine optimal k by...
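A sketch of that k-selection loop on synthetic data (the 4-blob dataset and candidate grid are demo assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

scores = {}
for k in (2, 4, 6, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # mean silhouette, in [-1, 1]

best_k = max(scores, key=scores.get)          # pick k with the highest score
```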
Comprehensive guide to SparkSQL core usage including DataFrame API operations, SQL query syntax, lateral view explode, and Hive integration via enableHiveSupport for metadata and table operations.
Deep dive into SparkSQL's five Join execution strategies (BHJ, SHJ, SMJ, Cartesian, BNLJ) selection conditions and use cases, along with the complete processing flow of Catalyst optimizer from SQL parsing to code generation.
Prometheus 2.53.2 (still common in existing environments in 2025/2026) provides a reusable deployment process: download and extract binary on monitoring...
Common Prometheus monitoring deployment scenarios: Install node_exporter-1.8.2 on Rocky Linux (CentOS/RHEL compatible) to expose host metrics, integrate with Prometheus...
For OPs/devs still using CentOS/RHEL (including compatible distributions) in 2026, provides Grafana 11.3.0 (grafana-enterprise-11.3.0-1.x86_64.rpm) direct YUM...
2026 engineering practice, covering core concepts and implementation concerns for data warehouses: starting from enterprise data silos, explaining four...
When implementing an Offline Data Warehouse in an enterprise, the two most common problems are: data silos caused by data mart expansion, and duplicate...
Offline data warehouse modeling practice, covering fact tables, dimension tables, fact types, and snowflake/galaxy models.
Introduction to Spark's two generations of real-time computing frameworks: DStream micro-batch processing model's architecture and limitations, and how Structured Streaming solves EventTime processing and API consistency issues through unbounded table model and Catalyst optimization.
Comprehensive explanation of three Spark Streaming basic data sources: file stream directory monitoring, Socket TCP ingestion, RDD queue stream for testing simulation, with complete Scala code examples.
Complete guide to building offline data warehouses: data tracking, metric system design, theme analysis, and standardization practices for e-commerce teams.
Offline Data Warehouse (Offline DW) overall architecture design and implementation method: Framework selection comparison between Apache community version and...
Offline data warehouse member metrics practice with Flume, covering new members, active members, and retention metrics.
Flume 1.9.0 engineering optimization in offline data warehouse (log collection→HDFS) scenarios: Giving actionable value ranges and trade-off principles for key...
Systematically review Spark Streaming DStream stateless transformation operators and transform advanced operations, demonstrating three implementation approaches for blacklist filtering: leftOuterJoin, SQL, and broadcast variables.
In-depth explanation of Spark Streaming stateful computing: window operation parameter configuration, reduceByKeyAndWindow hot word statistics, updateStateByKey full-state maintenance and mapWithState incremental optimization, with complete Scala code.
Using TAILDIR Source to monitor multiple directories (start/event), with filegroup headers tagging each source with a logtype; then using custom...
Apache Flume's offline log collection chain, providing a set of engineering implementation: Use Taildir Source to monitor multiple directories and multiple...
Engineering implementation of the ODS (Operational Data Store) layer in an offline data warehouse: the minimal closed loop of Hive external table + daily...
Detailed explanation of two Spark Streaming integration modes with Kafka: Receiver-based high-level API vs Direct mode architecture differences, offset management, Exactly-Once semantics guarantee, and complete Scala code implementation.
JSON data processing in Hive offline data warehouse, covering three most common needs: 1) Extract array fields from JSON strings and explode-expand in SQL; 2)...
This article introduces using Hive to build an offline data warehouse for calculating active members (daily/weekly/monthly). Covers the complete flow from DWD...
The offline data warehouse calculates 'new members' daily and provides a consistently defined data foundation for subsequent 'member retention'. It uses the full member table (with first-seen dt) as the deduplication anchor; DWS outputs new-member details, ADS outputs the new-member count.
Systematic introduction to Apache Flink's origin, core features, and architecture components: JobManager, TaskManager, Dispatcher responsibilities, unified stream-batch processing model, and comparison with Spark Streaming for technology selection.
Implementation of 'Member Retention' in offline data warehouse: DWS layer uses dws_member_retention_day table to join new member and startup detail tables to...
The implementation path for exporting Hive ADS layer tables to MySQL in an offline data warehouse. Gives the typical DataX solution, hdfsreader -> mysqlwriter, focusing on DataX JSON configuration and common error fixes.
Demonstrates a complete pipeline from log collection to member metric analysis, covering Flume Taildir monitoring, HDFS partition storage, Hive external table loading, ODS/DWD/DWS/ADS layered processing, supporting active members, new members, member retention metrics calculation.
Complete tutorial for Apache Flink installation and deployment in three modes: Local, Standalone cluster, and YARN integration, including environment configuration, parameter tuning, and common issue solutions.
Detailed explanation of three Flink deployment modes on YARN cluster: Session, Application, Per-Job modes, Hadoop dependency configuration, YARN resource application and job submission process.
Offline data warehouse practice based on Hadoop + Hive + HDFS + DataX + MySQL, covering member metrics testing (active/new/retention), HDFS export, DataX sync to MySQL, and advertising business ODS/DWD/ADS full process modeling.
Hive offline data warehouse advertising business practice, combined with typical pipeline of Flume + Hive + UDF + Parquet, demonstrates how to map raw event...
Implementation of advertising impression, click, purchase hourly statistics based on Hive offline data warehouse, completing CTR, CVR and advertising effect...
Flink DataStream API getting started guide, program execution flow, environment acquisition, data source definition, operator chaining and execution mode details, demonstrating stream processing program development through WordCount case.
Comprehensive analysis of Flink Window mechanism: tumbling windows, sliding windows, session windows, Watermark principle and generation strategies, late data processing mechanism.
Using Flume Agent to collect event logs and write to HDFS, then use Hive scripts to complete ODS and DWD layer data loading by date. Content covers Flume Agent's Source, Channel, Sink basic structure, log file upload, Flume startup command, HDFS write verification.
Complete solution for exporting Hive ADS layer data to MySQL using DataX. Covers ADS loading, DataX configuration, MySQL table creation, Shell script parameterized execution, and common error diagnosis and fix checklist.
Focuses on three main metrics (order count, product count, payment amount), with breakdown analysis by sales region and product type (level-3 category).
Flink stateful computation explanation: Keyed State, Operator State, Checkpoint configuration, Savepoint backup and recovery, production environment practices.
Using DataX (MySQLReader + HDFSWriter) to extract daily incremental data from MySQL order tables, order detail tables, and product information tables into...
Sync MySQL data to specified HDFS directory via DataX, then create ODS external tables in Hive with unified dt string partitioning. Enables fast queries of raw transaction records within 7 days, demonstrating core characteristics of ODS layer.
This article systematically covers Slowly Changing Dimensions (SCD), detailing the core differences between SCD Type 0, 1, 2, 3, 4, and 6, and explains the applicable boundaries of snapshot tables and zipper tables in Hive offline data warehouse scenarios.
This article provides a practical guide to Hive zipper table implementation for offline data warehouse modeling, covering initial loading, daily incremental updates, historical version chain closing, Shell scheduling scripts, and rollback recovery logic.
This article provides a practical guide to Hive zipper tables for preserving order history states at low cost in offline data warehouses, supporting daily backtracking and change analysis. It covers incremental refresh of order status changes using 2020 order data as a case study.
This article continues the zipper table practice, focusing on order history state incremental refresh. It explains how to use ODS daily incremental table + DWD zipper table to preserve order historical states at low cost while supporting daily backtracking and change analysis.
First determines fact tables vs dimension tables: green indicates fact tables, gray indicates dimension tables. Dimension table processing strategies vary by...
The order table is a periodic fact table; zipper tables can be used to retain order status history. Order detail tables are regular fact tables. Order statuses...
Apache Airflow is an open-source task scheduling and workflow management platform, primarily used for developing, debugging, and monitoring data pipelines.
Apache Airflow is an open-source task scheduling and workflow management tool for orchestrating complex data processing tasks. Originally developed by Airbnb...
Flink Broadcast State explanation: BroadcastState principle, dynamic rule updates, state partitioning and memory management, demonstrating broadcast stream and non-broadcast stream join through cases.
Flink State Backend detailed explanation: HashMapStateBackend, EmbeddedRocksDBStateBackend selection, memory configuration and performance tuning.
Linux systems use the cron (crond) service for scheduled tasks, which is enabled by default. Linux also provides the crontab command for user-level task...
Flink memory model detailed explanation: Network Buffer Pool, Task Heap, State Backend memory allocation, GC tuning and backpressure handling.
Flink parallelism detailed explanation: Operator Chaining, Slot allocation strategy, parallelism settings and resource scheduling principle.
Atlas is a metadata framework for the Hadoop platform: a set of scalable core governance services enabling enterprises to effectively meet compliance...
Flink CEP detailed explanation: pattern sequence, individual patterns, combined patterns, matching skip strategies and practical cases.
Metadata (MetaData) in the narrow sense refers to data that describes data. Broadly, all information beyond business data used to maintain system operation can...
Apache Griffin is an open-source big data quality solution that supports both batch and streaming data quality detection. It can measure data assets from...
Livy is a REST interface for Apache Spark designed to simplify Spark job submission and management, especially in big data processing scenarios. Its main...
Apache Griffin is an open-source data quality management framework designed to help organizations monitor and improve data quality in big data environments.
Real-time data processing capability has become a key competitive factor for enterprises. Initially, each new requirement spawned a separate real-time task,...
Realtime data warehouse is a data warehouse system that differs from traditional batch processing data warehouses by emphasizing low latency, high throughput,...
Alibaba B2B's cross-region business between domestic sellers and overseas buyers drove the need for data synchronization between Hangzhou and US data centers.
Canal is an open-source tool for MySQL database binlog incremental subscription and consumption, primarily used for data synchronization and distributed...
MySQL's Binary Log (binlog) is a log file type in MySQL that records all change operations performed on the database (excluding SELECT and SHOW queries). It is...
Canal is an open-source data synchronization tool from Alibaba for MySQL database incremental log parsing and synchronization. It simulates the MySQL slave...
This article introduces Alibaba's open-source Canal tool, which implements Change Data Capture (CDC) by parsing MySQL binlog. Demonstrates how to integrate...
In internet companies, common ODS data includes business log data (Log) and business DB data. For business DB data, collecting data from relational databases...
Linear Regression is an analytical method that uses regression equations to model the relationship between one or more independent variables and a dependent...
Writing dimension tables (DIM) from Kafka typically involves reading real-time or batch data from Kafka topics and updating dimension tables based on the data...
DW (Data Warehouse layer) is built from DWD, DWS, and DIM layer data, completing data architecture and integration, establishing consistent dimensions, and...
Logistic regression is a classification model in machine learning — an efficient binary classification algorithm widely used in ad click-through rate...
Linear regression uses regression equations to model relationships between independent and dependent variables. This article covers regression scenarios (house...
This article introduces the basic principles, application scenarios, and implementation in Spark MLlib of logistic regression. Logistic regression is an efficient binary classification algorithm widely used in fields such as ad click-through rate prediction and spam email identification.
This article introduces the basic concepts and classification principles of decision trees. Decision tree is a non-linear...
This article systematically introduces decision tree pre-pruning and post-pruning principles, compares core differences between three mainstream algorithms...
This article systematically introduces ensemble learning methods in machine learning. Main content includes: 1) Basic definition and classification of ensemble...
Introduces the differences between Bagging and Boosting in machine learning, and the GBDT (Gradient Boosting Decision Tree) algorithm principles. Main content:...
This article introduces the principles and applications of gradient boosting tree (GBDT) algorithm. First explains boosting tree basic concept through simple examples, then details algorithm flow including negative gradient calculation, regression tree fitting, and model update steps.
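The boosting loop described there — compute residuals (the negative gradient for squared loss), fit a small regression tree to them, add it to the model — can be sketched with one-split "stump" trees in plain Python (toy data and learning rate are assumptions for the demo):

```python
def fit_stump(xs, residuals):
    # One-split regression tree fit to the residuals (least squares)
    best = None
    for t in xs:
        left  = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - (lm if x <= t else rm)) ** 2 for x, r in zip(xs, residuals))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gbdt_fit(xs, ys, rounds=20, lr=0.5):
    base = sum(ys) / len(ys)                 # initial model: the global mean
    pred = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # residual = negative gradient of squared loss at current prediction
        resid = [y - p for y, p in zip(ys, pred)]
        s = fit_stump(xs, resid)
        stumps.append(s)
        pred = [p + lr * s(x) for p, x in zip(pred, xs)]  # model update
    return base, stumps

def gbdt_predict(base, stumps, x, lr=0.5):
    return base + lr * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [5.0, 5.2, 5.1, 8.9, 9.1, 9.0]
base, stumps = gbdt_fit(xs, ys)
```

Each round shrinks the residuals, so the ensemble's squared error drops far below that of the initial mean-only model.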
GBDT practical case study walking through the complete process from residual calculation to regression tree construction and iterative training. Covers GBDT...