Tag: data-engineering
57 articles
Hive Slowly Changing Dimension Type 2: Order History Stat...
Offline data warehouse needs to save order history state at low cost while supporting daily rollback and change analysis. This article introduces using ODS...
Flink State and Checkpoint: State Management, Fault Toler...
Flink stateful computation explanation: Keyed State, Operator State, Checkpoint configuration, Savepoint backup and recovery, production environment practices.
Flink Streaming Introduction: DataStream API & Program St...
Flink DataStream API getting started guide, program execution flow, environment acquisition, data source definition, operator chaining and execution mode details, demonstrating stream processing pr...
Flink Window and Watermark: Time Windows, Tumbling/Slidin...
Comprehensive analysis of Flink Window mechanism: tumbling windows, sliding windows, session windows, Watermark principle and generation strategies, late data processing mechanism.
Flink Installation & Deployment: Local, Standalone, YARN ...
Complete tutorial for Apache Flink installation and deployment in three modes: Local, Standalone cluster, and YARN integration, including environment configuration, parameter tuning, and common iss...
Flink on YARN Deployment: Environment Preparation, Resour...
Detailed explanation of three Flink deployment modes on YARN cluster: Session, Application, Per-Job modes, Hadoop dependency configuration, YARN resource application and job submission process.
Apache Flink Introduction: Unified Stream-Batch Real-Time...
Systematic introduction to Apache Flink's origin, core features, and architecture components: JobManager, TaskManager, Dispatcher responsibilities, unified stream-batch processing model, and compar...
Spark Streaming Integration with Kafka: Receiver and Dire...
Detailed explanation of two Spark Streaming integration modes with Kafka: Receiver-based high-level API vs Direct mode architecture differences, offset management, Exactly-Once semantics guarantee,...
Spark DStream Transformation Operators: map, reduceByKey,...
Systematically review Spark Streaming DStream stateless transformation operators and transform advanced operations, demonstrating three implementation approaches for blacklist filtering: leftOuterJ...
Spark Streaming Window Operations & State Tracking: updat...
In-depth explanation of Spark Streaming stateful computing: window operation parameter configuration, reduceByKeyAndWindow hot word statistics, updateStateByKey full-state maintenance and mapWithSt...
Spark Streaming Introduction: From DStream to Structured ...
Introduction to Spark's two generations of real-time computing frameworks: DStream micro-batch processing model's architecture and limitations, and how Structured Streaming solves EventTime process...
Spark Streaming Data Sources: File Stream, Socket, RDD Qu...
Comprehensive explanation of three Spark Streaming basic data sources: file stream directory monitoring, Socket TCP ingestion, RDD queue stream for testing simulation, with complete Scala code exam...
SparkSQL Statements: DataFrame Operations, SQL Queries & ...
Comprehensive guide to SparkSQL core usage including DataFrame API operations, SQL query syntax, lateral view explode, and Hive integration via enableHiveSupport for metadata and table operations.
SparkSQL Kernel: Five Join Strategies & Catalyst Optimize...
Deep dive into SparkSQL's five Join execution strategies (BHJ, SHJ, SMJ, Cartesian, BNLJ) selection conditions and use cases, along with the complete processing flow of Catalyst optimizer from SQL ...
SparkSQL Core Abstractions: RDD, DataFrame, Dataset & Spa...
Deep comparison of Spark's three data abstractions RDD, DataFrame, Dataset features and use cases, introduction to SparkSession unified entry, and demonstration of mutual conversion methods between...
SparkSQL Operators: Transformation & Action Operations
Systematically review SparkSQL Transformation and Action operators, covering select, filter, join, groupBy, union operations, with practical test cases demonstrating usage and performance optimizat...
Spark Standalone Mode: Architecture & Performance Tuning
Comprehensive explanation of Spark Standalone cluster four core components, application submission flow, SparkContext internal architecture, Shuffle evolution history and RDD optimization strategies.
SparkSQL Introduction: SQL & Distributed Computing Fusion
Systematic introduction to SparkSQL evolution history, core abstractions DataFrame/Dataset, Catalyst optimizer principle, and practical usage of multi-data source integration with Hive/HDFS.
Spark RDD Fault Tolerance: Checkpoint Principle & Best Pr...
Detailed explanation of Spark Checkpoint execution flow, core differences with persist/cache, partitioner strategies, and best practices for iterative algorithms and long dependency chain scenarios.
Spark Broadcast Variables: Efficient Shared Read-Only Data
Detailed explanation of Spark broadcast variable working principle, configuration parameters and best practices, and performance optimization solution using broadcast to implement MapSideJoin inste...
Spark Super WordCount: Text Cleaning & MySQL Persistence
Implement complete production-ready word frequency pipeline: lowercase conversion, punctuation removal, stop word filtering, word frequency counting, finally efficiently write to MySQL via foreachP...
Spark Scala WordCount Implementation
Implement distributed WordCount using Spark + Scala and Spark + Java, detailed RDD five-step processing flow, Maven project configuration and spark-submit command.
Spark Scala Practice: Pi Estimation & Mutual Friends
Deep dive into Spark RDD programming through two classic cases: Monte Carlo method distributed Pi estimation, and mutual friends analysis in social networks with two approaches, comparing Cartesian...
Spark Action Operations Overview
Comprehensive introduction to Spark RDD Action operations, covering data collection, statistical aggregation, element retrieval, storage output categories, and detailed explanation of Key-Value RDD...
Spark Cluster Architecture & Deployment Modes
Deep dive into Spark cluster core components Driver, Cluster Manager, Executor responsibilities, comparison of Standalone, YARN, Kubernetes deployment modes, and static vs dynamic resource allocati...
Spark RDD Deep Dive: Five Key Features
Comprehensive analysis of Spark core data abstraction RDD's five key features (partitions, compute function, dependencies, partitioner, preferred locations), lazy evaluation, fault tolerance, and n...
Spark RDD Creation & Transformation Operations
Detailed explanation of three RDD creation methods (parallelize, textFile, transform from existing RDD), and usage of common Transformation operators like map, filter, flatMap, groupBy, sortBy with...
From MapReduce to Spark: Big Data Computing Evolution
Systematic overview of big data processing engine evolution from MapReduce to Spark to Flink, analyzing Spark in-memory computing model, unified ecosystem and core components.
Kafka Storage Mechanism: Log Segmentation & Retention
Deep analysis of Kafka log storage architecture, including LogSegment design, sparse offset index and timestamp index principles, message lookup flow, and log retention and cleanup strategy configu...
Kafka Topic Management: Commands & Java API
Comprehensive introduction to Kafka Topic operations, including kafka-topics.sh commands, replica assignment strategy principles, and KafkaAdminClient Java API core usage.
Kafka Operations: Shell Commands & Java Client Examples
Covers Kafka daily operations: daemon startup, Shell topic management commands, and Java client programming (complete Producer/Consumer code) with key configuration parameters and ConsumerRebalance...
Spark Distributed Environment Setup
Step-by-step Apache Spark distributed computing environment setup, covering download and extract, environment variable configuration, slaves/spark-env.sh core config adjustments, and complete multi...
Kafka Installation: From ZooKeeper to KRaft Evolution
Introduction to Kafka 2.x vs 3.x core differences, detailed cluster installation steps, ZooKeeper configuration, Broker parameter settings, and how KRaft mode replaces ZooKeeper dependency.
Redis Memory Management: Key Expiration and Eviction Poli...
Comprehensive analysis of Redis memory control mechanisms, including maxmemory configuration, three key expiration deletion strategies (lazy/active/scheduled), and 8 memory eviction policies with a...
Redis Persistence: RDB vs AOF Comparison and Production S...
Systematic comparison of Redis two persistence solutions: RDB snapshot and AOF log — configuration methods, trigger mechanisms, pros and cons, AOF rewrite mechanism, and recommended strategies for ...
Redis RDB Persistence: Snapshot Principles, Configuration...
In-depth analysis of Redis RDB persistence mechanism, covering trigger methods, BGSAVE execution flow, configuration parameters, file structure, and comparison with AOF, helping you make informed p...
Redis Slow Query Log and Performance Tuning in Production
Detailed explanation of Redis slow query log configuration parameters (slowlog-log-slower-than, slowlog-max-len), core commands, and production-grade performance tuning strategies including data st...
Redis Advanced Data Types: Bitmap, Geo and Stream
Deep dive into Redis three advanced data types: Bitmap, Geo (GeoHash, Z-order curve, Base32 encoding), and Stream message stream, with common commands and practical examples.
Redis Single Node and Cluster Installation
Install Redis 6.2.9 from source on Ubuntu, configure redis.conf for daemon mode, start redis-server and verify connection via redis-cli.
Redis Five Data Types: Complete Command Reference and Pra...
Comprehensive explanation of Redis five data types: String, List, Set, Sorted Set, and Hash. Includes common commands, underlying characteristics, and typical usage scenarios with practical examples.
HBase Java API: Complete CRUD Code with Table Creation, I...
Using HBase Java Client API to implement table creation, insert, delete, Get query, full table scan, and range scan. Includes complete Maven dependencies and runnable code examples covering all com...
HBase Overall Architecture: HMaster, HRegionServer and Da...
Comprehensive analysis of HBase distributed database overall architecture, including ZooKeeper coordination, HMaster management node, HRegionServer data node, Region storage unit, and four-dimensio...
HBase Single Node Configuration: hbase-env and hbase-site...
Step-by-step configure HBase single node environment, explain hbase-env.sh, hbase-site.xml key parameters, complete integration with Hadoop HDFS and ZooKeeper cluster.
Sqoop Incremental Import and CDC Change Data Capture Prin...
Introduce Sqoop's --incremental append incremental import mechanism, and deeply explain CDC (Change Data Capture) core concepts, capture method comparisons, and modern solutions like Flink CDC, Deb...
Sqoop Partial Import: --query, --columns, --where Three F...
Detailed explanation of three ways Sqoop imports partial data from MySQL to HDFS by condition: custom query, specify columns, WHERE condition filtering, with applicable scenarios and precautions.
Sqoop and Hive Integration: MySQL ↔ Hive Bidirectional Da...
Demonstrates Sqoop importing MySQL data directly to Hive table, and exporting Hive data back to MySQL, covering key parameters like --hive-import, --create-hive-table usage.
Sqoop Data Migration ETL Tool Introduction and Installation
Introduction to Apache Sqoop core principles, use cases, and installation configuration steps on Hadoop cluster, helping quickly get started with batch data migration between MySQL and HDFS/Hive.
Sqoop Practice: MySQL Full Data Import to HDFS
Complete example demonstrating Sqoop importing MySQL table data to HDFS, covering core parameter explanations, MapReduce parallel mechanism, and execution result verification.
Flume Collect Hive Logs to HDFS
Use Flume exec source to real-time track Hive log files, buffer via memory channel, configure HDFS sink to write with time-based partitioning, implement automatic log data landing to HDFS.
Flume Dual Sink: Write Logs to Both HDFS and Local File
Through Flume replication mode (Replicating Channel Selector) and three-Agent cascade architecture, implement same log data written to both HDFS and local file, meeting both offline analysis and re...
Apache Flume Architecture and Core Concepts
Introduction to Apache Flume positioning, core components (Source, Channel, Sink), event model and common data flow topologies, and installation configuration methods.
Flume Hello World: NetCat Source + Memory Channel + Logge...
Through Flume's simplest Hello World case, use netcat source to monitor port, memory channel for buffering, logger sink for console output, demonstrating complete Source→Channel→Sink data flow.
Hive Metastore Three Modes and Remote Deployment
Detailed explanation of Hive Metastore's embedded, local, and remote deployment modes, and complete steps to configure high-availability remote Metastore on three-node cluster.
HiveServer2 Configuration and Beeline Remote Connection
Introduction to HiveServer2 architecture and role, configure Hadoop proxy user and WebHDFS, implement cross-node JDBC remote access to Hive via Beeline client.
Hive DDL and DML Operations
Systematic explanation of Hive DDL (database/table creation, internal and external tables) and DML (data loading, insertion, query) operations, with complete HiveQL examples and configuration optim...
Hive HQL Advanced: Data Import/Export and Query Practice
Deep dive into Hive's multiple data import methods (LOAD/INSERT/External Table/Sqoop), data export methods, and practical usage of HQL query operations like aggregation, filtering, and sorting.
Hive Introduction: Architecture and Cluster Installation
Introduction to Hive data warehouse core concepts, architecture components and pros/cons, with detailed steps to install and configure Hive 2.3.9 on three-node Hadoop cluster.