Tag: Data Engineering
57 articles
Hive Slowly Changing Dimension Type 2: Order History State Management
An offline data warehouse needs to store order history state at low cost while supporting daily rollback and change analysis.
Big Data 95 - Flink State and Checkpoint: State Management, Fault Tolerance and Savepoints
Flink stateful computation explanation: Keyed State, Operator State, Checkpoint configuration, Savepoint backup and recovery, production environment practices.
Big Data 93 - Flink Streaming Introduction: DataStream API and Program Structure
This is article 93 in the Big Data series, introducing Flink DataStream API core concepts and program structure.
Flink Window and Watermark: Time Windows, Tumbling/Sliding/Session
Comprehensive analysis of Flink Window mechanism: tumbling windows, sliding windows, session windows, Watermark principle and generation strategies, late data processing...
Big Data 91 - Flink Installation & Deployment: Local, Standalone and YARN Modes
Apache Flink is a distributed stream processing framework widely used for real-time data computing scenarios.
Flink on YARN Deployment: Environment Preparation, Resource Manager
Detailed explanation of three Flink deployment modes on YARN cluster: Session, Application, Per-Job modes, Hadoop dependency configuration, YARN resource application and...
Big Data 90 - Apache Flink Introduction: Unified Stream-Batch Real-Time Computing
Systematic introduction to Apache Flink's origin, core features, and architecture components: JobManager, TaskManager, Dispatcher responsibilities, unified stream-batch p...
Big Data 89 - Spark Streaming with Kafka: Receiver vs Direct Mode
This is article 89 in the Big Data series, deeply comparing two core modes of Spark Streaming integration with Kafka, focusing on Direct mode production practices.
Big Data 87 - Spark DStream Transformation Operators: map, reduceByKey and transform
Systematically review Spark Streaming DStream stateless transformation operators and transform advanced operations, demonstrating three implementation approaches for blac...
Spark Streaming Window Operations & State Tracking: updateStateByKey & mapWithState
In-depth explanation of Spark Streaming stateful computing: window operation parameter configuration, reduceByKeyAndWindow hot word statistics, updateStateByKey full-stat...
Spark Streaming Introduction: From DStream to Structured Streaming
This is article 85 in the Big Data series, introducing the architecture and evolution background of Spark's two generations of streaming frameworks.
Spark Streaming Data Sources: File Stream, Socket, RDD Queue
Comprehensive explanation of three Spark Streaming basic data sources: file stream directory monitoring, Socket TCP ingestion, RDD queue stream for testing simulation.
SparkSQL Statements: DataFrame Operations, SQL Queries &
Comprehensive guide to SparkSQL core usage including DataFrame API operations, SQL query syntax, lateral view explode, and Hive integration via enableHiveSupport for meta...
Big Data 84 - SparkSQL Internals: Five Join Strategies & Catalyst Optimizer
This is article 84 in the Big Data series, deeply analyzing SparkSQL kernel's Join strategy auto-selection logic and SQL parsing optimization flow.
SparkSQL Core Abstractions: RDD, DataFrame, Dataset & SparkSession
This is article 81 in the Big Data series, comprehensively introducing the features, use cases, and mutual conversions of Spark's three core data abstractions.
SparkSQL Operators: Transformation & Action Operations
This is article 82 in the Big Data series, systematically introducing SparkSQL Transformation and Action operators with complete test cases.
Spark Standalone Mode: Architecture & Performance Tuning
Comprehensive explanation of the four core components of a Spark Standalone cluster, application submission flow, SparkContext internal architecture, Shuffle evolution history and...
SparkSQL Introduction: SQL & Distributed Computing Fusion
Systematic introduction to SparkSQL evolution history, core abstractions DataFrame/Dataset, Catalyst optimizer principle, and practical usage of multi-data source integra...
Spark RDD Fault Tolerance: Checkpoint Principle & Best Practices
Detailed explanation of Spark Checkpoint execution flow, core differences with persist/cache, partitioner strategies, and best practices for iterative algorithms and long...
Spark Broadcast Variables: Efficient Shared Read-Only Data
Detailed explanation of Spark broadcast variable working principle, configuration parameters and best practices.
Spark Super WordCount: Text Cleaning & MySQL Persistence
This is article 75 in the Big Data series. It adds text preprocessing and database persistence on top of the basic WordCount to build a near-production word frequency pipeline.
Spark Scala WordCount Implementation
Implement distributed WordCount using Spark + Scala and Spark + Java, detailed RDD five-step processing flow, Maven project configuration and spark-submit command.
Spark Scala Practice: Pi Estimation & Mutual Friends
Deep dive into Spark RDD programming through two classic cases: Monte Carlo method distributed Pi estimation, and mutual friends analysis in social networks with two appr...
Spark Action Operations Overview
This is article 72 in the Big Data series, systematically reviewing Spark RDD Action operators.
Spark Cluster Architecture & Deployment Modes
This is article 71 in the Big Data series, introducing Spark cluster core architecture, deployment mode comparisons, and static/dynamic resource management strategies.
Spark RDD Deep Dive: Five Key Features
This is article 69 in the Big Data series, deeply analyzing RDD, Spark's core data abstraction, its five key features and design principles.
Spark RDD Creation & Transformation Operations
This is article 70 in the Big Data series, comprehensively explaining Spark RDD's three creation methods and practical usage of common Transformation operators.
From MapReduce to Spark: Big Data Computing Evolution
Systematic overview of big data processing engine evolution from MapReduce to Spark to Flink, analyzing Spark in-memory computing model, unified ecosystem and core compon...
Kafka Storage Mechanism: Log Segmentation & Retention
This is article 65 in the Big Data series, deeply analyzing Kafka's log storage mechanism.
Kafka Topic Management: Commands & Java API
Comprehensive introduction to Kafka Topic operations, including kafka-topics.sh commands, replica assignment strategy principles, and KafkaAdminClient Java API core usage.
Kafka Operations: Shell Commands & Java Client Examples
Covers Kafka daily operations: daemon startup, Shell topic management commands, and Java client programming (complete Producer/Consumer code) with key configuration param...
Spark Distributed Environment Setup
Step-by-step Apache Spark distributed computing environment setup, covering download and extract, environment variable configuration, slaves/spark-env.
Kafka Installation: From ZooKeeper to KRaft Evolution
Introduction to Kafka 2.x vs 3.x core differences, detailed cluster installation steps, ZooKeeper configuration, Broker parameter settings, and how KRaft mode replaces Zo...
Redis Memory Management: Key Expiration and Eviction Policies
Comprehensive analysis of Redis memory control mechanisms, including maxmemory configuration, three key expiration deletion strategies (lazy/active/scheduled).
Redis Persistence: RDB vs AOF Comparison and Production Settings
Systematic comparison of Redis's two persistence solutions, RDB snapshots and AOF logs: configuration methods, trigger mechanisms, pros and cons, AOF rewrite mechanism.
Big Data 46 - Redis RDB Persistence: Snapshot Principles, Configuration and Tradeoffs
In-depth analysis of Redis RDB persistence mechanism, covering trigger methods, BGSAVE execution flow, configuration parameters, file structure, and comparison with AOF.
Redis Slow Query Log and Performance Tuning in Production
Detailed explanation of Redis slow query log configuration parameters (slowlog-log-slower-than, slowlog-max-len), core commands.
Redis Advanced Data Types: Bitmap, Geo and Stream
Deep dive into Redis's three advanced data types: Bitmap, Geo (GeoHash, Z-order curve, Base32 encoding), and Stream message stream, with common commands and practical examp...
Redis Single Node and Cluster Installation
Install Redis 6.2.9 from source on Ubuntu, configure redis.conf for daemon mode, start redis-server and verify connection via redis-cli.
Big Data 40 - Redis Five Data Types: Command Reference and Practice
Comprehensive explanation of Redis's five data types: String, List, Set, Sorted Set, and Hash. Includes common commands, underlying characteristics, and typical usage scena...
Big Data 37 - HBase Java API: Complete CRUD Code with Table Creation
Uses the HBase Java Client API to implement table creation, insert, delete, Get query, full table scan, and range scan.
Big Data 33 - HBase Overall Architecture: HMaster, HRegionServer and Data Model
Comprehensive analysis of HBase distributed database overall architecture, including ZooKeeper coordination, HMaster management node, HRegionServer data node...
HBase Single Node Configuration: hbase-env and hbase-site.xml
Step-by-step configuration of an HBase single-node environment, explaining the key parameters in hbase-env.sh and hbase-site.xml and completing integration with Hadoop HDFS and the ZooKeeper cluster.
Sqoop Incremental Import and CDC Change Data Capture Principles
Introduces Sqoop's --incremental append import mechanism and explains in depth CDC (Change Data Capture) core concepts, capture method comparisons...
Big Data 23 - Sqoop Partial Import: --query, --columns and --where
Detailed explanation of three ways Sqoop imports partial data from MySQL to HDFS by condition: custom query, specify columns, WHERE condition filtering, with applicable s...
Sqoop and Hive Integration: MySQL ↔ Hive Bidirectional Data Transfer
Demonstrates Sqoop importing MySQL data directly to Hive table, and exporting Hive data back to MySQL, covering key parameters like --hive-import, --create-hive-table usa...
Sqoop Data Migration ETL Tool Introduction and Installation
Introduction to Apache Sqoop core principles, use cases, and installation configuration steps on Hadoop cluster, helping quickly get started with batch data migration bet...
Sqoop Practice: MySQL Full Data Import to HDFS
Complete example demonstrating Sqoop importing MySQL table data to HDFS, covering core parameter explanations, MapReduce parallel mechanism, and execution result verifica...
Flume Collect Hive Logs to HDFS
Use a Flume exec source to track Hive log files in real time, buffer them through a memory channel, configure an HDFS sink to write with time-based partitioning, and implement automatic log dat...
Flume Dual Sink: Write Logs to Both HDFS and Local File
This is article 20 in the Big Data series. Demonstrates Flume replication mode with a dual-Sink architecture, writing the same data to both HDFS and the local filesystem.
Apache Flume Architecture and Core Concepts
Introduction to Apache Flume positioning, core components (Source, Channel, Sink), event model and common data flow topologies, and installation configuration methods.
Flume Hello World: NetCat Source + Memory Channel + Logger Sink
Flume's simplest Hello World case: a netcat source listening on a port, a memory channel for buffering, and a logger sink for console output, demonstrating the complete Source→...
Hive Metastore Three Modes and Remote Deployment
Detailed explanation of Hive Metastore's embedded, local, and remote deployment modes, and complete steps to configure high-availability remote Metastore on three-node cl...
HiveServer2 Configuration and Beeline Remote Connection
Introduction to HiveServer2 architecture and role, configure Hadoop proxy user and WebHDFS, implement cross-node JDBC remote access to Hive via Beeline client.
Hive DDL and DML Operations
Systematic explanation of Hive DDL (database/table creation, internal and external tables) and DML (data loading, insertion, query) operations.
Hive HQL Advanced: Data Import/Export and Query Practice
Deep dive into Hive's multiple data import methods (LOAD/INSERT/External Table/Sqoop), data export methods, and practical usage of HQL query operations like aggregation...
Hive Introduction: Architecture and Cluster Installation
Introduction to Hive data warehouse core concepts, architecture components and pros/cons, with detailed steps to install and configure Hive 2.3.9 on three-node Hadoop clu...