Tag: Spark
40 articles
AI Investigation #51: Big Data Technology Evolution - Obs...
Big data technology evolution: MapReduce replaced by Spark, Storm replaced by Flink, Pig/Hive gradually phased out. This article analyzes why these technologies were eliminated and the technical re...
AI Investigation #50: Big Data Evolution - Two Decades of...
Two decades of big data evolution: from 2006 MapReduce batch processing to 2013 Spark in-memory computing, to 2019 Flink real-time computing. Architecture evolved from monolithic Hadoop to YARN mul...
AI Research 49 - Big Data Survey Report: Development Hist...
Big data development began in 1997, when NASA proposed the concept. From 2003 to 2006, Google published three major papers (GFS, MapReduce, and Bigtable) that led the distributed computing revolution. 2005 saw Hadoop b...
Spark MLlib GBDT Case Study: Residual Calculation to Regr...
GBDT practical case study walking through the complete process from residual calculation to regression tree construction and iterative training. Covers GBDT...
Spark MLlib: Bagging vs Boosting Differences and GBDT Gra...
Introduces the differences between Bagging and Boosting in machine learning, and the GBDT (Gradient Boosting Decision Tree) algorithm principles. Main content:...
Spark MLlib GBDT Algorithm: Gradient Boosting Principles,...
This article introduces the principles and applications of the gradient boosting decision tree (GBDT) algorithm. It first explains the basic concept of boosting trees through simple examples, then details the algorithm flow i...
Spark MLlib Ensemble Learning: Random Forest, Bagging and...
This article systematically introduces ensemble learning methods in machine learning. Main content includes: 1) Basic definition and classification of ensemble...
Spark MLlib Decision Tree Pruning: Pre-pruning, Post-prun...
This article systematically introduces decision tree pre-pruning and post-pruning principles, compares core differences between three mainstream algorithms...
Spark MLlib Decision Tree: Classification Principles, Gin...
This article introduces the basic concepts and classification principles of decision trees. A decision tree is a non-linear...
Spark MLlib Logistic Regression: Input Function, Sigmoid,...
This article introduces the basic principles and application scenarios of logistic regression, along with its implementation in Spark MLlib. Logistic regression is an efficient binary classification algorithm wi...
Spark MLlib Linear Regression: Scenarios, Loss Function a...
Linear regression uses regression equations to model relationships between independent and dependent variables. This article covers regression scenarios (house...
Spark MLlib Logistic Regression: Sigmoid, Loss Function a...
Logistic regression is a classification model in machine learning — an efficient binary classification algorithm widely used in ad click-through rate...
Spark MLlib Linear Regression: Scenarios, Loss Function a...
Linear Regression is an analytical method that uses regression equations to model the relationship between one or more independent variables and a dependent...
Spark Streaming Integration with Kafka: Receiver and Dire...
Detailed explanation of two Spark Streaming integration modes with Kafka: Receiver-based high-level API vs Direct mode architecture differences, offset management, Exactly-Once semantics guarantee,...
Spark DStream Transformation Operators: map, reduceByKey,...
Systematic review of Spark Streaming's stateless DStream transformation operators and the advanced transform operation, demonstrating three implementation approaches for blacklist filtering: leftOuterJ...
Spark Streaming Window Operations & State Tracking: updat...
In-depth explanation of Spark Streaming stateful computing: window operation parameter configuration, reduceByKeyAndWindow hot word statistics, updateStateByKey full-state maintenance and mapWithSt...
Spark Streaming Introduction: From DStream to Structured ...
Introduction to Spark's two generations of real-time computing frameworks: DStream micro-batch processing model's architecture and limitations, and how Structured Streaming solves EventTime process...
Spark Streaming Data Sources: File Stream, Socket, RDD Qu...
Comprehensive explanation of three Spark Streaming basic data sources: file stream directory monitoring, Socket TCP ingestion, RDD queue stream for testing simulation, with complete Scala code exam...
SparkSQL Statements: DataFrame Operations, SQL Queries & ...
Comprehensive guide to SparkSQL core usage including DataFrame API operations, SQL query syntax, lateral view explode, and Hive integration via enableHiveSupport for metadata and table operations.
SparkSQL Kernel: Five Join Strategies & Catalyst Optimize...
Deep dive into SparkSQL's five Join execution strategies (BHJ, SHJ, SMJ, Cartesian, BNLJ) selection conditions and use cases, along with the complete processing flow of Catalyst optimizer from SQL ...
SparkSQL Core Abstractions: RDD, DataFrame, Dataset & Spa...
In-depth comparison of the features and use cases of Spark's three data abstractions (RDD, DataFrame, Dataset), an introduction to SparkSession as the unified entry point, and a demonstration of mutual conversion methods between...
SparkSQL Operators: Transformation & Action Operations
Systematic review of SparkSQL Transformation and Action operators, covering select, filter, join, groupBy, and union operations, with practical test cases demonstrating usage and performance optimizat...
Spark Standalone Mode: Architecture & Performance Tuning
Comprehensive explanation of the four core components of a Spark Standalone cluster, the application submission flow, SparkContext internal architecture, the evolution of Shuffle, and RDD optimization strategies.
SparkSQL Introduction: SQL & Distributed Computing Fusion
Systematic introduction to SparkSQL evolution history, core abstractions DataFrame/Dataset, Catalyst optimizer principle, and practical usage of multi-data source integration with Hive/HDFS.
Spark RDD Fault Tolerance: Checkpoint Principle & Best Pr...
Detailed explanation of Spark Checkpoint execution flow, core differences with persist/cache, partitioner strategies, and best practices for iterative algorithms and long dependency chain scenarios.
Spark Broadcast Variables: Efficient Shared Read-Only Data
Detailed explanation of how Spark broadcast variables work, their configuration parameters and best practices, and a performance optimization that uses broadcast to implement MapSideJoin inste...
Spark Super WordCount: Text Cleaning & MySQL Persistence
Implements a complete, production-ready word frequency pipeline: lowercase conversion, punctuation removal, stop word filtering, and word frequency counting, with results written efficiently to MySQL via foreachP...
Spark Serialization & RDD Execution Principle
Deep dive into Spark Driver-Executor process communication, Java/Kryo serialization selection, closure serialization problem troubleshooting, and RDD dependencies, Stage division and persistence st...
Spark Scala WordCount Implementation
Implements distributed WordCount using Spark + Scala and Spark + Java, with a detailed walk-through of the five-step RDD processing flow, Maven project configuration, and the spark-submit command.
Spark Scala Practice: Pi Estimation & Mutual Friends
Deep dive into Spark RDD programming through two classic cases: Monte Carlo method distributed Pi estimation, and mutual friends analysis in social networks with two approaches, comparing Cartesian...
Spark Action Operations Overview
Comprehensive introduction to Spark RDD Action operations, covering data collection, statistical aggregation, element retrieval, storage output categories, and detailed explanation of Key-Value RDD...
Spark Cluster Architecture & Deployment Modes
Deep dive into Spark cluster core components Driver, Cluster Manager, Executor responsibilities, comparison of Standalone, YARN, Kubernetes deployment modes, and static vs dynamic resource allocati...
Spark RDD Deep Dive: Five Key Features
Comprehensive analysis of Spark core data abstraction RDD's five key features (partitions, compute function, dependencies, partitioner, preferred locations), lazy evaluation, fault tolerance, and n...
Spark RDD Creation & Transformation Operations
Detailed explanation of three RDD creation methods (parallelize, textFile, transform from existing RDD), and usage of common Transformation operators like map, filter, flatMap, groupBy, sortBy with...
From MapReduce to Spark: Big Data Computing Evolution
Systematic overview of big data processing engine evolution from MapReduce to Spark to Flink, analyzing Spark in-memory computing model, unified ecosystem and core components.
Spark Distributed Environment Setup
Step-by-step Apache Spark distributed computing environment setup, covering download and extract, environment variable configuration, slaves/spark-env.sh core config adjustments, and complete multi...
Spark Streaming Kafka Consumption: Offset Acquisition, St...
When Spark Streaming integrates with Kafka, offset management is key to ensuring continuity and consistency of data processing. An offset marks a message's position in...
Spark Streaming Integration with Kafka: Offset Management...
An offset marks a message's position within a Kafka partition. Managing offsets properly makes at-least-once or even exactly-once processing semantics achievable. By persisting offsets, an application can resume ...
Spark Streaming Stateful Transformations: Window Operatio...
Window operations integrate data from multiple batches over a longer time range by setting window length and slide duration. Cases demonstrate reduceByWindow...
Spark Streaming Integration with Kafka: Receiver and Dire...
This article introduces two Spark Streaming integration methods with Kafka: Receiver Approach and Direct Approach. Receiver uses Executor-based Receiver to...