Tag: Spark

40 articles

AI Investigation #51: Big Data Technology Evolution - Obsolete Frameworks, Architectures and the Reasons Behind Them

Big data technology evolution: MapReduce replaced by Spark, Storm replaced by Flink, Pig/Hive gradually phased out.

AI Investigation #50: Big Data Evolution - Two Decades of Architectural Change from Hadoop to Flink

Two decades of big data evolution: from 2006 MapReduce batch processing to 2013 Spark in-memory computing, to 2019 Flink real-time computing.

AI Research 49 - Big Data Survey Report: Development History from 1997 to 2025

Big data development began in 1997 when NASA proposed the concept, 2003-2006 Google published GFS, MapReduce, Bigtable three major papers leading distributed computing re...

Big Data 278 - Spark MLlib GBDT Case Study: Residuals, Regression Trees & Iterative Training

GBDT practical case study walking through the complete process from residual calculation to regression tree construction and iterative training.

Spark MLlib: Bagging vs Boosting Differences and GBDT Gradient Boosting

Introduces the differences between Bagging and Boosting in machine learning, and the GBDT (Gradient Boosting Decision Tree) algorithm principles.

Spark MLlib GBDT Algorithm: Gradient Boosting Principles and Applications

This article introduces the principles and applications of gradient boosting tree (GBDT) algorithm.

Spark MLlib Ensemble Learning: Random Forest, Bagging and Boosting Methods

This article systematically introduces ensemble learning methods in machine learning.

Spark MLlib Decision Tree Pruning: Pre-pruning, Post-pruning Principles and Practice

This article systematically introduces decision tree pre-pruning and post-pruning principles and compares the core differences between three mainstream algorithms.

Spark MLlib Decision Tree: Classification Principles, Gini/Entropy and Practice

This article introduces the basic concepts and classification principles of decision trees.

Big Data 272 - Spark MLlib Logistic Regression: Basics, Input Function, Sigmoid & Loss

This article introduces the basic principles, application scenarios, and Spark MLlib implementation of logistic regression.

Big Data 271 - Spark MLlib Linear Regression: Scenarios, Loss Function & Optimization

Linear regression uses regression equations to model relationships between independent and dependent variables.

Big Data 271 - Spark MLlib Logistic Regression: Sigmoid, Loss Function & Diabetes Prediction Case

Logistic Regression is a classification model in machine learning. Despite having "regression" in its name, it is a classification algorithm.
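The sigmoid-to-classification step that this summary refers to can be shown in a few lines of plain Python (an illustrative sketch, not the Spark MLlib API): the sigmoid squashes a linear score into a probability, the cross-entropy loss measures fit, and thresholding at 0.5 yields the class label.

```python
import math

# Illustrative sketch of logistic regression's core pieces:
# sigmoid maps a real-valued score to a probability in (0, 1),
# and log loss (cross-entropy) penalizes confident wrong answers.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Cross-entropy loss for a single label y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

p = sigmoid(2.0)             # linear score 2.0 -> probability ~0.88
label = 1 if p >= 0.5 else 0  # threshold at 0.5 gives the class
```

This is why the model counts as a classifier despite the "regression" in its name: the regression part only produces the score fed into the sigmoid.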

Spark MLlib Linear Regression: Scenarios, Loss Function and Optimization

Linear Regression is an analytical method that uses regression equations to model the relationship between one or more independent variables and a dependent variable.
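The loss-function-plus-optimization pairing named in these linear regression titles can be sketched in plain Python (illustrative only; Spark MLlib's `LinearRegression` optimizes the same objective distributively): gradient descent on mean squared error recovers the line's weight and intercept.

```python
# Illustrative sketch: fit y = w*x + b by gradient descent on the
# mean-squared-error loss. Learning rate and epoch count are chosen
# for this toy data, not tuned values from the article.

def fit_linear(xs, ys, lr=0.05, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2)
        err = [w * x + b - y for x, y in zip(xs, ys)]
        dw = 2.0 / n * sum(e * x for e, x in zip(err, xs))
        db = 2.0 / n * sum(err)
        w -= lr * dw
        b -= lr * db
    return w, b

w, b = fit_linear([1, 2, 3, 4], [3, 5, 7, 9])  # data from y = 2x + 1
```

For data generated by y = 2x + 1, the fit converges to w close to 2 and b close to 1, which is the "loss function and optimization" story in miniature.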

Big Data 89 - Spark Streaming with Kafka: Receiver vs Direct Mode

This is article 89 in the Big Data series, deeply comparing two core modes of Spark Streaming integration with Kafka, focusing on Direct mode production practices.

Big Data 87 - Spark DStream Transformation Operators: map, reduceByKey and transform

Systematically review Spark Streaming DStream stateless transformation operators and transform advanced operations, demonstrating three implementation approaches for blac...

Spark Streaming Window Operations & State Tracking: updateStateByKey & mapWithState

In-depth explanation of Spark Streaming stateful computing: window operation parameter configuration, reduceByKeyAndWindow hot word statistics, updateStateByKey full-stat...

Spark Streaming Introduction: From DStream to Structured Streaming

This is article 85 in the Big Data series, introducing the architecture and evolution background of Spark's two generations of streaming frameworks.

Spark Streaming Data Sources: File Stream, Socket, RDD Queue

Comprehensive explanation of three Spark Streaming basic data sources: file stream directory monitoring, Socket TCP ingestion, RDD queue stream for testing simulation.

SparkSQL Statements: DataFrame Operations, SQL Queries & Hive Integration

Comprehensive guide to SparkSQL core usage including DataFrame API operations, SQL query syntax, lateral view explode, and Hive integration via enableHiveSupport for meta...

Big Data 84 - SparkSQL Internals: Five Join Strategies & Catalyst Optimizer

This is article 84 in the Big Data series, deeply analyzing SparkSQL kernel's Join strategy auto-selection logic and SQL parsing optimization flow.

SparkSQL Core Abstractions: RDD, DataFrame, Dataset & SparkSession

This is article 81 in the Big Data series, comprehensively introducing the features, use cases, and mutual conversions of Spark's three core data abstractions.

SparkSQL Operators: Transformation & Action Operations

This is article 82 in the Big Data series, systematically introducing SparkSQL Transformation and Action operators with complete test cases.

Spark Standalone Mode: Architecture & Performance Tuning

Comprehensive explanation of the Spark Standalone cluster's four core components, application submission flow, SparkContext internal architecture, Shuffle evolution history and...

SparkSQL Introduction: SQL & Distributed Computing Fusion

Systematic introduction to SparkSQL evolution history, core abstractions DataFrame/Dataset, Catalyst optimizer principle, and practical usage of multi-data source integra...

Spark RDD Fault Tolerance: Checkpoint Principle & Best Practices

Detailed explanation of Spark Checkpoint execution flow, core differences with persist/cache, partitioner strategies, and best practices for iterative algorithms and long...

Spark Broadcast Variables: Efficient Shared Read-Only Data

Detailed explanation of Spark broadcast variable working principle, configuration parameters and best practices.

Spark Super WordCount: Text Cleaning & MySQL Persistence

This is article 75 in the Big Data series, adding text preprocessing and database persistence on top of the basic WordCount to build a near-production word-frequency pipeline.

Spark Serialization & RDD Execution Principle

This is article 76 in the Big Data series, systematically reviewing Spark process communication mechanism, serialization strategy and RDD execution principle.

Spark Scala WordCount Implementation

Implements distributed WordCount using Spark + Scala and Spark + Java, detailing the five-step RDD processing flow, Maven project configuration, and the spark-submit command.

Spark Scala Practice: Pi Estimation & Mutual Friends

Deep dive into Spark RDD programming through two classic cases: Monte Carlo method distributed Pi estimation, and mutual friends analysis in social networks with two appr...

Spark Action Operations Overview

This is article 72 in the Big Data series, systematically reviewing Spark RDD Action operators.

Spark Cluster Architecture & Deployment Modes

This is article 71 in the Big Data series, introducing Spark cluster core architecture, deployment mode comparisons, and static/dynamic resource management strategies.

Spark RDD Deep Dive: Five Key Features

This is article 69 in the Big Data series, deeply analyzing RDD, Spark's core data abstraction, covering its five key features and design principles.

Spark RDD Creation & Transformation Operations

This is article 70 in the Big Data series, comprehensively explaining Spark RDD's three creation methods and practical usage of common Transformation operators.

From MapReduce to Spark: Big Data Computing Evolution

Systematic overview of big data processing engine evolution from MapReduce to Spark to Flink, analyzing Spark in-memory computing model, unified ecosystem and core compon...

Spark Distributed Environment Setup

Step-by-step Apache Spark distributed computing environment setup, covering download and extraction, environment variable configuration, and the slaves/spark-env.sh files.

Spark Streaming Kafka Consumption: Offset Acquisition, Storage and Management

When Spark Streaming integrates with Kafka, Offset management is key to ensuring data processing continuity and consistency.

Big Data 104 - Spark Streaming with Kafka: Offset Management Mechanisms & Best Practices

Offset is used to mark message position in Kafka partition. Proper management can achieve at-least-once or even exactly-once data processing semantics.
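The at-least-once semantics mentioned here come down to ordering: commit the offset only after processing succeeds, so a crash between the two steps replays messages rather than losing them. A minimal plain-Python sketch of that pattern (an in-memory offset store stands in for Kafka, ZooKeeper, or an external database, which are what real deployments use):

```python
# Illustrative at-least-once consumption: process first, commit second.
# If the process crashes between steps 1 and 2, the uncommitted record
# is re-delivered on restart (duplicated, not lost).

def consume(messages, offsets, partition, process):
    start = offsets.get(partition, 0)   # resume from last committed offset
    for i in range(start, len(messages)):
        process(messages[i])            # 1. process the record
        offsets[partition] = i + 1      # 2. then commit its offset

processed = []
offsets = {}
consume(["m0", "m1", "m2"], offsets, 0, processed.append)
```

Reversing the two steps (commit first, then process) would instead give at-most-once behavior; exactly-once additionally requires storing the offset and the processing result atomically.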

Spark Streaming Stateful Transformations: Window Operations and State Tracking

Window operations integrate data from multiple batches over a longer time range by setting window length and slide duration.
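What a window computes can be shown in plain Python (illustrative only, not the DStream API): with a window length of 3 batches and a slide of 1 batch, each trigger aggregates word counts over the last 3 micro-batches.

```python
# Illustrative sliding-window aggregation over micro-batches:
# window_len batches are merged on every slide-th batch arrival,
# mirroring the window length / slide duration parameters above.

from collections import Counter, deque

def windowed_counts(batches, window_len=3, slide=1):
    """Return aggregated word counts for each window position."""
    buf = deque(maxlen=window_len)  # keeps only the last window_len batches
    results = []
    for i, batch in enumerate(batches):
        buf.append(Counter(batch))
        if (i + 1) % slide == 0:    # fire on each slide interval
            total = Counter()
            for c in buf:
                total += c
            results.append(dict(total))
    return results

batches = [["spark"], ["spark", "flink"], ["spark"], ["flink"]]
out = windowed_counts(batches)
```

Note how the oldest batch falls out of the window as new ones arrive; Spark's `reduceByKeyAndWindow` can do this incrementally with an inverse-reduce function instead of recomputing the whole window.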

Big Data 102 - Spark Streaming with Kafka: Receiver and Direct Approaches

This article introduces two Spark Streaming integration methods with Kafka: Receiver Approach and Direct Approach.