Tag: Scala
25 articles
Spark MLlib GBDT Algorithm: Gradient Boosting Principles and Applications
This article introduces the principles and applications of the gradient boosting decision tree (GBDT) algorithm.
Spark MLlib Ensemble Learning: Random Forest, Bagging and Boosting Methods
This article systematically introduces ensemble learning methods in machine learning.
Spark MLlib Decision Tree Pruning: Pre-pruning, Post-pruning, Principles and Practice
This article systematically introduces the principles of decision tree pre-pruning and post-pruning, and compares the core differences between three mainstream algorithms.
Spark MLlib Decision Tree: Classification Principles, Gini/Entropy and Practice
This article introduces the basic concepts and classification principles of decision trees.
Big Data 272 - Spark MLlib Logistic Regression: Basics, Input Function, Sigmoid & Loss
This article introduces the basic principles and application scenarios of logistic regression, and its implementation in Spark MLlib.
Big Data 271 - Spark MLlib Linear Regression: Scenarios, Loss Function & Optimization
Linear regression uses regression equations to model relationships between independent and dependent variables.
Big Data 271 - Spark MLlib Logistic Regression: Sigmoid, Loss Function & Diabetes Prediction Case
Logistic Regression is a classification model in machine learning. Despite having "regression" in its name, it is a classification algorithm.
Spark MLlib Linear Regression: Scenarios, Loss Function and Optimization
Linear Regression is an analytical method that uses regression equations to model the relationship between one or more independent variables and a dependent variable.
Big Data 89 - Spark Streaming with Kafka: Receiver vs Direct Mode
This is article 89 in the Big Data series, an in-depth comparison of the two core modes of Spark Streaming integration with Kafka, with a focus on Direct mode production practices.
Big Data 87 - Spark DStream Transformation Operators: map, reduceByKey and transform
Systematically review Spark Streaming DStream stateless transformation operators and transform advanced operations, demonstrating three implementation approaches for blac...
Spark Streaming Window Operations & State Tracking: updateStateByKey & mapWithState
In-depth explanation of Spark Streaming stateful computing: window operation parameter configuration, reduceByKeyAndWindow hot word statistics, updateStateByKey full-stat...
Spark Streaming Introduction: From DStream to Structured Streaming
This is article 85 in the Big Data series, introducing the architecture and evolution background of Spark's two generations of streaming frameworks.
Spark Streaming Data Sources: File Stream, Socket, RDD Queue
Comprehensive explanation of three Spark Streaming basic data sources: file stream directory monitoring, Socket TCP ingestion, RDD queue stream for testing simulation.
SparkSQL Statements: DataFrame Operations, SQL Queries &
Comprehensive guide to SparkSQL core usage including DataFrame API operations, SQL query syntax, lateral view explode, and Hive integration via enableHiveSupport for meta...
Big Data 84 - SparkSQL Internals: Five Join Strategies & Catalyst Optimizer
This is article 84 in the Big Data series, deeply analyzing SparkSQL kernel's Join strategy auto-selection logic and SQL parsing optimization flow.
SparkSQL Core Abstractions: RDD, DataFrame, Dataset & SparkSession
This is article 81 in the Big Data series, comprehensively introducing the features, use cases and mutual conversions of Spark's three core data abstractions.
SparkSQL Operators: Transformation & Action Operations
This is article 82 in the Big Data series, systematically introducing SparkSQL Transformation and Action operators with complete test cases.
SparkSQL Introduction: SQL & Distributed Computing Fusion
Systematic introduction to SparkSQL evolution history, core abstractions DataFrame/Dataset, Catalyst optimizer principle, and practical usage of multi-data source integra...
Spark RDD Fault Tolerance: Checkpoint Principle & Best Practices
Detailed explanation of Spark Checkpoint execution flow, core differences with persist/cache, partitioner strategies, and best practices for iterative algorithms and long...
Spark Broadcast Variables: Efficient Shared Read-Only Data
Detailed explanation of Spark broadcast variable working principle, configuration parameters and best practices.
Spark Super WordCount: Text Cleaning & MySQL Persistence
This is article 75 in the Big Data series. It adds text preprocessing and database persistence on top of the basic WordCount to build a near-production word frequency pipeline.
Spark Serialization & RDD Execution Principle
This is article 76 in the Big Data series, systematically reviewing Spark process communication mechanism, serialization strategy and RDD execution principle.
Spark Scala WordCount Implementation
Implement distributed WordCount using Spark + Scala and Spark + Java, covering the five-step RDD processing flow, Maven project configuration and the spark-submit command.
Spark Scala Practice: Pi Estimation & Mutual Friends
Deep dive into Spark RDD programming through two classic cases: Monte Carlo method distributed Pi estimation, and mutual friends analysis in social networks with two appr...
Spark Action Operations Overview
This is article 72 in the Big Data series, systematically reviewing Spark RDD Action operators.