Tag: Machine Learning

34 articles

AI Investigation #91: Multi-modal Data Annotation Tools - From Label Studio to 3D Point Cloud Labeling

In robot vision and perception model training, high-quality multi-modal data annotation tools are crucial.

Big Data 278 - Spark MLlib GBDT Case Study: Residuals, Regression Trees & Iterative Training

GBDT practical case study walking through the complete process from residual calculation to regression tree construction and iterative training.

Spark MLlib: Bagging vs Boosting Differences and GBDT Gradient Boosting

Introduces the differences between Bagging and Boosting in machine learning, and the GBDT (Gradient Boosting Decision Tree) algorithm principles.

Spark MLlib GBDT Algorithm: Gradient Boosting Principles and Applications

This article introduces the principles and applications of the gradient boosted decision tree (GBDT) algorithm.

Spark MLlib Ensemble Learning: Random Forest, Bagging and Boosting Methods

This article systematically introduces ensemble learning methods in machine learning.

Spark MLlib Decision Tree Pruning: Pre-pruning, Post-pruning Principles and Practice

This article systematically introduces decision tree pre-pruning and post-pruning principles and compares the core differences between three mainstream algorithms.

Spark MLlib Decision Tree: Classification Principles, Gini/Entropy and Practice

This article introduces the basic concepts and classification principles of decision trees.

Big Data 272 - Spark MLlib Logistic Regression: Basics, Input Function, Sigmoid & Loss

This article introduces the basic principles, application scenarios, and Spark MLlib implementation of logistic regression.

Big Data 271 - Spark MLlib Linear Regression: Scenarios, Loss Function & Optimization

Linear regression uses regression equations to model relationships between independent and dependent variables.

Big Data 271 - Spark MLlib Logistic Regression: Sigmoid, Loss Function & Diabetes Prediction Case

Despite having "regression" in its name, Logistic Regression is a classification algorithm in machine learning.
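
The reason it classifies rather than regresses is the sigmoid function, which squashes a linear score into a probability. A minimal sketch (illustrative only, not from the article):

```python
import numpy as np

def sigmoid(z):
    # Map any real-valued linear score to the (0, 1) interval.
    return 1.0 / (1.0 + np.exp(-z))

# A score of 0 sits exactly on the decision boundary (p = 0.5);
# large positive scores approach 1, large negative scores approach 0.
p = sigmoid(np.array([-5.0, 0.0, 5.0]))
```

Thresholding the resulting probability (commonly at 0.5) is what turns the "regression" output into a class label.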

Spark MLlib Linear Regression: Scenarios, Loss Function and Optimization

Linear Regression is an analytical method that uses regression equations to model the relationship between one or more independent variables and a dependent variable.

sklearn KMeans Key Attributes & Evaluation: cluster_centers_, inertia_, metrics

Scenario: using sklearn for KMeans clustering, explaining centroids/loss and using metrics to select K.
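
A minimal sketch of the attributes the title names, on assumed toy data (two well-separated blobs):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: two well-separated 2D blobs (assumed for illustration).
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
centers = km.cluster_centers_   # centroid coordinates, shape (n_clusters, n_features)
inertia = km.inertia_           # sum of squared distances to the nearest centroid (the "loss")
sil = silhouette_score(X, km.labels_)  # metrics-based quality score in [-1, 1]
```

`inertia_` always decreases as K grows, which is why a separate metric such as the silhouette score is needed to choose K.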

Big Data 216 - KMeans n_clusters Selection

KMeans n_clusters selection method: compute silhouette_score and silhouette_samples for candidate cluster numbers (e.g.
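
The selection loop can be sketched as follows; the candidate range and toy data (three blobs) are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: three well-separated 2D blobs (assumed).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 5, 10)])

# Score each candidate n_clusters by mean silhouette and keep the best.
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

With three clearly separated blobs the silhouette peaks at k=3; `silhouette_samples` can additionally reveal which individual points are poorly assigned.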

Big Data 213 - Python Hand-Written K-Means Clustering

Scenario: Hand-write K-Means using NumPy/Pandas, perform 3-class clustering on Iris.txt and output centroids with clustering results.
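
A minimal hand-written K-Means in the spirit of that scenario (Lloyd's algorithm on assumed toy data rather than Iris.txt):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest
    centroid, recompute centroids as cluster means, repeat."""
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Distance of every point to every centroid, shape (n, k).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Keep the old centroid if a cluster happens to be empty.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

# Toy data: two blobs around (0, 0) and (4, 4) (assumed).
rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(4, 0.3, (30, 2))])
centroids, labels = kmeans(X, k=2)
```

Swapping the toy data for the Iris features and k=3 reproduces the 3-class clustering setup the article describes.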

Big Data 214 - K-Means Clustering Practice: Self-Implemented Algorithm vs sklearn

K-Means clustering provides an engineering workflow that is 'verifiable, reproducible, and debuggable': first use 2D testSet dataset for algorithm verification.

Big Data 211 - Scikit-Learn Logistic Regression Implementation

When using Logistic Regression in Scikit-Learn, max_iter controls maximum iterations affecting model convergence speed and accuracy.

Big Data 212 - K-Means Clustering Guide

Introduces the K-Means clustering algorithm and compares supervised vs. unsupervised learning (whether labels Y are needed).

Big Data 209 - Deep Understanding of Logistic Regression

Logistic Regression (LR) is an important classification algorithm in machine learning.

Big Data 210 - How to Implement Logistic Regression in Scikit-Learn, with L1 and L2 Regularization Explained

As C increases, regularization strength decreases; model performance on the training and test sets trends upward until around C=0.8.
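
The inverse relationship between C and penalty strength can be verified directly; the dataset and C values here are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# In sklearn, C is the INVERSE of regularization strength:
# a small C means a strong L2 penalty that shrinks coefficients toward zero.
X, y = load_breast_cancer(return_X_y=True)

coef_norms = {}
for C in (0.01, 1.0):
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(C=C, penalty="l2", max_iter=1000))
    clf.fit(X, y)
    coef_norms[C] = np.abs(clf[-1].coef_).sum()  # total coefficient magnitude
```

The strongly regularized model (C=0.01) ends up with a much smaller coefficient norm, which is exactly why very small C values underfit.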

Big Data 207 - How to Handle Multicollinearity

When using scikit-learn for linear regression, how to handle multicollinearity in the least-squares method.

Big Data 208 - Ridge Regression and Lasso Regression

Ridge Regression and Lasso Regression are two commonly used linear regression regularization methods for addressing overfitting and multicollinearity in machine learning.

Big Data 205 - Linear Regression Machine Learning Perspective

Linear Regression core chain: unify the prediction function y=Xw in matrix form and treat the parameter vector w as the only unknown.

Big Data 206 - NumPy Matrix Multiplication Hand-written Multivariate Linear Regression

Hand-written multivariate linear regression using pandas DataFrames and NumPy matrix multiplication.
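
A minimal sketch of the closed-form least-squares solution via NumPy matrix multiplication (the normal equation w = (XᵀX)⁻¹Xᵀy); the toy data and true weights are assumptions:

```python
import numpy as np

# Toy data generated from known weights so the recovered w can be checked.
rng = np.random.RandomState(0)
X = np.hstack([np.ones((100, 1)), rng.rand(100, 2)])  # bias column + 2 features
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + rng.normal(0, 0.01, 100)             # small noise

# Normal equation, solved as a linear system (more stable than inverting X^T X).
w = np.linalg.solve(X.T @ X, X.T @ y)
```

On larger or collinear data an explicit inverse of XᵀX becomes unstable, which is where the Ridge/Lasso articles above pick up.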

Big Data 203 - sklearn Decision Tree Pruning Parameters

Common parameters for decision tree pruning (pre-pruning) in engineering: max_depth, min_samples_leaf, min_samples_split, max_features, min_impurity_decrease.
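
A short sketch of a subset of those parameters in action; the dataset and specific values are assumptions for illustration:

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# An unrestricted tree grows until leaves are pure; the pre-pruned tree
# caps complexity with max_depth and minimum sample thresholds.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(max_depth=3,
                                min_samples_leaf=5,
                                min_samples_split=10,
                                random_state=0).fit(X, y)
```

The pruned tree trades a little training accuracy for a simpler model that generalizes better.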

Big Data 204 - Confusion Matrix to ROC: Imbalanced Binary Classification Metrics in sklearn

Confusion matrix (TP, FP, FN, TN) with unified metrics: Accuracy, Precision, Recall (Sensitivity), F1 Measure, ROC curve, AUC value, and practical business interpretation...
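
The relationships between those metrics can be shown on a tiny hand-made example (the labels and scores below are assumed for illustration):

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 0, 1, 1, 1, 1, 0]                    # hard class predictions
y_score = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.4]    # probabilities for ROC/AUC

# sklearn's confusion matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall    = recall_score(y_true, y_pred)      # TP / (TP + FN), i.e. sensitivity
f1        = f1_score(y_true, y_pred)          # harmonic mean of the two
auc       = roc_auc_score(y_true, y_score)    # ranking quality of the scores
```

Note that ROC/AUC uses the continuous scores, not the thresholded predictions, which is why it is the preferred summary on imbalanced data.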

Big Data 201 - Decision Tree from Split to Pruning

Decision tree is a tree-structured supervised learning model, commonly used for classification and regression tasks.

Big Data 202 - sklearn Decision Tree Practice: criterion, Graphviz Visualization & Pruning

Complete flow of DecisionTreeClassifier on the load_wine dataset, from data splitting and model evaluation to decision tree visualization (2026 version).

Big Data 199 - Decision Tree Model Explained: Node Structure, Conditional Probability & Shannon Entropy

Tree model is a widely used algorithm type in supervised learning, can be applied to both classification and regression problems.

Big Data 200 - Decision Tree Information Gain Explained

Scenario: Use information entropy/information gain to explain why decision tree selects certain column for splitting, and use Python to reproduce "best split column".
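
The "best split column" idea can be reproduced in a few lines; the toy feature columns below are assumptions (one feature perfectly predicts the label, the other is uninformative):

```python
import numpy as np

def entropy(labels):
    # Shannon entropy: H = -sum(p * log2(p)) over class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    # Parent entropy minus the weighted entropy of each child split.
    gain = entropy(labels)
    for v, n in zip(*np.unique(feature, return_counts=True)):
        gain -= n / len(labels) * entropy(labels[feature == v])
    return gain

y       = np.array([1, 1, 0, 0])
outlook = np.array(["sunny", "sunny", "rain", "rain"])  # perfectly predictive
wind    = np.array(["weak", "strong", "weak", "strong"])  # uninformative
```

A decision tree picks the column with the highest gain, so here it would split on `outlook` (gain 1.0) rather than `wind` (gain 0.0).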

Big Data 197 - K-Fold Cross-Validation Practice

A random train/test split makes evaluation metrics unstable; this article gives the engineering solution: K-Fold Cross-Validation.
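
A minimal sketch of 5-fold cross-validation; the dataset and estimator are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# With cv=5 every sample appears in the test fold exactly once, so the
# score no longer depends on one lucky (or unlucky) random split.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
mean_acc = scores.mean()
```

Reporting the mean (and spread) of the fold scores is the stable replacement for a single train/test metric.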

Big Data 198 - KNN Must Normalize First: Min-Max Scaling, Data Leakage Pitfalls & sklearn Practice

In scikit-learn pipelines, distance-based models like KNN are highly sensitive to inconsistent feature scales. Split first, fit MinMaxScaler only on the training set...
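
The leak-free pattern described above can be sketched as follows (dataset and split parameters are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# The Pipeline fits MinMaxScaler on the TRAINING data only; the test set's
# min/max never leak into the scaling step.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
```

Calling `fit` on the whole dataset before splitting is the leakage pitfall the article warns about; the Pipeline makes the correct order automatic.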

Big Data 195 - KNN/K-Nearest Neighbors Algorithm Practice

KNN/K-Nearest Neighbors algorithm: from Euclidean distance calculation, distance sorting, and top-K voting to function encapsulation, giving reproducible Python code.
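
Those steps (distance, sort, top-K vote) fit in one small function; the toy training points are assumptions for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # 1) Euclidean distance from x to every training point,
    # 2) sort by distance, 3) majority vote among the k nearest labels.
    d = np.linalg.norm(X_train - x, axis=1)
    top_k = y_train[np.argsort(d)[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Two tight groups of labeled points (assumed toy data).
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3)
```

Because the whole decision rests on distances, KNN is the canonical example of why feature scaling (next article) matters.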

Big Data 196 - scikit-learn KNN Practice: KNeighborsClassifier, kneighbors & Learning Curves

Since its start in 2007 as a project by David Cournapeau, scikit-learn (sklearn) has become one of the most important machine learning libraries in the Python ecosystem.

Big Data 194 - Data Mining Overview: From Wine Classification to Supervised, Unsupervised & Reinforcement Learning

In a bar, there are ten almost identical glasses of wine on the counter. The owner proposes a game: win and you drink for free; lose and you pay triple for the wine.