Tag: Machine Learning

34 articles

AI Investigation #91: Multi-modal Data Annotation Tools - From Label Studio to 3D Point Cloud Labeling

In robot vision and perception model training, high-quality multi-modal data annotation tools are crucial.

Big Data 278 - Spark MLlib GBDT Case Study: Residuals, Regression Trees & Iterative Training

GBDT practical case study walking through the complete process from residual calculation to regression tree construction and iterative training.

Spark MLlib: Bagging vs Boosting Differences and GBDT Gradient Boosting

Introduces the differences between Bagging and Boosting in machine learning, and the GBDT (Gradient Boosting Decision Tree) algorithm principles.

Spark MLlib GBDT Algorithm: Gradient Boosting Principles and Applications

This article introduces the principles and applications of the gradient boosted decision tree (GBDT) algorithm.

Spark MLlib Ensemble Learning: Random Forest, Bagging and Boosting Methods

This article systematically introduces ensemble learning methods in machine learning.

Spark MLlib Decision Tree Pruning: Pre-pruning, Post-pruning Principles and Practice

This article systematically introduces decision tree pre-pruning and post-pruning principles and compares the core differences between three mainstream algorithms.

Spark MLlib Decision Tree: Classification Principles, Gini/Entropy and Practice

This article introduces the basic concepts and classification principles of decision trees.

Big Data 272 - Spark MLlib Logistic Regression: Basics, Input Function, Sigmoid & Loss

This article introduces the basic principles, application scenarios, and Spark MLlib implementation of logistic regression.

Big Data 271 - Spark MLlib Linear Regression: Scenarios, Loss Function & Optimization

Linear regression uses regression equations to model relationships between independent and dependent variables.

Big Data 271 - Spark MLlib Logistic Regression: Sigmoid, Loss Function & Diabetes Prediction Case

Despite having "regression" in its name, Logistic Regression is a classification algorithm in machine learning.
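
The reason it classifies rather than regresses is the sigmoid function, which squashes a linear score into a probability. A minimal sketch (illustrative only, not from the article):

```python
import numpy as np

def sigmoid(z):
    # Map any real-valued linear score to the (0, 1) interval.
    return 1.0 / (1.0 + np.exp(-z))

# A score of 0 sits exactly on the decision boundary (p = 0.5);
# large positive scores approach 1, large negative scores approach 0.
p = sigmoid(np.array([-5.0, 0.0, 5.0]))
```

Thresholding the resulting probability (commonly at 0.5) is what turns the "regression" output into a class label.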

Spark MLlib Linear Regression: Scenarios, Loss Function and Optimization

Linear Regression is an analytical method that uses regression equations to model the relationship between one or more independent variables and a dependent variable.

sklearn KMeans Key Attributes & Evaluation: cluster_centers_, inertia_, metrics

Scenario: using sklearn for KMeans clustering, explaining centroids/loss and using metrics to select K.
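
A minimal sketch of the attributes the title names, on assumed toy data (two well-separated blobs):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: two well-separated 2D blobs (assumed for illustration).
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
centers = km.cluster_centers_   # centroid coordinates, shape (n_clusters, n_features)
inertia = km.inertia_           # sum of squared distances to the nearest centroid (the "loss")
sil = silhouette_score(X, km.labels_)  # metrics-based quality score in [-1, 1]
```

`inertia_` always decreases as K grows, which is why a separate metric such as the silhouette score is needed to choose K.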

Big Data 216 - KMeans n_clusters Selection

KMeans n_clusters selection method: compute silhouette_score and silhouette_samples for candidate cluster numbers (e.g.
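
The selection loop can be sketched as follows; the candidate range and toy data (three blobs) are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: three well-separated 2D blobs (assumed).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 5, 10)])

# Score each candidate n_clusters by mean silhouette and keep the best.
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

With three clearly separated blobs the silhouette peaks at k=3; `silhouette_samples` can additionally reveal which individual points are poorly assigned.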

Big Data 213 - Python Hand-Written K-Means Clustering

Scenario: Hand-write K-Means using NumPy/Pandas, perform 3-class clustering on Iris.txt and output centroids with clustering results.
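
A minimal hand-written K-Means in the spirit of that scenario (Lloyd's algorithm on assumed toy data rather than Iris.txt):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest
    centroid, recompute centroids as cluster means, repeat."""
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Distance of every point to every centroid, shape (n, k).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Keep the old centroid if a cluster happens to be empty.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

# Toy data: two blobs around (0, 0) and (4, 4) (assumed).
rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(4, 0.3, (30, 2))])
centroids, labels = kmeans(X, k=2)
```

Swapping the toy data for the Iris features and k=3 reproduces the 3-class clustering setup the article describes.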

Big Data 214 - K-Means Clustering Practice: Self-Implemented Algorithm vs sklearn

K-Means clustering provides an engineering workflow that is 'verifiable, reproducible, and debuggable': first use 2D testSet dataset for algorithm verification.

Big Data 211 - Scikit-Learn Logistic Regression Implementation

When using Logistic Regression in Scikit-Learn, max_iter controls maximum iterations affecting model convergence speed and accuracy.

Big Data 212 - K-Means Clustering Guide

Introduces the K-Means clustering algorithm and compares supervised vs. unsupervised learning (whether labels Y are needed).

Big Data 209 - Deep Understanding of Logistic Regression

Logistic Regression (LR) is an important classification algorithm in machine learning.

Big Data 210 - How to Implement Logistic Regression in Scikit-Learn, with L1 and L2 Regularization Explained

As C increases, regularization strength decreases; model performance on the training and test sets trends upward until around C=0.8.
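
The inverse relationship between C and penalty strength can be verified directly; the dataset and C values here are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# In sklearn, C is the INVERSE of regularization strength:
# a small C means a strong L2 penalty that shrinks coefficients toward zero.
X, y = load_breast_cancer(return_X_y=True)

coef_norms = {}
for C in (0.01, 1.0):
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(C=C, penalty="l2", max_iter=1000))
    clf.fit(X, y)
    coef_norms[C] = np.abs(clf[-1].coef_).sum()  # total coefficient magnitude
```

The strongly regularized model (C=0.01) ends up with a much smaller coefficient norm, which is exactly why very small C values underfit.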

Big Data 207 - How to Handle Multicollinearity

When using scikit-learn for linear regression, how to handle multicollinearity in the least-squares method.

Big Data 208 - Ridge Regression and Lasso Regression

Ridge Regression and Lasso Regression are two commonly used linear regression regularization methods for addressing overfitting and multicollinearity in machine learning.

Big Data 205 - Linear Regression Machine Learning Perspective

Linear Regression core chain: unify the prediction function y=Xw in matrix form and treat the parameter vector w as the only unknown.

Big Data 206 - NumPy Matrix Multiplication Hand-written Multivariate Linear Regression

Hand-written multivariate linear regression using pandas DataFrames and NumPy matrix multiplication.
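
A minimal sketch of the closed-form least-squares solution via NumPy matrix multiplication (the normal equation w = (XᵀX)⁻¹Xᵀy); the toy data and true weights are assumptions:

```python
import numpy as np

# Toy data generated from known weights so the recovered w can be checked.
rng = np.random.RandomState(0)
X = np.hstack([np.ones((100, 1)), rng.rand(100, 2)])  # bias column + 2 features
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + rng.normal(0, 0.01, 100)             # small noise

# Normal equation, solved as a linear system (more stable than inverting X^T X).
w = np.linalg.solve(X.T @ X, X.T @ y)
```

On larger or collinear data an explicit inverse of XᵀX becomes unstable, which is where the Ridge/Lasso articles above pick up.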

Big Data 203 - sklearn Decision Tree Pruning Parameters

Common parameters for decision tree pruning (pre-pruning) in engineering: max_depth, min_samples_leaf, min_samples_split, max_features, min_impurity_decrease.
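
A short sketch of a subset of those parameters in action; the dataset and specific values are assumptions for illustration:

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# An unrestricted tree grows until leaves are pure; the pre-pruned tree
# caps complexity with max_depth and minimum sample thresholds.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(max_depth=3,
                                min_samples_leaf=5,
                                min_samples_split=10,
                                random_state=0).fit(X, y)
```

The pruned tree trades a little training accuracy for a simpler model that generalizes better.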

Big Data 204 - Confusion Matrix to ROC: Imbalanced Binary Classification Metrics in sklearn

Confusion matrix (TP, FP, FN, TN) with unified metrics: Accuracy, Precision, Recall (Sensitivity), F1 Measure, ROC curve, AUC value, and practical business interpretation...
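
The relationships between those metrics can be shown on a tiny hand-made example (the labels and scores below are assumed for illustration):

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 0, 1, 1, 1, 1, 0]                    # hard class predictions
y_score = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.4]    # probabilities for ROC/AUC

# sklearn's confusion matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall    = recall_score(y_true, y_pred)      # TP / (TP + FN), i.e. sensitivity
f1        = f1_score(y_true, y_pred)          # harmonic mean of the two
auc       = roc_auc_score(y_true, y_score)    # ranking quality of the scores
```

Note that ROC/AUC uses the continuous scores, not the thresholded predictions, which is why it is the preferred summary on imbalanced data.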

Big Data 201 - Decision Tree from Split to Pruning

Decision tree is a tree-structured supervised learning model, commonly used for classification and regression tasks.

Big Data 202 - sklearn Decision Tree Practice: criterion, Graphviz Visualization & Pruning

Complete flow of DecisionTreeClassifier on the load_wine dataset, from data splitting and model evaluation to decision tree visualization (2026 version).

Big Data 199 - Decision Tree Model Explained: Node Structure, Conditional Probability & Shannon Entropy

Tree model is a widely used algorithm type in supervised learning, can be applied to both classification and regression problems.

Big Data 200 - Decision Tree Information Gain Explained

Scenario: Use information entropy/information gain to explain why decision tree selects certain column for splitting, and use Python to reproduce "best split column".
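
The "best split column" idea can be reproduced in a few lines; the toy feature columns below are assumptions (one feature perfectly predicts the label, the other is uninformative):

```python
import numpy as np

def entropy(labels):
    # Shannon entropy: H = -sum(p * log2(p)) over class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    # Parent entropy minus the weighted entropy of each child split.
    gain = entropy(labels)
    for v, n in zip(*np.unique(feature, return_counts=True)):
        gain -= n / len(labels) * entropy(labels[feature == v])
    return gain

y       = np.array([1, 1, 0, 0])
outlook = np.array(["sunny", "sunny", "rain", "rain"])  # perfectly predictive
wind    = np.array(["weak", "strong", "weak", "strong"])  # uninformative
```

A decision tree picks the column with the highest gain, so here it would split on `outlook` (gain 1.0) rather than `wind` (gain 0.0).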

Big Data 197 - K-Fold Cross-Validation Practice

A random train/test split makes evaluation metrics unstable; this article gives the engineering solution: K-Fold Cross-Validation.
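
A minimal sketch of 5-fold cross-validation; the dataset and estimator are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# With cv=5 every sample appears in the test fold exactly once, so the
# score no longer depends on one lucky (or unlucky) random split.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
mean_acc = scores.mean()
```

Reporting the mean (and spread) of the fold scores is the stable replacement for a single train/test metric.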

Big Data 198 - KNN Must Normalize First: Min-Max Scaling, Data Leakage Pitfalls & sklearn Practice

In scikit-learn pipelines, distance-based models like KNN are highly sensitive to inconsistent feature scales. Split first, fit MinMaxScaler only on the training set...
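
The leak-free pattern described above can be sketched as follows (dataset and split parameters are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# The Pipeline fits MinMaxScaler on the TRAINING data only; the test set's
# min/max never leak into the scaling step.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
```

Calling `fit` on the whole dataset before splitting is the leakage pitfall the article warns about; the Pipeline makes the correct order automatic.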

Big Data 195 - KNN/K-Nearest Neighbors Algorithm Practice

KNN/K-Nearest Neighbors algorithm: from Euclidean distance calculation, distance sorting, and top-K voting to function encapsulation, giving reproducible Python code.
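
Those steps (distance, sort, top-K vote) fit in one small function; the toy training points are assumptions for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # 1) Euclidean distance from x to every training point,
    # 2) sort by distance, 3) majority vote among the k nearest labels.
    d = np.linalg.norm(X_train - x, axis=1)
    top_k = y_train[np.argsort(d)[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Two tight groups of labeled points (assumed toy data).
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3)
```

Because the whole decision rests on distances, KNN is the canonical example of why feature scaling (next article) matters.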

Big Data 196 - scikit-learn KNN Practice: KNeighborsClassifier, kneighbors & Learning Curves

Since its start in 2007 as a project by David Cournapeau, scikit-learn (sklearn) has become one of the most important machine learning libraries in the Python ecosystem.

Big Data 194 - Data Mining Overview: From Wine Classification to Supervised, Unsupervised & Reinforcement Learning

In a bar, there are ten almost identical glasses of wine on the counter. The owner proposes a game: win and you drink for free; lose and you pay triple for the wine.