Tag: Machine Learning
34 articles
AI Investigation #91: Multi-modal Data Annotation Tools - From Label Studio to 3D Point Cloud Labeling
In robot vision and perception model training, high-quality multi-modal data annotation tools are crucial.
Big Data 278 - Spark MLlib GBDT Case Study: Residuals, Regression Trees & Iterative Training
GBDT practical case study walking through the complete process from residual calculation to regression tree construction and iterative training.
Spark MLlib: Bagging vs Boosting Differences and GBDT Gradient Boosting
Introduces the differences between Bagging and Boosting in machine learning, and the GBDT (Gradient Boosting Decision Tree) algorithm principles.
Spark MLlib GBDT Algorithm: Gradient Boosting Principles and Applications
This article introduces the principles and applications of the gradient boosting decision tree (GBDT) algorithm.
Spark MLlib Ensemble Learning: Random Forest, Bagging and Boosting Methods
This article systematically introduces ensemble learning methods in machine learning.
Spark MLlib Decision Tree Pruning: Pre-pruning, Post-pruning Principles and Practice
This article systematically introduces the principles of decision tree pre-pruning and post-pruning, and compares the core differences among three mainstream algorithms.
Spark MLlib Decision Tree: Classification Principles, Gini/Entropy and Practice
This article introduces the basic concepts and classification principles of decision trees.
Big Data 272 - Spark MLlib Logistic Regression: Basics, Input Function, Sigmoid & Loss
This article introduces the basic principles and application scenarios of logistic regression, and its implementation in Spark MLlib.
Big Data 271 - Spark MLlib Linear Regression: Scenarios, Loss Function & Optimization
Linear regression uses regression equations to model relationships between independent and dependent variables.
Big Data 271 - Spark MLlib Logistic Regression: Sigmoid, Loss Function & Diabetes Prediction Case
Logistic Regression is a classification model in machine learning. Despite having "regression" in its name, it is a classification algorithm.
Spark MLlib Linear Regression: Scenarios, Loss Function and Optimization
Linear Regression is an analytical method that uses regression equations to model the relationship between one or more independent variables and a dependent variable.
sklearn KMeans Key Attributes & Evaluation: cluster_centers_, inertia_, metrics
Scenario: When using sklearn for KMeans clustering, you want to explain the centroids and loss, and use metrics to select K.
Big Data 216 - KMeans n_clusters Selection
KMeans n_clusters selection method: calculate silhouette_score and silhouette_samples on candidate cluster numbers (e.g. ...
Big Data 213 - Python Hand-Written K-Means Clustering
Scenario: Hand-write K-Means using NumPy/Pandas, perform 3-class clustering on Iris.txt and output centroids with clustering results.
Big Data 214 - K-Means Clustering Practice: Self-Implemented Algorithm vs sklearn
K-Means clustering provides an engineering workflow that is 'verifiable, reproducible, and debuggable': first use the 2D testSet dataset for algorithm verification.
Big Data 211 - Scikit-Learn Logistic Regression Implementation
When using Logistic Regression in Scikit-Learn, max_iter controls the maximum number of iterations, which affects model convergence and accuracy.
Big Data 212 - K-Means Clustering Guide
K-Means clustering algorithm, comparing supervised vs unsupervised learning (whether labels Y are needed).
Big Data 209 - Deep Understanding of Logistic Regression
Logistic Regression (LR) is an important classification algorithm in machine learning.
Big Data 210 - How to Implement Logistic Regression in Scikit-Learn and Regularization in Detail (L1 and L2)
As C gradually increases, regularization strength decreases, and model performance on the training and test sets trends upward until around C=0.8.
Big Data 207 - How to Handle Multicollinearity
How to handle multicollinearity in the least squares method when using scikit-learn for linear regression.
Big Data 208 - Ridge Regression and Lasso Regression
Ridge Regression and Lasso Regression are two commonly used regularization methods for linear regression, addressing overfitting and multicollinearity in machine learning.
Big Data 205 - Linear Regression Machine Learning Perspective
Linear Regression core chain: unify the prediction function y=Xw in matrix form, and treat the parameter vector w as the only unknown
Big Data 206 - NumPy Matrix Multiplication Hand-written Multivariate Linear Regression
Hand-written multivariate linear regression using pandas DataFrame and NumPy matrix multiplication (linear regression implementation).
Big Data 203 - sklearn Decision Tree Pruning Parameters
Common parameters for decision tree pruning (pre-pruning) in engineering: max_depth, min_samples_leaf, min_samples_split, max_features, min_impurity_decrease.
Big Data 204 - Confusion Matrix to ROC: Imbalanced Binary Classification Metrics in sklearn
Confusion matrix (TP, FP, FN, TN) with unified metrics: Accuracy, Precision, Recall (Sensitivity), F1 Measure, ROC curve, AUC value, and practical business interpretation...
Big Data 201 - Decision Tree from Split to Pruning
A decision tree is a tree-structured supervised learning model, commonly used for classification and regression tasks.
Big Data 202 - sklearn Decision Tree Practice: criterion, Graphviz Visualization & Pruning
Complete flow of DecisionTreeClassifier on the load_wine dataset, from data splitting and model evaluation to decision tree visualization (2026 version).
Big Data 199 - Decision Tree Model Explained: Node Structure, Conditional Probability & Shannon Entropy
Tree models are a widely used class of supervised learning algorithms that can be applied to both classification and regression problems.
Big Data 200 - Decision Tree Information Gain Detailed
Scenario: Use information entropy and information gain to explain why a decision tree selects a certain column for splitting, and use Python to reproduce the "best split column".
Big Data 197 - K-Fold Cross-Validation Practice
A random train/test split makes evaluation metrics unstable; this article gives an engineering solution: K-Fold Cross-Validation.
Big Data 198 - KNN Must Normalize First: Min-Max Scaling, Data Leakage Pitfalls & sklearn Practice
In scikit-learn pipelines, distance-based models like KNN are highly sensitive to inconsistent feature scales. Split first, fit MinMaxScaler only on the training set...
Big Data 195 - KNN/K-Nearest Neighbors Algorithm Practice
KNN/K-Nearest Neighbors Algorithm: from Euclidean distance calculation and distance sorting to TopK voting and function encapsulation, with reproducible Python code.
Big Data 196 - scikit-learn KNN Practice: KNeighborsClassifier, kneighbors & Learning Curves
Since being initiated in 2007 by David Cournapeau, scikit-learn (sklearn) has become one of the most important machine learning libraries in the Python ecosystem.
Big Data 194 - Data Mining Overview: From Wine Classification to Supervised, Unsupervised & Reinforcement Learning
In a bar, there are ten almost identical glasses of wine on the counter. The boss proposes a game: win and you drink for free, lose and you pay three times the price of the wine.