Blog
Technical exploration and thoughts · 655 articles
How to Handle Multicollinearity: Common Problems & Soluti...
When using scikit-learn for linear regression, how to handle multicollinearity in least squares method. Multicollinearity may cause instability in regression...
Ridge Regression and Lasso Regression: Differences, Appli...
Ridge Regression and Lasso Regression are two commonly used linear regression regularization methods for solving overfitting and multicollinearity in machine...
Linear Regression Machine Learning Perspective: Matrix Re...
Linear Regression core chain: unify prediction function y=Xw in matrix form, treat parameter vector w as only unknown; use loss function to characterize...
NumPy Matrix Multiplication Hand-written Multivariate Lin...
pandas DataFrame and NumPy matrix multiplication hand-written multivariate linear regression (linear regression implementation). Core idea is to form normal...
sklearn Decision Tree Pruning Parameters: max_depth/min_s...
Common parameters for decision tree pruning (pre-pruning) in engineering: max_depth, min_samples_leaf, min_samples_split, max_features, min_impurity_decrease...
Confusion Matrix to ROC: Complete Review of Imbalanced Bi...
Confusion matrix (TP, FP, FN, TN) with unified metrics: Accuracy, Precision, Recall (Sensitivity), F1 Measure, ROC curve, AUC value, and practical business interpretation for classification models.
Spark Standalone Mode: Architecture & Performance Tuning
Comprehensive explanation of Spark Standalone cluster four core components, application submission flow, SparkContext internal architecture, Shuffle evolution history and RDD optimization strategies.
SparkSQL Introduction: SQL & Distributed Computing Fusion
Systematic introduction to SparkSQL evolution history, core abstractions DataFrame/Dataset, Catalyst optimizer principle, and practical usage of multi-data source integration with Hive/HDFS.
Decision Tree from Split to Pruning: Information Gain/Gai...
Complete chain from 'split' to 'pruning', explain why usually uses greedy algorithm to form 'local optimum', and differences in splitting criteria between...
sklearn Decision Tree Practice: criterion, Graphviz Visua...
Complete flow of DecisionTreeClassifier on load_wine dataset from data splitting, model evaluation to decision tree visualization (2026 version). Focus on...
Decision Tree Model Detailed: Node Structure, Conditional...
Decision Tree model systematic overview for classification tasks: three types of nodes (root/internal/leaf), recursive split flow from root to leaf, and...
Decision Tree Information Gain Detailed: Information Entr...
Decision tree information gain (Information Gain) explained, first using information entropy (Entropy) to explain impurity, then explaining why when splitting...
K-Fold Cross-Validation Practice: sklearn Look at Mean/Va...
Random train/test split causes evaluation metrics to be unstable, and gives engineering solution: K-Fold Cross Validation. Through sklearn's cross_val_score to...
KNN Must Normalize First: Min-Max Correct Method, Data Le...
In scikit-learn machine learning training pipeline, distance-based models like KNN are extremely sensitive to inconsistent feature scales: Euclidean distance...
Spark RDD Fault Tolerance: Checkpoint Principle & Best Pr...
Detailed explanation of Spark Checkpoint execution flow, core differences with persist/cache, partitioner strategies, and best practices for iterative algorithms and long dependency chain scenarios.
Spark Broadcast Variables: Efficient Shared Read-Only Data
Detailed explanation of Spark broadcast variable working principle, configuration parameters and best practices, and performance optimization solution using broadcast to implement MapSideJoin inste...
KNN/K-Nearest Neighbors Algorithm Practice: Euclidean Dis...
KNN/K-Nearest Neighbors Algorithm: From Euclidean distance calculation, distance sorting, TopK voting to function encapsulation, giving reproducible Python...
scikit-learn KNN Practice: KNeighborsClassifier, kneighbo...
From unified API (fit/predict/transform/score) to kneighbors to find K nearest neighbors of test samples, then using learning curve/parameter curve to select...
Apache Tez Practice: Hive on Tez Installation & Configura...
Apache Tez (example version Tez 0.9.x) as execution engine alternative to MapReduce on Hadoop2/YARN, providing DAG (Directed Acyclic Graph) execution model for...
Data Mining: From Wine Classification to Machine Learning...
2025's most commonly used machine learning concept framework: supervised learning (classification/regression), unsupervised learning (clustering/dimensionality...