Big Data 194 - Data Mining: Machine Learning Overview
A Simple Example
In a bar, ten almost identical glasses of wine sit on the counter. The boss proposes a game: guess correctly and you drink for free; guess wrong and you pay three times the price of the wine. Each of the ten glasses differs slightly; the first five are [Cabernet Sauvignon] and the last five are [Pinot Noir]. Now the boss pours one more glass, and you must correctly identify which category it belongs to.
Algorithm System
Machine Learning (ML) is a branch of Artificial Intelligence (AI) that enables computer systems to learn and make decisions, much as humans do, without explicit programming instructions. The core of machine learning is extracting patterns from data and using those patterns to predict or classify new data.
Machine learning methods are algorithms that generate models from data, also called learning algorithms. They include:
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
Supervised Learning
Supervised learning models the relationship between features and labels (types) in the data. Once the model is determined, it can be applied to new, unseen data. Supervised tasks divide into classification and regression:
- In classification tasks, labels are discrete values
- In regression tasks, labels are continuous values
Supervised learning algorithms rely on labeled datasets during training: each sample in the dataset has a corresponding correct output, and the algorithm learns to predict outputs from inputs through these “input-output” pairs.
Applications: Classification problems (spam detection), Regression problems (house price prediction)
Common Algorithms: Linear Regression, Decision Tree, Random Forest, Support Vector Machine (SVM), Neural Networks
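The idea of learning from labeled “input-output” pairs can be sketched with a tiny nearest-neighbour classifier, echoing the wine example from the opening. All feature values below are invented for illustration:

```python
import numpy as np

# Toy labeled dataset: 10 "glasses of wine" described by two features
# (alcohol %, colour depth). Labels: 0 = Cabernet Sauvignon, 1 = Pinot Noir.
X = np.array([[13.5, 0.90], [13.8, 0.85], [14.0, 0.95], [13.6, 0.90], [13.9, 0.88],
              [12.5, 0.40], [12.8, 0.45], [12.6, 0.50], [12.9, 0.42], [12.7, 0.48]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

def predict_1nn(x_new):
    """Classify a new sample by the label of its nearest training sample."""
    dists = np.linalg.norm(X - x_new, axis=1)
    return y[np.argmin(dists)]

# The extra glass: its features place it near the Pinot Noir group.
label = predict_1nn(np.array([12.6, 0.44]))
```

The entire “model” here is the labeled dataset itself; real algorithms such as logistic regression or decision trees instead compress those pairs into learned parameters.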
Unsupervised Learning
Unsupervised learning is an important learning paradigm in machine learning, with the core characteristic of not relying on manually labeled data. Unlike supervised learning, unsupervised learning lets algorithms autonomously explore the data’s intrinsic structure and patterns, a typical “data-driven” learning approach.
Main Task Types
Clustering
- Goal: Automatically group similar data points
- Typical algorithms: K-means clustering, Hierarchical clustering, DBSCAN
- Application scenarios: Customer segmentation, Document topic classification, Anomaly detection
Dimensionality Reduction
- Goal: Reduce feature count while preserving important information
- Typical methods: PCA, t-SNE, Autoencoder
- Application scenarios: High-dimensional data visualization, Feature engineering preprocessing, Image compression
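The dimensionality-reduction idea can be sketched with a from-scratch PCA on synthetic data: 3-D points whose third coordinate is nearly determined by the other two, so two components capture almost all the variance. The data-generating numbers are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 samples in 3-D, but the data really varies along only ~2 directions:
# the third coordinate is almost a linear combination of the first two.
base = rng.normal(size=(100, 2))
third = base @ np.array([0.5, -0.3]) + 0.01 * rng.normal(size=100)
X = np.column_stack([base, third])

# PCA via SVD of the centred data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S**2 / np.sum(S**2)   # fraction of variance per component
X_2d = Xc @ Vt[:2].T              # project onto the top 2 components
```

Here the top two components explain essentially all of the variance, so the projection to 2-D loses almost no information; libraries such as scikit-learn wrap the same computation in `PCA`.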
Semi-supervised Learning
Semi-supervised learning methods sit between supervised and unsupervised learning; they can be used when only part of the data is labeled.
Reinforcement Learning
Reinforcement learning is a machine learning method based on trial and error, fundamentally different from supervised learning: supervised learning relies on pre-labeled training data, while reinforcement learning learns optimal strategies (policies) through dynamic interaction with the environment.
The core mechanism of reinforcement learning is the “state-action-reward” loop:
- The agent observes the current environment state
- It selects and executes an action according to its current policy
- The environment returns an immediate reward and a new state
- The agent updates its policy based on the reward signal
Typical Application Scenarios:
- Robot control: Robotic arm grasping objects
- Autonomous driving: Vehicles learn how to drive safely and efficiently through simulation
- Game AI: AlphaGo continuously optimizes chess strategy through self-play
- Resource scheduling: Data centers optimize server resource allocation through reinforcement learning
Common Algorithm Types:
- Value-based algorithms (like Q-learning)
- Policy-based algorithms (like Policy Gradient)
- Actor-Critic algorithms
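The “state-action-reward” loop and a value-based algorithm can be sketched with tabular Q-learning on a made-up corridor world (the environment, rewards, and hyperparameters below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny corridor world: states 0..4, start at state 0,
# reward +1 on reaching state 4. Actions: 0 = left, 1 = right.
N_STATES, GOAL = 5, 4
Q = np.zeros((N_STATES, 2))                # value table Q[state, action]
alpha, gamma, eps = 0.5, 0.9, 0.2          # learning rate, discount, exploration

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

for _ in range(200):                       # episodes
    s = 0
    while s != GOAL:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

policy = Q.argmax(axis=1)                  # greedy action per state
```

After training, the greedy policy walks right toward the reward in every non-goal state, showing how the reward signal alone shapes behaviour without any labeled examples.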
Input/Output Space and Feature Space
In the wine scenario above, each glass is a sample, and the ten glasses form a sample set. Attributes such as alcohol concentration and color depth are called [Features], and the ten glasses are distributed as points in a [Multi-dimensional Feature Space].
All samples fed into the program’s “learning system” are called [Input], forming the [Input Space]. The predicted values produced during learning are called [Output], forming the [Output Space].
In supervised learning, when the output variables are continuous, the prediction problem is a regression problem; when the output variables take values in a finite discrete set, it is a classification problem.
Overfitting and Underfitting
When the hypothesis space contains models of different complexity, we face a model-selection problem: we want a learner that performs well on new samples.
Fit can be understood as the degree to which the model has mastered the underlying regularities of the dataset. If the fit is poor, the model has not fully captured those regularities, and its accuracy in classification and prediction will be low.
Overfitting: when the model learns the training samples “too well,” it may mistake idiosyncrasies of the training samples for general properties of all potential samples. The selected model is then more complex than the true model, and generalization performance degrades.
Underfitting: the opposite, where the learner’s capacity is too low and even the general properties of the training samples have not been learned well.
Overfitting Chart Explanation
- Left chart (first-order polynomial, underfitting): training accuracy and cross-validation accuracy are close together but both relatively low, converging around 0.88
- Middle chart (third-order polynomial, good fit): training accuracy and cross-validation accuracy are close together and both relatively high
- Right chart (tenth-order polynomial, overfitting): training accuracy is very high (0.95) but cross-validation accuracy is clearly lower (0.91), leaving a large gap between them
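The same under/over-fitting pattern can be reproduced with polynomial regression on synthetic data (an analogous sketch using MSE instead of accuracy; the cubic data-generating function and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Data generated from a cubic with noise; split into train / validation halves.
x = np.sort(rng.uniform(-3, 3, 60))
y = 0.5 * x**3 - x + rng.normal(0, 2.0, 60)
x_tr, y_tr = x[::2], y[::2]          # even indices -> training set
x_va, y_va = x[1::2], y[1::2]        # odd indices  -> validation set

def mse(deg):
    """Fit a degree-`deg` polynomial on train, return (train MSE, val MSE)."""
    coef = np.polyfit(x_tr, y_tr, deg)
    return (np.mean((np.polyval(coef, x_tr) - y_tr) ** 2),
            np.mean((np.polyval(coef, x_va) - y_va) ** 2))

tr1, va1 = mse(1)     # underfit: both errors high
tr3, va3 = mse(3)     # good fit: both errors low and close together
tr10, va10 = mse(10)  # overfit: train error keeps dropping; val error typically rises
```

Training error always shrinks as the degree grows, but validation error is what reveals whether the extra complexity generalizes.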
Machine Learning Workflow
1. Data Collection & Preprocessing: Data is the foundation of machine learning. It is typically collected from various sources and then preprocessed: cleaned, normalized, missing values handled, and so on
2. Feature Engineering: Extract useful features from raw data, including feature selection, feature scaling, encoding, etc.
3. Model Selection: Choose a suitable algorithm based on the problem type (classification, regression, clustering, etc.)
4. Model Training: Feed the preprocessed data into the selected algorithm and let the model learn from the training data
5. Model Evaluation: Evaluate model performance on a test set; common metrics include accuracy, precision, recall, F1 score, and mean squared error
6. Model Tuning: Further optimize performance by adjusting model hyperparameters or introducing more data
7. Model Deployment & Application: Once the model passes evaluation, it can be deployed in real applications
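The workflow above can be sketched end to end, assuming scikit-learn is available; the synthetic dataset stands in for real collected data:

```python
# A minimal sketch of the ML workflow, assuming scikit-learn is installed.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 1-2. Collection + preprocessing: synthetic data stands in for a real source.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Steps 3-4. Model selection + training: feature scaling and a classifier
# chained as one pipeline, fitted on the training split only.
model = Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression())])
model.fit(X_tr, y_tr)

# Step 5. Evaluation on the held-out test set.
acc = accuracy_score(y_te, model.predict(X_te))
```

Tuning (step 6) would wrap this pipeline in a grid search over hyperparameters, and deployment (step 7) would serialize the fitted pipeline for serving.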
Common Machine Learning Algorithms
Linear Regression
Linear regression is the basic algorithm for solving regression problems. It builds a model by finding a linear relationship between the input variable (X) and the output variable (Y). Mathematical form: Y = β₀ + β₁X + ε.
Typical Applications: House price prediction, Sales forecasting
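The model Y = β₀ + β₁X + ε can be fitted by ordinary least squares; a minimal numpy sketch on synthetic data (the true coefficients β₀ = 2.0 and β₁ = 3.0 are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data following Y = beta0 + beta1 * X + noise.
X = rng.uniform(0, 10, 100)
Y = 2.0 + 3.0 * X + rng.normal(0, 0.5, 100)

# Ordinary least squares via the design matrix [1, X].
A = np.column_stack([np.ones_like(X), X])
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
beta0_hat, beta1_hat = beta            # estimated intercept and slope
```

With 100 points and modest noise, the estimates land close to the true coefficients, and the fitted line can then predict Y for new X values.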
Logistic Regression
Logistic regression is a classic algorithm for solving binary classification problems. Although its name contains “regression,” it is actually a classification algorithm: it maps the linear regression output into the (0, 1) interval through the Sigmoid function.
Typical Applications: Spam detection, Disease diagnosis
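The Sigmoid mapping can be shown directly: a linear score w·x + b (the weights below are made up) becomes a probability in (0, 1), and thresholding at 0.5 gives the class:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued linear score into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

# Linear scores for three inputs, turned into P(y = 1 | x).
w, b = 1.5, -3.0
x = np.array([0.0, 2.0, 4.0])
p = sigmoid(w * x + b)
labels = (p >= 0.5).astype(int)        # threshold at 0.5 to classify
```

Training logistic regression means choosing w and b to maximize the likelihood of the labeled data; the Sigmoid itself is what converts regression output into class probability.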
Decision Tree
A decision tree is a tree-structured algorithm that builds its model by recursively partitioning the data space. Each internal node represents a test on a feature; each leaf node represents a prediction.
Typical Applications: Loan approval decisions, Customer classification
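A minimal loan-approval sketch, assuming scikit-learn is available; the income/debt figures and labels are invented for illustration:

```python
# Assumes scikit-learn is installed.
from sklearn.tree import DecisionTreeClassifier

# Toy loan data: features = [income (k), debt (k)], label 1 = approve, 0 = reject.
X = [[80, 10], [60, 5], [90, 40], [30, 20], [20, 25], [50, 45]]
y = [1, 1, 1, 0, 0, 0]

# A shallow tree: each internal node tests one feature against a threshold.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

pred = tree.predict([[70, 8]])[0]      # a high-income, low-debt applicant
```

On this toy data a single income threshold already separates the classes, which is exactly the kind of feature test an internal node encodes.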
Random Forest
Random forest is an ensemble method built from decision trees, improving performance by constructing multiple trees and combining their predictions.
Advantages: Better generalization ability, can effectively reduce overfitting risk, can evaluate feature importance
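The feature-importance evaluation mentioned above can be sketched on synthetic data, assuming scikit-learn is available:

```python
# Assumes scikit-learn is installed.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic data where only 2 of the 6 features are informative.
X, y = make_classification(n_samples=200, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)

# An ensemble of 100 trees, each trained on a bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_   # one weight per feature, sums to 1
```

The importance vector lets you rank features by how much they contribute to the ensemble's splits, which is useful for feature selection.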
Support Vector Machine (SVM)
Support vector machine is a powerful classification and regression algorithm. Its core idea is to find the optimal decision boundary (hyperplane) that maximizes the margin between different categories.
Advantages: Theoretically complete, globally optimal, suitable for small sample, high-dimensional scenarios
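A margin-maximizing boundary can be shown on two separable clusters, assuming scikit-learn is available; the points are made up for illustration:

```python
# Assumes scikit-learn is installed.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters of made-up 2-D points.
X = np.array([[0, 0], [1, 1], [0, 1], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear-kernel SVM finds the maximum-margin separating hyperplane.
svm = SVC(kernel="linear")
svm.fit(X, y)

n_support = svm.n_support_          # support vectors per class define the margin
pred = svm.predict([[4, 4]])[0]     # a new point nearer the second cluster
```

Only the support vectors (the points closest to the boundary) determine the hyperplane, which is why SVMs cope well with small-sample, high-dimensional settings.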
K-Means Clustering
K-means clustering is an unsupervised learning algorithm whose goal is to divide n data points into k clusters, so that points within the same cluster are as close as possible and points in different clusters are as far apart as possible.
Applications: Customer segmentation, Image compression
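The alternating assign/update procedure (Lloyd's algorithm) can be sketched from scratch on two made-up clusters; real implementations use smarter initialization such as k-means++:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated, made-up clusters of 2-D points.
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
               rng.normal([5, 5], 0.5, (50, 2))])

# Lloyd's algorithm: alternate assignment and centroid-update steps.
# For simplicity, initialise with one point from each region.
k = 2
centroids = X[[0, 50]].copy()
for _ in range(20):
    # assign each point to its nearest centroid
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    labels = np.argmin(dists, axis=1)
    # move each centroid to the mean of its assigned points
    centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
```

On separated data the centroids converge to the two cluster means within a couple of iterations; in practice you would also choose k (e.g. via the elbow method) rather than assume it.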
Neural Networks
Neural networks mimic the structure of biological neurons, consisting of an input layer, hidden layers, and an output layer, with non-linear transformations implemented through activation functions.
Application Domains: Computer Vision, Natural Language Processing, Speech Recognition
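The layer structure can be sketched as a forward pass through a tiny two-layer network; the weights below are random placeholders (training would adjust them via backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """Activation function: the non-linearity applied between layers."""
    return np.maximum(0.0, z)

# A tiny network: 3 inputs -> 4 hidden units -> 2 outputs.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

def forward(x):
    h = relu(x @ W1 + b1)          # hidden layer: linear map + activation
    return h @ W2 + b2             # output layer: linear map

out = forward(np.array([1.0, 0.5, -0.5]))
```

Without the activation function the two layers would collapse into a single linear map; the non-linearity is what lets stacked layers represent complex functions.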
Challenges in Machine Learning
- Data Quality: Model performance largely depends on data quality and quantity
- Model Overfitting: Model performs excellently on training data but poorly on new data
- Interpretability: Complex machine learning models are often difficult to interpret, making their internal decision logic hard to explain
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Treating logistic regression as a regression problem | Misleading name: its output is a category probability, so it is essentially classification | Distinguish “regression/classification” by the label type (continuous vs. discrete) |
| Training set score high, validation set score obviously low | Overfitting | Reduce complexity, regularization, more data, cross-validation |
| Both training and validation sets low and close | Underfitting | Enhance features, use stronger model, relax hypothesis space |
| Reinforcement learning training unstable | Poor reward function design | Restructure rewards, add constraints/penalty items, adjust exploration rate |