Big Data 194 - Data Mining: Machine Learning Overview
A Simple Example
In a bar, ten almost identical glasses of wine sit on the counter. The boss proposes a game: guess correctly and you drink for free; guess wrong and you pay three times the price of the wine. Each of the ten glasses differs slightly; the first five are [Cabernet Sauvignon] and the last five are [Pinot Noir]. Now the boss pours one more glass, and you must correctly identify which category it belongs to.
Algorithm System
Machine Learning (ML) is a branch of Artificial Intelligence (AI) that enables computer systems to learn and make decisions, much as humans do, without explicit programming instructions. The core of machine learning is extracting patterns from data and using those patterns to predict or classify new data.
Machine learning methods are algorithms that generate models from data, also called learning algorithms. They include:
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
Supervised Learning
Supervised learning models the relationship between features and labels (types) in the data. Once the model is determined, it can be applied to new, unseen data. Supervised tasks divide into classification and regression:
- In classification tasks, labels are discrete values
- In regression tasks, labels are continuous values
Supervised learning algorithms rely on labeled datasets during training: each sample in the dataset has a corresponding correct output, and the algorithm learns to predict outputs from inputs through these “input-output” pairs.
Applications: Classification problems (spam detection), Regression problems (house price prediction)
Common Algorithms: Linear Regression, Decision Tree, Random Forest, Support Vector Machine (SVM), Neural Networks
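The idea of learning from labeled “input-output” pairs can be sketched with a tiny nearest-neighbour classifier, echoing the wine example from the opening. All feature values below are invented for illustration:

```python
import numpy as np

# Toy labeled dataset: 10 "glasses of wine" described by two features
# (alcohol %, colour depth). Labels: 0 = Cabernet Sauvignon, 1 = Pinot Noir.
X = np.array([[13.5, 0.90], [13.8, 0.85], [14.0, 0.95], [13.6, 0.90], [13.9, 0.88],
              [12.5, 0.40], [12.8, 0.45], [12.6, 0.50], [12.9, 0.42], [12.7, 0.48]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

def predict_1nn(x_new):
    """Classify a new sample by the label of its nearest training sample."""
    dists = np.linalg.norm(X - x_new, axis=1)
    return y[np.argmin(dists)]

# The extra glass: its features place it near the Pinot Noir group.
label = predict_1nn(np.array([12.6, 0.44]))
```

The entire “model” here is the labeled dataset itself; real algorithms such as logistic regression or decision trees instead compress those pairs into learned parameters.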
Unsupervised Learning
Unsupervised learning is an important learning paradigm in machine learning, with the core characteristic of not relying on manually labeled data. Unlike supervised learning, unsupervised learning lets algorithms autonomously explore the data’s intrinsic structure and patterns, a typical “data-driven” learning approach.
Main Task Types
Clustering
- Goal: Automatically group similar data points
- Typical algorithms: K-means clustering, Hierarchical clustering, DBSCAN
- Application scenarios: Customer segmentation, Document topic classification, Anomaly detection
Dimensionality Reduction
- Goal: Reduce feature count while preserving important information
- Typical methods: PCA, t-SNE, Autoencoder
- Application scenarios: High-dimensional data visualization, Feature engineering preprocessing, Image compression
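The dimensionality-reduction idea can be sketched with a from-scratch PCA on synthetic data: 3-D points whose third coordinate is nearly determined by the other two, so two components capture almost all the variance. The data-generating numbers are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 samples in 3-D, but the data really varies along only ~2 directions:
# the third coordinate is almost a linear combination of the first two.
base = rng.normal(size=(100, 2))
third = base @ np.array([0.5, -0.3]) + 0.01 * rng.normal(size=100)
X = np.column_stack([base, third])

# PCA via SVD of the centred data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S**2 / np.sum(S**2)   # fraction of variance per component
X_2d = Xc @ Vt[:2].T              # project onto the top 2 components
```

Here the top two components explain essentially all of the variance, so the projection to 2-D loses almost no information; libraries such as scikit-learn wrap the same computation in `PCA`.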
Semi-supervised Learning
Semi-supervised learning methods sit between supervised and unsupervised learning; they can be used when only part of the data is labeled.
Reinforcement Learning
Reinforcement learning is a machine learning method based on trial and error, fundamentally different from supervised learning: supervised learning relies on pre-labeled training data, while reinforcement learning learns optimal strategies (policies) through dynamic interaction with the environment.
The core mechanism of reinforcement learning is the “state-action-reward” loop:
- The agent observes the current environment state
- It selects and executes an action according to its current policy
- The environment returns an immediate reward and a new state
- The agent updates its policy based on the reward signal
Typical Application Scenarios:
- Robot control: Robotic arm grasping objects
- Autonomous driving: Vehicles learn how to drive safely and efficiently through simulation
- Game AI: AlphaGo continuously optimizes chess strategy through self-play
- Resource scheduling: Data centers optimize server resource allocation through reinforcement learning
Common Algorithm Types:
- Value-based algorithms (like Q-learning)
- Policy-based algorithms (like Policy Gradient)
- Actor-Critic algorithms
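The “state-action-reward” loop and a value-based algorithm can be sketched with tabular Q-learning on a made-up corridor world (the environment, rewards, and hyperparameters below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny corridor world: states 0..4, start at state 0,
# reward +1 on reaching state 4. Actions: 0 = left, 1 = right.
N_STATES, GOAL = 5, 4
Q = np.zeros((N_STATES, 2))                # value table Q[state, action]
alpha, gamma, eps = 0.5, 0.9, 0.2          # learning rate, discount, exploration

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

for _ in range(200):                       # episodes
    s = 0
    while s != GOAL:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

policy = Q.argmax(axis=1)                  # greedy action per state
```

After training, the greedy policy walks right toward the reward in every non-goal state, showing how the reward signal alone shapes behaviour without any labeled examples.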
Input/Output Space and Feature Space
In the wine scenario above, each glass is a sample, and the ten glasses form a sample set. Attributes such as alcohol concentration and color depth are called [Features], and the ten glasses are distributed as points in a [Multi-dimensional Feature Space].
All samples fed into the program’s “learning system” are called [Input], forming the [Input Space]. The predicted values produced during learning are called [Output], forming the [Output Space].
In supervised learning, when the output variables are continuous, the prediction problem is a regression problem; when the output variables take values in a finite discrete set, it is a classification problem.
Overfitting and Underfitting
When the hypothesis space contains models of different complexity, we face a model-selection problem: we want a learner that performs well on new samples.
Fit can be understood as the degree to which the model has mastered the underlying regularities of the dataset. If the fit is poor, the model has not fully captured those regularities, and its accuracy in classification and prediction will be low.
Overfitting: when the model learns the training samples “too well,” it may mistake idiosyncrasies of the training samples for general properties of all potential samples. The selected model is then more complex than the true model, and generalization performance degrades.
Underfitting: the opposite, where the learner’s capacity is too low and even the general properties of the training samples have not been learned well.
Overfitting Chart Explanation
- Left chart (first-order polynomial, underfitting): training accuracy and cross-validation accuracy are close together but both relatively low, converging around 0.88
- Middle chart (third-order polynomial, good fit): training accuracy and cross-validation accuracy are close together and both relatively high
- Right chart (tenth-order polynomial, overfitting): training accuracy is very high (0.95) but cross-validation accuracy is clearly lower (0.91), leaving a large gap between them
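The same under/over-fitting pattern can be reproduced with polynomial regression on synthetic data (an analogous sketch using MSE instead of accuracy; the cubic data-generating function and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Data generated from a cubic with noise; split into train / validation halves.
x = np.sort(rng.uniform(-3, 3, 60))
y = 0.5 * x**3 - x + rng.normal(0, 2.0, 60)
x_tr, y_tr = x[::2], y[::2]          # even indices -> training set
x_va, y_va = x[1::2], y[1::2]        # odd indices  -> validation set

def mse(deg):
    """Fit a degree-`deg` polynomial on train, return (train MSE, val MSE)."""
    coef = np.polyfit(x_tr, y_tr, deg)
    return (np.mean((np.polyval(coef, x_tr) - y_tr) ** 2),
            np.mean((np.polyval(coef, x_va) - y_va) ** 2))

tr1, va1 = mse(1)     # underfit: both errors high
tr3, va3 = mse(3)     # good fit: both errors low and close together
tr10, va10 = mse(10)  # overfit: train error keeps dropping; val error typically rises
```

Training error always shrinks as the degree grows, but validation error is what reveals whether the extra complexity generalizes.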
Machine Learning Workflow
1. Data Collection & Preprocessing: Data is the foundation of machine learning. It is typically collected from various sources and then preprocessed: cleaned, normalized, missing values handled, and so on
2. Feature Engineering: Extract useful features from raw data, including feature selection, feature scaling, encoding, etc.
3. Model Selection: Choose a suitable algorithm based on the problem type (classification, regression, clustering, etc.)
4. Model Training: Feed the preprocessed data into the selected algorithm and let the model learn from the training data
5. Model Evaluation: Evaluate model performance on a test set; common metrics include accuracy, precision, recall, F1 score, and mean squared error
6. Model Tuning: Further optimize performance by adjusting model hyperparameters or introducing more data
7. Model Deployment & Application: Once the model passes evaluation, it can be deployed in real applications
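The workflow above can be sketched end to end, assuming scikit-learn is available; the synthetic dataset stands in for real collected data:

```python
# A minimal sketch of the ML workflow, assuming scikit-learn is installed.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 1-2. Collection + preprocessing: synthetic data stands in for a real source.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Steps 3-4. Model selection + training: feature scaling and a classifier
# chained as one pipeline, fitted on the training split only.
model = Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression())])
model.fit(X_tr, y_tr)

# Step 5. Evaluation on the held-out test set.
acc = accuracy_score(y_te, model.predict(X_te))
```

Tuning (step 6) would wrap this pipeline in a grid search over hyperparameters, and deployment (step 7) would serialize the fitted pipeline for serving.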
Common Machine Learning Algorithms
Linear Regression
Linear regression is the basic algorithm for solving regression problems. It builds a model by finding a linear relationship between the input variable (X) and the output variable (Y). Mathematical form: Y = β₀ + β₁X + ε.
Typical Applications: House price prediction, Sales forecasting
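The model Y = β₀ + β₁X + ε can be fitted by ordinary least squares; a minimal numpy sketch on synthetic data (the true coefficients β₀ = 2.0 and β₁ = 3.0 are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data following Y = beta0 + beta1 * X + noise.
X = rng.uniform(0, 10, 100)
Y = 2.0 + 3.0 * X + rng.normal(0, 0.5, 100)

# Ordinary least squares via the design matrix [1, X].
A = np.column_stack([np.ones_like(X), X])
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
beta0_hat, beta1_hat = beta            # estimated intercept and slope
```

With 100 points and modest noise, the estimates land close to the true coefficients, and the fitted line can then predict Y for new X values.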
Logistic Regression
Logistic regression is a classic algorithm for solving binary classification problems. Although its name contains “regression,” it is actually a classification algorithm: it maps the linear regression output into the (0, 1) interval through the Sigmoid function.
Typical Applications: Spam detection, Disease diagnosis
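The Sigmoid mapping can be shown directly: a linear score w·x + b (the weights below are made up) becomes a probability in (0, 1), and thresholding at 0.5 gives the class:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued linear score into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

# Linear scores for three inputs, turned into P(y = 1 | x).
w, b = 1.5, -3.0
x = np.array([0.0, 2.0, 4.0])
p = sigmoid(w * x + b)
labels = (p >= 0.5).astype(int)        # threshold at 0.5 to classify
```

Training logistic regression means choosing w and b to maximize the likelihood of the labeled data; the Sigmoid itself is what converts regression output into class probability.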
Decision Tree
A decision tree is a tree-structured algorithm that builds its model by recursively partitioning the data space. Each internal node represents a test on a feature; each leaf node represents a prediction.
Typical Applications: Loan approval decisions, Customer classification
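A minimal loan-approval sketch, assuming scikit-learn is available; the income/debt figures and labels are invented for illustration:

```python
# Assumes scikit-learn is installed.
from sklearn.tree import DecisionTreeClassifier

# Toy loan data: features = [income (k), debt (k)], label 1 = approve, 0 = reject.
X = [[80, 10], [60, 5], [90, 40], [30, 20], [20, 25], [50, 45]]
y = [1, 1, 1, 0, 0, 0]

# A shallow tree: each internal node tests one feature against a threshold.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

pred = tree.predict([[70, 8]])[0]      # a high-income, low-debt applicant
```

On this toy data a single income threshold already separates the classes, which is exactly the kind of feature test an internal node encodes.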
Random Forest
Random forest is an ensemble method built from decision trees, improving performance by constructing multiple trees and combining their predictions.
Advantages: Better generalization ability, can effectively reduce overfitting risk, can evaluate feature importance
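The feature-importance evaluation mentioned above can be sketched on synthetic data, assuming scikit-learn is available:

```python
# Assumes scikit-learn is installed.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic data where only 2 of the 6 features are informative.
X, y = make_classification(n_samples=200, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)

# An ensemble of 100 trees, each trained on a bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_   # one weight per feature, sums to 1
```

The importance vector lets you rank features by how much they contribute to the ensemble's splits, which is useful for feature selection.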
Support Vector Machine (SVM)
Support vector machine is a powerful classification and regression algorithm. Its core idea is to find the optimal decision boundary (hyperplane) that maximizes the margin between different categories.
Advantages: Theoretically complete, globally optimal, suitable for small sample, high-dimensional scenarios
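A margin-maximizing boundary can be shown on two separable clusters, assuming scikit-learn is available; the points are made up for illustration:

```python
# Assumes scikit-learn is installed.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters of made-up 2-D points.
X = np.array([[0, 0], [1, 1], [0, 1], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear-kernel SVM finds the maximum-margin separating hyperplane.
svm = SVC(kernel="linear")
svm.fit(X, y)

n_support = svm.n_support_          # support vectors per class define the margin
pred = svm.predict([[4, 4]])[0]     # a new point nearer the second cluster
```

Only the support vectors (the points closest to the boundary) determine the hyperplane, which is why SVMs cope well with small-sample, high-dimensional settings.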
K-Means Clustering
K-means clustering is an unsupervised learning algorithm whose goal is to divide n data points into k clusters, so that points within the same cluster are as close as possible and points in different clusters are as far apart as possible.
Applications: Customer segmentation, Image compression
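The alternating assign/update procedure (Lloyd's algorithm) can be sketched from scratch on two made-up clusters; real implementations use smarter initialization such as k-means++:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated, made-up clusters of 2-D points.
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
               rng.normal([5, 5], 0.5, (50, 2))])

# Lloyd's algorithm: alternate assignment and centroid-update steps.
# For simplicity, initialise with one point from each region.
k = 2
centroids = X[[0, 50]].copy()
for _ in range(20):
    # assign each point to its nearest centroid
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    labels = np.argmin(dists, axis=1)
    # move each centroid to the mean of its assigned points
    centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
```

On separated data the centroids converge to the two cluster means within a couple of iterations; in practice you would also choose k (e.g. via the elbow method) rather than assume it.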
Neural Networks
Neural networks mimic the structure of biological neurons, consisting of an input layer, hidden layers, and an output layer, with non-linear transformations implemented through activation functions.
Application Domains: Computer Vision, Natural Language Processing, Speech Recognition
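The layer structure can be sketched as a forward pass through a tiny two-layer network; the weights below are random placeholders (training would adjust them via backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """Activation function: the non-linearity applied between layers."""
    return np.maximum(0.0, z)

# A tiny network: 3 inputs -> 4 hidden units -> 2 outputs.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

def forward(x):
    h = relu(x @ W1 + b1)          # hidden layer: linear map + activation
    return h @ W2 + b2             # output layer: linear map

out = forward(np.array([1.0, 0.5, -0.5]))
```

Without the activation function the two layers would collapse into a single linear map; the non-linearity is what lets stacked layers represent complex functions.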
Challenges in Machine Learning
- Data Quality: Model performance largely depends on data quality and quantity
- Model Overfitting: Model performs excellently on training data but poorly on new data
- Interpretability: Complex machine learning models are often difficult to interpret, making their internal decision logic hard to explain
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Treating logistic regression as a regression problem | Misleading name: its output is a category probability, so it is essentially classification | Distinguish “regression/classification” by the label type (continuous vs. discrete) |
| Training set score high, validation set score obviously low | Overfitting | Reduce complexity, regularization, more data, cross-validation |
| Both training and validation sets low and close | Underfitting | Enhance features, use stronger model, relax hypothesis space |
| Reinforcement learning training unstable | Poor reward function design | Restructure rewards, add constraints/penalty items, adjust exploration rate |