Decision Tree Introduction
Basic Introduction
A decision tree is a non-linear supervised learning model. It mirrors the conditional-branch (if-else) structure familiar from programming.
Characteristics of decision tree:
- It is a tree structure, essentially composed of multiple judgment nodes
- Each internal node represents a judgment on an attribute
- Each branch represents one outcome of that judgment
- Each leaf node represents a classification result
Core Idea
Through a series of “if … then …” split rules, a complex decision process is decomposed into several simple judgments; the leaf node finally reached gives the prediction or decision.
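A hand-written rule chain illustrates the idea (a hypothetical example; the feature names and values are made up). A decision tree learns rules like these from data automatically:

```python
def predict(weather, wind):
    """One chain of if/then rules, read from root to leaf."""
    if weather == "sunny":          # root node: judgment on "weather"
        if wind == "strong":        # internal node: judgment on "wind"
            return "no"             # leaf: don't play
        return "yes"                # leaf: play
    return "no"                     # leaf: don't play

assert predict("sunny", "weak") == "yes"
assert predict("rain", "weak") == "no"
```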
Applicable Tasks
Both classification and regression; decision trees are also commonly used for feature engineering (e.g. automatic binning) and interpretability analysis.
Representative Algorithms
ID3, C4.5, CART (most commonly used), and derived ensemble methods: Random Forest, Gradient Boosting Trees (GBDT / XGBoost / LightGBM / CatBoost).
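As a quick usage sketch (assuming scikit-learn is installed; its bundled iris dataset stands in for real data), CART is the algorithm behind `sklearn.tree.DecisionTreeClassifier`:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# CART-style tree: binary splits chosen by Gini impurity
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)
assert train_acc > 0.9  # a shallow tree already fits iris closely
```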
Structure and Terminology
- Root node: Contains complete sample set
- Internal node: Splits sample into “purer” subsets based on some feature and threshold
- Leaf node: Outputs class label or numeric prediction result
- Path: A decision chain from root to leaf, equivalent to a rule combination
Classification Principle
Based on the information in the first four columns, use a decision tree to predict whether a car accident occurs. How should the root node be chosen?
By Weather
Use the “weather” column as the root node and build the decision tree to predict.
By Temperature
Use the “temperature” column as the root node and build the decision tree to predict.
By Humidity
Use the “humidity” column as the root node and build the decision tree to predict.
By Wind
Use the “wind” column as the root node and build the decision tree to predict.
Simple Summary
Only when “weather” is used as the root node is the tree relatively short and both branches classified cleanly; with any other column as the root node, neither branch is pure, and both sides still need a further split on weather.
Classification Principle Summary:
Building a decision tree is a recursive process of repeatedly splitting the data. Each split tries to send samples of the same category to the same side of the tree, and splitting stops once a leaf node contains samples of only one category. Splitting this way makes the data in each subset purer, which reduces the tree's height and the number of training iterations.
Classification Principle
Introduction to Entropy
In physics, entropy measures the degree of “disorder”: the more ordered a system, the lower its entropy; the more chaotic or dispersed, the higher. In 1948, Shannon proposed the concept of information entropy.
Purity or disorder (the amount of information) can be measured with information entropy or the Gini coefficient.
Entropy definition (for a column whose K categories appear with proportions p₁ … pK, log base 2):
H = −(p₁×log p₁ + p₂×log p₂ + … + pK×log pK)
- The more mixed the categories in a column, the higher the entropy
- The purer the column, the lower the entropy
- If the “has job” column contains only the category “no”, its information entropy is H=−(1×log1)=0
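The definition above can be sketched in Python (a minimal sketch; the column values are made up):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a label column."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

# A single-category column, like "has job" = all "no": H = -(1 x log 1) = 0
assert entropy(["no", "no", "no"]) == 0
# A perfectly mixed binary column has the maximum entropy of 1 bit
assert entropy(["yes", "no", "yes", "no"]) == 1.0
```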
In the figure above, suppose the “has job”, “age”, “credit situation”, and “has house” columns are used to predict the “category” column with a decision tree. Choosing the root node means finding the column that, when used as the split condition, classifies “category” most thoroughly: that is, the column whose split reduces the information entropy of “category” the most compared with no split at all (the biggest decrease means lower entropy and a more thorough classification). That column becomes the node's split condition. Measuring this requires conditional entropy and information gain.
Conditional Entropy
Definition: the information entropy of the target column under a given split condition, i.e. the remaining uncertainty of Y given X. It is the entropy of Y within each group of X, weighted by the group's probability: H(Y|X) = Σ p(x) × H(Y|X=x), summed over the values x of X.
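A minimal sketch of conditional entropy on a hypothetical 4-row table (column names and values are made up). Here “weather” perfectly predicts the label, so no uncertainty remains:

```python
from collections import defaultdict
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a column."""
    total = len(values)
    counts = defaultdict(int)
    for v in values:
        counts[v] += 1
    return -sum((c / total) * log2(c / total) for c in counts.values())

def conditional_entropy(feature, labels):
    """H(Y|X): entropy of the labels within each feature-value group,
    weighted by the group's share of the samples."""
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[x].append(y)
    n = len(labels)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

weather = ["sunny", "sunny", "rain", "rain"]
label   = ["yes",   "yes",   "no",   "no"]
assert entropy(label) == 1.0                      # uncertainty before the split
assert conditional_entropy(weather, label) == 0   # none left after the split
```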
Information Gain
Definition: the amount by which entropy decreases, i.e. the information entropy before the split minus the conditional entropy after the split: Gain(D, A) = H(D) − H(D|A).
Building the tree by always choosing the attribute with the largest information gain as the split node is known as the ID3 algorithm.
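ID3's root-node selection can be sketched on a hypothetical toy table (column names and values are made up): compute each column's information gain and take the largest.

```python
from collections import defaultdict
from math import log2

def entropy(values):
    total = len(values)
    counts = defaultdict(int)
    for v in values:
        counts[v] += 1
    return -sum((c / total) * log2(c / total) for c in counts.values())

def conditional_entropy(feature, labels):
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[x].append(y)
    n = len(labels)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

def information_gain(feature, labels):
    """Gain(D, A) = H(D) - H(D|A): entropy before the split minus after."""
    return entropy(labels) - conditional_entropy(feature, labels)

# Toy table: "weather" separates the labels perfectly, "temperature" not at all.
columns = {
    "weather":     ["sunny", "sunny", "rain", "rain"],
    "temperature": ["high",  "low",   "high", "low"],
}
label = ["yes", "yes", "no", "no"]

# ID3 picks the column with the largest information gain as the root node.
best = max(columns, key=lambda c: information_gain(columns[c], label))
assert best == "weather"
```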
Gini Coefficient
The Gini coefficient can also measure how mixed a sample is. With K categories in the current column and pₖ the proportion of category k, the formula is:
Gini = 1 − Σ pₖ² (summed over the K categories)
A smaller Gini coefficient means purer information (fewer categories); a larger one means more mixed information (more categories). Gini gain is computed in the same way as information gain. If a column has only one value, its Gini coefficient is 0.
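The Gini formula above is a one-liner in Python (a minimal sketch; the values are made up):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared category proportions."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

# A column with only one value is perfectly pure
assert gini(["yes", "yes", "yes"]) == 0
# A 50/50 binary column is maximally impure for two categories
assert gini(["yes", "no", "yes", "no"]) == 0.5
```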
Information Gain Rate
In the figure above, suppose “record ID” is also considered as a split condition. The conditional entropy of “whether to loan” given “record ID” is 0, so “whether to loan” has the largest information gain under the “record ID” condition: the split separates every sample perfectly, the entropy after splitting is 0, and the classification of the training data is completely correct. But the result is a huge tree with one branch per record, so this split is unreasonable.
Filtering split conditions by information gain alone is biased toward attributes with many distinct values, which easily leads to overfitting. The information gain rate solves this problem.
For example, under the “record ID” condition the information gain of “whether to loan” is maximal, but the entropy H(record ID) of the attribute itself is also large. Dividing the gain by H(record ID) gives the gain rate under the “record ID” condition, which turns out relatively small; this offsets the drawback that raw information gain favors attributes with many distinct values.
Building a decision tree with the information gain rate is known as the C4.5 algorithm. In general, choosing split conditions by information gain rate is more appropriate than by raw information gain.
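The “record ID” example can be checked numerically with a sketch of the gain rate (a hypothetical 4-row table; names and values are made up):

```python
from collections import defaultdict
from math import log2

def entropy(values):
    total = len(values)
    counts = defaultdict(int)
    for v in values:
        counts[v] += 1
    return -sum((c / total) * log2(c / total) for c in counts.values())

def conditional_entropy(feature, labels):
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[x].append(y)
    n = len(labels)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

def gain_ratio(feature, labels):
    """C4.5's criterion: information gain divided by the entropy
    H(feature) of the splitting attribute itself."""
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0
    gain = entropy(labels) - conditional_entropy(feature, labels)
    return gain / split_info

# "record_id" gives maximal raw gain (every sample isolated), but its
# large split information pushes its gain rate below "weather"'s.
record_id = ["1", "2", "3", "4"]
weather   = ["sunny", "sunny", "rain", "rain"]
label     = ["yes", "yes", "no", "no"]
assert gain_ratio(weather, label) == 1.0    # gain 1 / split info 1
assert gain_ratio(record_id, label) == 0.5  # gain 1 / split info 2
```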