Introduction
Building a Decision Tree involves asking a sequence of questions to split your data. But which question should you ask first?
Should you ask "Is it raining?" or "Is it humid?" first? To answer this, we need a way to measure how much "cleaner" (purer) the data becomes after a split.
Information Gain (IG) is exactly this measure. It quantifies the reduction in entropy (uncertainty) achieved by splitting on a specific feature. It is essentially Mutual Information between the feature and the label.
Intuition: The 20 Questions Game
Imagine playing "20 Questions." Your goal is to identify a secret object with yes/no questions.
Bad Question
"Is the object a banana?"
99% chance the answer is "No." You are left with almost the same uncertainty. Low Information Gain.
Good Question
"Is the object alive?"
Splits the world roughly in half. Regardless of the answer, you have eliminated 50% of possibilities. High Information Gain.
Decision Trees use Information Gain to find the "Is it alive?" questions, not the "Is it a banana?" questions.
Recap: Entropy as Impurity
Before we split, we calculate the entropy of the current dataset (the parent node). Entropy measures "impurity" or "disorder."
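As a reminder, for a node $S$ whose examples fall into classes with proportions $p_i$, entropy is measured in bits:

$$H(S) = -\sum_{i} p_i \log_2 p_i$$

A 50/50 binary node has the maximum entropy of 1 bit; a pure node has entropy 0.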
The Information Gain Formula
Information Gain = Entropy Before - Entropy After. Since a split creates multiple child nodes, "Entropy After" is the weighted average of the child entropies:

$$IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \, H(S_v)$$

- $H(S)$: entropy of the parent node.
- $\frac{|S_v|}{|S|}$: weight of child $v$; large children matter more than small ones.
- $H(S_v)$: entropy of child node $v$.
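These two quantities translate directly into code. Here is a minimal sketch in Python; the helper names `entropy` and `information_gain` are mine, not from any particular library:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_lists):
    """Parent entropy minus the size-weighted average of child entropies."""
    n = len(parent_labels)
    entropy_after = sum(len(c) / n * entropy(c) for c in child_label_lists)
    return entropy(parent_labels) - entropy_after

# A perfect split of a 50/50 node removes all uncertainty: IG = 1.0 bit
labels = ["yes", "yes", "yes", "no", "no", "no"]
print(information_gain(labels, [["yes", "yes", "yes"], ["no", "no", "no"]]))  # 1.0
```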
Interactive: Split Visualization
Drag the split point to see how Information Gain changes. The best split creates the purest child nodes.
Recursive Partitioning
Build a Depth-2 Decision Tree. First split the root, then split the children.
The total Information Gain shown is the sum of the IG at each active (split) node.
Step-by-Step Calculation
Dataset: 10 examples, 5 Positive (+), 5 Negative (-). Testing feature "Windy" (True/False).
Step 1: Parent Entropy
With 5 positive and 5 negative examples: $H(S) = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1.0$ bit.
Step 2: Split Data
- Windy=False (6 samples): 4 Pos, 2 Neg
- Windy=True (4 samples): 1 Pos, 3 Neg
Step 3: Child Entropies
Left (Windy=False): $H = -\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6} \approx 0.918$ bits.
Right (Windy=True): $H = -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} \approx 0.811$ bits.
Step 4: Information Gain
$IG = 1.0 - \left(\frac{6}{10} \times 0.918 + \frac{4}{10} \times 0.811\right) \approx 0.125$ bits.
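These numbers are easy to sanity-check with a few lines of Python (counts hard-coded from the table above):

```python
import math

def entropy(pos, neg):
    """Entropy (bits) of a node with the given positive/negative counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

h_parent = entropy(5, 5)   # 1.000
h_false  = entropy(4, 2)   # ~0.918
h_true   = entropy(1, 3)   # ~0.811
ig = h_parent - (0.6 * h_false + 0.4 * h_true)
print(round(ig, 3))        # ≈ 0.125
```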
The ID3 Algorithm
ID3 (Iterative Dichotomiser 3) uses Information Gain to build trees greedily (a code sketch follows the steps below):
- Calculate H(S) for the current node.
- For every possible feature A, calculate IG(S, A).
- Select the feature with the highest Information Gain.
- Split the data using that feature.
- Recurse on each child until nodes are pure or max depth is reached.
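A minimal recursive sketch of this loop for categorical features, under the simplifying assumption that each example is a dict of feature values; the structure and names are illustrative, not taken from a specific library:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """IG of splitting `rows` (list of dicts) on a categorical `feature`."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feature], []).append(y)
    entropy_after = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - entropy_after

def id3(rows, labels, features, depth=0, max_depth=5):
    # Stop when the node is pure, no features remain, or depth is exhausted.
    if len(set(labels)) == 1 or not features or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    tree = {"feature": best, "children": {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree["children"][value] = id3(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [f for f in features if f != best],
            depth + 1,
            max_depth,
        )
    return tree
```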
Gini Impurity vs Entropy
Scikit-Learn's CART algorithm uses Gini Impurity by default instead of Entropy. Why?
Entropy
Pro: Information-theoretically grounded.
Con: Slower (log computation).
Gini Impurity
Pro: Faster to compute (only squaring, no logarithm).
Con: Lacks a direct information-theoretic interpretation.
In practice the two criteria pick the same split roughly 95% of the time and produce nearly identical trees; Gini is the default simply because it is slightly faster.
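Switching criteria in Scikit-Learn is a single argument. A quick sketch (the Iris dataset is just a convenient stand-in, and training accuracy here only demonstrates the API, not a benchmark):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    clf.fit(X, y)
    print(criterion, clf.get_depth(), clf.score(X, y))
```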
Interactive: Gini Impurity vs Entropy
Compare the shape of Gini Impurity and Entropy. Notice how Gini is just a quadratic approximation of Entropy.
- Entropy $H(p) = -p\log_2 p - (1-p)\log_2(1-p)$: maximum of 1.0 bit at p = 0.5.
- Gini Impurity $G(p) = 2p(1-p)$: maximum of 0.5 at p = 0.5.
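To reproduce the comparison outside the interactive, a short matplotlib sketch of both curves:

```python
import numpy as np
import matplotlib.pyplot as plt

# Avoid p = 0 and p = 1, where log2 is undefined.
p = np.linspace(0.001, 0.999, 500)
entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
gini = 2 * p * (1 - p)

plt.plot(p, entropy, label="Entropy H(p)")
plt.plot(p, gini, label="Gini G(p)")
plt.xlabel("p (proportion of positive class)")
plt.ylabel("impurity")
plt.legend()
plt.show()
```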
Limitations & Gain Ratio
The "User ID" Problem
Information Gain is biased towards features with many unique values.
If you split on "User ID" (unique for everyone), each child has 1 sample with H = 0. Maximum IG! But this is useless for generalization.
Solution: Gain Ratio (C4.5)
Gain Ratio divides Information Gain by the "intrinsic information" (split info) of the split itself, penalizing splits with many branches.
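A minimal sketch of that penalty in Python (function names are illustrative, and C4.5 itself adds further refinements):

```python
import math

def split_info(child_sizes):
    """Intrinsic information of a split: entropy of the branch proportions."""
    n = sum(child_sizes)
    return -sum(s / n * math.log2(s / n) for s in child_sizes if s)

def gain_ratio(info_gain, child_sizes):
    si = split_info(child_sizes)
    return info_gain / si if si > 0 else 0.0

# The "Windy" split from earlier: IG ≈ 0.125 over children of size 6 and 4.
print(gain_ratio(0.125, [6, 4]))   # ≈ 0.13

# A "User ID" style split into 10 singleton branches has huge intrinsic info,
# so even a maximal IG of 1.0 bit is heavily penalized:
print(gain_ratio(1.0, [1] * 10))   # 1.0 / log2(10) ≈ 0.30
```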