Information Theory


Information Gain

The metric that tells a Decision Tree the best question to ask next. The practical application of entropy.

Introduction

Building a Decision Tree involves asking a sequence of questions to split your data. But which question should you ask first?

Should you ask "Is it raining?" or "Is it humid?" first? To answer this, we need a way to measure how much "cleaner" (purer) the data becomes after a split.

Information Gain (IG) is exactly this measure. It quantifies the reduction in entropy (uncertainty) achieved by splitting on a specific feature. It is essentially Mutual Information between the feature and the label.

Intuition: The 20 Questions Game

Imagine playing "20 Questions." Your goal is to identify a secret object with yes/no questions.

Bad Question

"Is the object a banana?"

99% chance the answer is "No." You are left with almost the same uncertainty. Low Information Gain.

Good Question

"Is the object alive?"

Splits the world roughly in half. Regardless of the answer, you have eliminated 50% of possibilities. High Information Gain.

Decision Trees use Information Gain to find the "Is it alive?" questions, not the "Is it a banana?" questions.

Recap: Entropy as Impurity

Before we split, we calculate the entropy of the current dataset (the parent node). Entropy measures "impurity" or "disorder."

H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)
Pure (H = 0): all samples belong to the same class.
Maximally impure (H = 1): a 50/50 split between two classes.
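
As a quick sanity check, here is a minimal Python sketch of this impurity measure (the helper name `entropy` and the use of NumPy are choices made for this example, not part of the module):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    h = -np.sum(p * np.log2(p))
    return float(h) if h > 0 else 0.0   # guard against -0.0 for pure nodes

print(entropy([1, 1, 1, 1]))   # 0.0    -> pure node
print(entropy([1, 0, 1, 0]))   # 1.0    -> maximally impure (50/50)
print(entropy([1, 1, 1, 0]))   # ~0.811 -> mostly one class
```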

The Information Gain Formula

Information Gain = Entropy Before - Entropy After.

Since a split creates multiple child nodes, "Entropy After" is the weighted average of child entropies.

IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)
  • H(S): entropy of the parent node.
  • \frac{|S_v|}{|S|} (weighted sum): large children matter more than small ones.
  • H(S_v): entropy of child node v.
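
A direct translation of this formula into Python might look like the following sketch (function names are mine; it assumes a categorical feature):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    h = -np.sum(p * np.log2(p))
    return float(h) if h > 0 else 0.0

def information_gain(feature, labels):
    """IG(S, A) = H(S) minus the size-weighted entropy of the children
    obtained by splitting on each distinct value of `feature`."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    weighted_children = sum(
        np.mean(feature == v) * entropy(labels[feature == v])
        for v in np.unique(feature)
    )
    return entropy(labels) - weighted_children
```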

Interactive: Split Visualization

Drag the split point to see how Information Gain changes. The best split creates the purest child nodes.
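
In code, dragging the split point corresponds to sweeping candidate thresholds and recomputing IG at each one. A small sketch with made-up data (the toy arrays below are illustrative, not from this module):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    h = -np.sum(p * np.log2(p))
    return float(h) if h > 0 else 0.0

# Toy 1-D feature: small x is mostly class 0, large x is mostly class 1.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

parent = entropy(y)
for t in np.arange(1.5, 10.0, 1.0):   # candidate split points
    left, right = y[x <= t], y[x > t]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    print(f"split at x <= {t:3.1f}: IG = {parent - weighted:.3f}")
```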

Recursive Partitioning

Build a Depth-2 Decision Tree. First split the root, then split the children.

(Interactive widget: choose a split for the root node, then for the left and right children. The total Information Gain displayed is the sum of the IG at each active node.)

Step-by-Step Calculation

Dataset: 10 examples, 5 Positive (+), 5 Negative (-). Testing feature "Windy" (True/False).

Step 1: Parent Entropy

H(\text{Parent}) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1.0 \text{ bit}

Step 2: Split Data

  • Windy=False (6 samples): 4 Pos, 2 Neg
  • Windy=True (4 samples): 1 Pos, 3 Neg

Step 3: Child Entropies

Left child (Windy=False):

H = -\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6} \approx 0.918

Right child (Windy=True):

H = -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} \approx 0.811

Step 4: Information Gain

IG = 1.0 - \left[ \frac{6}{10} \times 0.918 + \frac{4}{10} \times 0.811 \right]

IG = 1.0 - 0.875 = 0.125 \text{ bits}
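
The same arithmetic can be double-checked in a few lines of Python (just a verification of the numbers above; the helper `H` is mine):

```python
import numpy as np

def H(pos, neg):
    """Binary entropy in bits from class counts."""
    p = np.array([pos, neg], dtype=float) / (pos + neg)
    p = p[p > 0]                      # skip empty classes (0 log 0 = 0)
    return float(-np.sum(p * np.log2(p)))

parent = H(5, 5)                      # 1.0 bit
left, right = H(4, 2), H(1, 3)        # ~0.918 and ~0.811
ig = parent - (6/10 * left + 4/10 * right)
print(round(ig, 3))                   # 0.125
```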

The ID3 Algorithm

ID3 (Iterative Dichotomiser 3) uses Information Gain to build trees greedily:

  1. Calculate H(S) for the current node.
  2. For every possible feature A, calculate IG(S, A).
  3. Select the feature with the highest Information Gain.
  4. Split the data using that feature.
  5. Recurse on each child until nodes are pure or max depth is reached.
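
A compact sketch of this greedy recursion for categorical features and integer labels (a simplified illustration in the spirit of ID3, not the original implementation; all names and the toy data are mine):

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(x, y):
    return entropy(y) - sum(
        np.mean(x == v) * entropy(y[x == v]) for v in np.unique(x)
    )

def id3(X, y, features, depth=0, max_depth=3):
    """X: dict of feature name -> np.array of categorical values; y: int labels."""
    # Stop when the node is pure, no features remain, or max depth is reached.
    if len(np.unique(y)) == 1 or not features or depth == max_depth:
        return {"leaf": int(np.bincount(y).argmax())}      # majority class
    # Greedy step: pick the feature with the highest Information Gain.
    best = max(features, key=lambda f: information_gain(X[f], y))
    node = {"feature": best, "children": {}}
    for v in np.unique(X[best]):
        mask = X[best] == v
        node["children"][v] = id3(
            {f: arr[mask] for f, arr in X.items()}, y[mask],
            [f for f in features if f != best], depth + 1, max_depth,
        )
    return node

# The "Windy" data from the walkthrough above, plus a second, weaker feature:
X = {"Windy": np.array(list("FFFFFFTTTT")),
     "Sunny": np.array(list("TFTFTFTFTF"))}
y = np.array([1, 1, 1, 1, 0, 0, 1, 0, 0, 0])
tree = id3(X, y, list(X))
print(tree["feature"])   # "Windy" wins the root split (IG 0.125 vs ~0.03)
```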

Gini Impurity vs Entropy

Scikit-Learn's CART algorithm uses Gini Impurity by default instead of Entropy. Why?

Entropy

-\sum_i p_i \log_2 p_i

Pro: Information-theoretically grounded.
Con: Slower (log computation).

Gini Impurity

1 - \sum_i p_i^2

Pro: Faster (only squaring, no logarithms).
Con: No advantage beyond speed; it selects the same splits as entropy roughly 95% of the time.

In practice, they produce nearly identical trees. Gini is default because it is slightly faster.
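
In scikit-learn the switch is the `criterion` parameter of `DecisionTreeClassifier`. A minimal comparison, using the bundled iris dataset purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X, y)
    print(f"{criterion:8s} depth={tree.get_depth()}  train acc={tree.score(X, y):.3f}")
```

On small datasets like this, the two criteria typically grow the same (or nearly the same) tree.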

Interactive: Gini vs Entropy Curves

Compare the shape of Gini Impurity and Entropy. Notice how Gini is just a quadratic approximation of Entropy.

(Interactive chart: Entropy H(p) peaks at 1.0 bit at p = 0.5, while Gini Impurity G(p) peaks at 0.5 at p = 0.5. The dashed green curve is Entropy scaled by 0.5; it almost perfectly overlaps Gini.)
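
The same curves can be reproduced with a few lines of matplotlib (a sketch for the binary case, where p is the probability of class 1):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)                      # probability of class 1
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # H(p), peaks at 1.0 bit
gini = 1 - (p**2 + (1 - p)**2)                          # G(p), peaks at 0.5

plt.plot(p, entropy, label="Entropy H(p)")
plt.plot(p, gini, label="Gini G(p)")
plt.plot(p, entropy / 2, "g--", label="Entropy / 2 (nearly overlaps Gini)")
plt.xlabel("p")
plt.ylabel("impurity")
plt.legend()
plt.show()
```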

Limitations & Gain Ratio

The "User ID" Problem

Information Gain is biased towards features with many unique values.

If you split on "User ID" (unique for everyone), each child has 1 sample with H = 0. Maximum IG! But this is useless for generalization.

Solution: Gain Ratio (C4.5)

GainRatio(S, A) = \frac{IG(S, A)}{SplitInfo(A)}, \qquad SplitInfo(A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}

SplitInfo is the entropy of the partition itself, i.e. the "intrinsic information" of the split. A feature that shatters the data into many tiny branches has a large SplitInfo, so dividing by it penalizes splits with many branches.
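
A sketch of Gain Ratio for a categorical feature, showing how SplitInfo penalizes an ID-like feature (the helper names and toy data are mine):

```python
import numpy as np

def entropy_from_counts(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(feature, labels):
    """C4.5-style Gain Ratio = Information Gain / Split Information."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    parent = entropy_from_counts(np.unique(labels, return_counts=True)[1])
    branch_sizes, weighted_children = [], 0.0
    for v in np.unique(feature):
        child = labels[feature == v]
        branch_sizes.append(len(child))
        weighted_children += len(child) / len(labels) * entropy_from_counts(
            np.unique(child, return_counts=True)[1])
    ig = parent - weighted_children
    split_info = entropy_from_counts(branch_sizes)   # entropy of the branch sizes
    return ig / split_info if split_info > 0 else 0.0

y       = np.array([0, 0, 0, 0, 1, 1, 1, 1])
user_id = np.arange(8)                     # unique value for every sample
windy   = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# user_id: IG = 1.0 but SplitInfo = 3 bits -> Gain Ratio ~0.33
# windy:   IG = 1.0 and SplitInfo = 1 bit  -> Gain Ratio 1.0
print(gain_ratio(user_id, y), gain_ratio(windy, y))
```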