Information Theory


Information Gain

The metric that tells a Decision Tree the best question to ask next. The practical application of entropy.

Introduction

Building a Decision Tree involves asking a sequence of questions to split your data. But which question should you ask first?

Should you ask "Is it raining?" or "Is it humid?" first? To answer this, we need a way to measure how much "cleaner" (purer) the data becomes after a split.

Information Gain (IG) is exactly this measure. It quantifies the reduction in entropy (uncertainty) achieved by splitting on a specific feature. It is essentially Mutual Information between the feature and the label.

Intuition: The 20 Questions Game

Imagine playing "20 Questions." Your goal is to identify a secret object with yes/no questions.

Bad Question

"Is the object a banana?"

99% chance the answer is "No." You are left with almost the same uncertainty. Low Information Gain.

Good Question

"Is the object alive?"

Splits the world roughly in half. Regardless of the answer, you have eliminated 50% of possibilities. High Information Gain.

Decision Trees use Information Gain to find the "Is it alive?" questions, not the "Is it a banana?" questions.

Recap: Entropy as Impurity

Before we split, we calculate the entropy of the current dataset (the parent node). Entropy measures "impurity" or "disorder."

H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)
Pure (H = 0): all samples belong to the same class.
Maximally impure (H = 1): a 50/50 split between two classes.
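
As a quick sanity check, here is a minimal Python sketch of this impurity measure (the helper name `entropy` and the use of NumPy are choices made for this example, not part of the module):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    h = -np.sum(p * np.log2(p))
    return float(h) if h > 0 else 0.0   # guard against -0.0 for pure nodes

print(entropy([1, 1, 1, 1]))   # 0.0    -> pure node
print(entropy([1, 0, 1, 0]))   # 1.0    -> maximally impure (50/50)
print(entropy([1, 1, 1, 0]))   # ~0.811 -> mostly one class
```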

The Information Gain Formula

Information Gain = Entropy Before - Entropy After.

Since a split creates multiple child nodes, "Entropy After" is the weighted average of child entropies.

IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)
  • H(S): entropy of the parent node.
  • \frac{|S_v|}{|S|} (weighted sum): large children matter more than small ones.
  • H(S_v): entropy of child node v.
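
A direct translation of this formula into Python might look like the following sketch (function names are mine; it assumes a categorical feature):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    h = -np.sum(p * np.log2(p))
    return float(h) if h > 0 else 0.0

def information_gain(feature, labels):
    """IG(S, A) = H(S) minus the size-weighted entropy of the children
    obtained by splitting on each distinct value of `feature`."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    weighted_children = sum(
        np.mean(feature == v) * entropy(labels[feature == v])
        for v in np.unique(feature)
    )
    return entropy(labels) - weighted_children
```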

Interactive: Split Visualization

Drag the split point to see how Information Gain changes. The best split creates the purest child nodes.
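
In code, dragging the split point corresponds to sweeping candidate thresholds and recomputing IG at each one. A small sketch with made-up data (the toy arrays below are illustrative, not from this module):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    h = -np.sum(p * np.log2(p))
    return float(h) if h > 0 else 0.0

# Toy 1-D feature: small x is mostly class 0, large x is mostly class 1.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

parent = entropy(y)
for t in np.arange(1.5, 10.0, 1.0):   # candidate split points
    left, right = y[x <= t], y[x > t]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    print(f"split at x <= {t:3.1f}: IG = {parent - weighted:.3f}")
```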

Recursive Partitioning

Build a Depth-2 Decision Tree. First split the root, then split the children.

(Interactive widget: choose a split for the root node, then for the left and right children. The total Information Gain displayed is the sum of the IG at each active node.)

Step-by-Step Calculation

Dataset: 10 examples, 5 Positive (+), 5 Negative (-). Testing feature "Windy" (True/False).

Step 1: Parent Entropy

H(\text{Parent}) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1.0 \text{ bit}

Step 2: Split Data

  • Windy=False (6 samples): 4 Pos, 2 Neg
  • Windy=True (4 samples): 1 Pos, 3 Neg

Step 3: Child Entropies

Left child (Windy=False):

H = -\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6} \approx 0.918

Right child (Windy=True):

H = -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} \approx 0.811

Step 4: Information Gain

IG = 1.0 - \left[ \frac{6}{10} \times 0.918 + \frac{4}{10} \times 0.811 \right]

IG = 1.0 - 0.875 = 0.125 \text{ bits}
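
The same arithmetic can be double-checked in a few lines of Python (just a verification of the numbers above; the helper `H` is mine):

```python
import numpy as np

def H(pos, neg):
    """Binary entropy in bits from class counts."""
    p = np.array([pos, neg], dtype=float) / (pos + neg)
    p = p[p > 0]                      # skip empty classes (0 log 0 = 0)
    return float(-np.sum(p * np.log2(p)))

parent = H(5, 5)                      # 1.0 bit
left, right = H(4, 2), H(1, 3)        # ~0.918 and ~0.811
ig = parent - (6/10 * left + 4/10 * right)
print(round(ig, 3))                   # 0.125
```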

The ID3 Algorithm

ID3 (Iterative Dichotomiser 3) uses Information Gain to build trees greedily:

  1. Calculate H(S) for the current node.
  2. For every possible feature A, calculate IG(S, A).
  3. Select the feature with the highest Information Gain.
  4. Split the data using that feature.
  5. Recurse on each child until nodes are pure or max depth is reached.
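
A compact sketch of this greedy recursion for categorical features and integer labels (a simplified illustration in the spirit of ID3, not the original implementation; all names and the toy data are mine):

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(x, y):
    return entropy(y) - sum(
        np.mean(x == v) * entropy(y[x == v]) for v in np.unique(x)
    )

def id3(X, y, features, depth=0, max_depth=3):
    """X: dict of feature name -> np.array of categorical values; y: int labels."""
    # Stop when the node is pure, no features remain, or max depth is reached.
    if len(np.unique(y)) == 1 or not features or depth == max_depth:
        return {"leaf": int(np.bincount(y).argmax())}      # majority class
    # Greedy step: pick the feature with the highest Information Gain.
    best = max(features, key=lambda f: information_gain(X[f], y))
    node = {"feature": best, "children": {}}
    for v in np.unique(X[best]):
        mask = X[best] == v
        node["children"][v] = id3(
            {f: arr[mask] for f, arr in X.items()}, y[mask],
            [f for f in features if f != best], depth + 1, max_depth,
        )
    return node

# The "Windy" data from the walkthrough above, plus a second, weaker feature:
X = {"Windy": np.array(list("FFFFFFTTTT")),
     "Sunny": np.array(list("TFTFTFTFTF"))}
y = np.array([1, 1, 1, 1, 0, 0, 1, 0, 0, 0])
tree = id3(X, y, list(X))
print(tree["feature"])   # "Windy" wins the root split (IG 0.125 vs ~0.03)
```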

Gini Impurity vs Entropy

Scikit-Learn's CART algorithm uses Gini Impurity by default instead of Entropy. Why?

Entropy

-\sum_i p_i \log_2 p_i

Pro: Information-theoretically grounded.
Con: Slower (log computation).

Gini Impurity

1 - \sum_i p_i^2

Pro: Faster (only squaring, no logarithms).
Con: No advantage beyond speed; it selects the same splits as entropy roughly 95% of the time.

In practice, they produce nearly identical trees. Gini is default because it is slightly faster.
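
In scikit-learn the switch is the `criterion` parameter of `DecisionTreeClassifier`. A minimal comparison, using the bundled iris dataset purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X, y)
    print(f"{criterion:8s} depth={tree.get_depth()}  train acc={tree.score(X, y):.3f}")
```

On small datasets like this, the two criteria typically grow the same (or nearly the same) tree.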

Interactive: Gini vs Entropy Curves

Compare the shape of Gini Impurity and Entropy. Notice how Gini is just a quadratic approximation of Entropy.

(Interactive chart: Entropy H(p) peaks at 1.0 bit at p = 0.5, while Gini Impurity G(p) peaks at 0.5 at p = 0.5. The dashed green curve is Entropy scaled by 0.5; it almost perfectly overlaps Gini.)
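
The same curves can be reproduced with a few lines of matplotlib (a sketch for the binary case, where p is the probability of class 1):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)                      # probability of class 1
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # H(p), peaks at 1.0 bit
gini = 1 - (p**2 + (1 - p)**2)                          # G(p), peaks at 0.5

plt.plot(p, entropy, label="Entropy H(p)")
plt.plot(p, gini, label="Gini G(p)")
plt.plot(p, entropy / 2, "g--", label="Entropy / 2 (nearly overlaps Gini)")
plt.xlabel("p")
plt.ylabel("impurity")
plt.legend()
plt.show()
```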

Limitations & Gain Ratio

The "User ID" Problem

Information Gain is biased towards features with many unique values.

If you split on "User ID" (unique for everyone), each child has 1 sample with H = 0. Maximum IG! But this is useless for generalization.

Solution: Gain Ratio (C4.5)

GainRatio(S, A) = \frac{IG(S, A)}{SplitInfo(A)}, \qquad SplitInfo(A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}

SplitInfo is the entropy of the partition itself, i.e. the "intrinsic information" of the split. A feature that shatters the data into many tiny branches has a large SplitInfo, so dividing by it penalizes splits with many branches.
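
A sketch of Gain Ratio for a categorical feature, showing how SplitInfo penalizes an ID-like feature (the helper names and toy data are mine):

```python
import numpy as np

def entropy_from_counts(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(feature, labels):
    """C4.5-style Gain Ratio = Information Gain / Split Information."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    parent = entropy_from_counts(np.unique(labels, return_counts=True)[1])
    branch_sizes, weighted_children = [], 0.0
    for v in np.unique(feature):
        child = labels[feature == v]
        branch_sizes.append(len(child))
        weighted_children += len(child) / len(labels) * entropy_from_counts(
            np.unique(child, return_counts=True)[1])
    ig = parent - weighted_children
    split_info = entropy_from_counts(branch_sizes)   # entropy of the branch sizes
    return ig / split_info if split_info > 0 else 0.0

y       = np.array([0, 0, 0, 0, 1, 1, 1, 1])
user_id = np.arange(8)                     # unique value for every sample
windy   = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# user_id: IG = 1.0 but SplitInfo = 3 bits -> Gain Ratio ~0.33
# windy:   IG = 1.0 and SplitInfo = 1 bit  -> Gain Ratio 1.0
print(gain_ratio(user_id, y), gain_ratio(windy, y))
```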