Information Theory


Kullback-Leibler Divergence

Measuring how one probability distribution differs from another. The asymmetric "distance" that powers VAEs, GANs, and t-SNE.

Introduction

In geometry, we measure distance between points with Euclidean distance. But how do we measure the "distance" between two probability distributions?

For example, how different is a Gaussian N(0, 1) from N(2, 2)? Or the word distribution in Shakespeare versus a Reddit thread?

KL Divergence (Kullback-Leibler Divergence, also called Relative Entropy) quantifies this difference. It measures how much information is lost when we approximate one distribution with another.

Not a True Metric

While KL Divergence is often called a "distance," it doesn't satisfy all the properties of a mathematical distance metric (it's asymmetric and doesn't obey the triangle inequality). Think of it as a directed measure of divergence.

Intuition: Extra Bits

Imagine you're designing a compression code for English text. Optimal codes assign short representations to frequent letters ('e', 't') and long ones to rare letters ('z', 'q').

  • P (True Distribution): The actual frequency of letters in English.
  • Q (Model/Approximation): Your guess at the frequency (maybe based on French).

If you build your code based on Q but data comes from P, your code will be inefficient. You'll use long codes for letters that are actually common.

KL Divergence = Extra Bits

D_{KL}(P || Q) measures the extra bits you need to transmit because you used the wrong distribution Q instead of the true P. It quantifies the cost of your approximation.

The Formula

For discrete probability distributions P and Q:

D_{KL}(P || Q) = \sum_{x} P(x) \log \left( \frac{P(x)}{Q(x)} \right)
= \sum_{x} P(x) \left[ \log P(x) - \log Q(x) \right]

Critical Warning

If P(x) > 0 but Q(x) = 0, the KL divergence goes to infinity: the model Q treats an event as impossible when it actually occurs. This is a fatal modeling error, and it is why minimizing forward KL forces Q to avoid zeros wherever P has support (the "zero-avoiding" behavior discussed later).
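As a minimal sketch (plain Python/NumPy; the function name is illustrative, not from a particular library), the discrete formula and the zero-probability failure mode look like this:

import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) in nats.

    p, q: probability vectors over the same outcomes (each sums to 1).
    Outcomes with p[i] == 0 contribute nothing.  If q[i] == 0 while
    p[i] > 0, the result is infinite, exactly the warning above.
    """
    total = 0.0
    for pi, qi in zip(np.asarray(p, float), np.asarray(q, float)):
        if pi == 0.0:
            continue                 # 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return float("inf")      # P(x) > 0 but Q(x) = 0
        total += pi * np.log(pi / qi)
    return total

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # ~0.511 nats
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))  # inf: Q calls a possible event impossible

For reference, scipy.stats.entropy(p, q) computes the same sum if SciPy is available.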

For Continuous Distributions

D_{KL}(P || Q) = \int P(x) \log \left( \frac{P(x)}{Q(x)} \right) dx

For Gaussians, there's a closed-form solution, making KL Divergence computationally efficient.
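For two univariate Gaussians the closed form is

D_{KL}\big(N(\mu_1, \sigma_1^2) \,||\, N(\mu_2, \sigma_2^2)\big) = \ln\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

A minimal NumPy sketch of that formula (treating the second argument of N(·, ·) as a standard deviation; the function name is illustrative):

import numpy as np

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) in nats."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

print(kl_gaussians(0, 1, 0, 1))  # 0.0: identical distributions
print(kl_gaussians(0, 1, 2, 2))  # ~0.82 nats: N(0,1) vs N(2,2) from the introduction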

Interactive: KL Between Gaussians

Explore how KL divergence changes as you move the model distribution Q away from the true distribution P.

In the widget, P is fixed at N(0, 1) and the model is Q = N(μ, σ); sliders control Q's mean (shift) and standard deviation (spread), and the KL(P || Q) readout shows the information lost when approximating P with Q.

Asymmetry: Not a True Distance

In geometry, distance is symmetric: Distance(A to B) = Distance(B to A). KL Divergence is NOT symmetric.

D_{KL}(P || Q) \neq D_{KL}(Q || P)

This asymmetry is not a bug. It reflects a real difference: "How much does Q fail to capture P?" is a different question from "How much does P fail to capture Q?"

Worked Example

Let P = [0.5, 0.5] (fair coin) and Q = [0.9, 0.1] (biased coin).

Forward KL: D_{KL}(P || Q)

= 0.5 \cdot \ln\left(\frac{0.5}{0.9}\right) + 0.5 \cdot \ln\left(\frac{0.5}{0.1}\right)
= 0.5 \cdot (-0.588) + 0.5 \cdot (1.609)
= 0.511 \text{ nats}

Reverse KL: D_{KL}(Q || P)

= 0.9 \cdot \ln\left(\frac{0.9}{0.5}\right) + 0.1 \cdot \ln\left(\frac{0.1}{0.5}\right)
= 0.9 \cdot (0.588) + 0.1 \cdot (-1.609)
= 0.368 \text{ nats}

0.511 ≠ 0.368: the direction matters!
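A quick numeric check of these two values in NumPy:

import numpy as np

p = np.array([0.5, 0.5])   # fair coin
q = np.array([0.9, 0.1])   # biased coin

forward = np.sum(p * np.log(p / q))   # D_KL(P || Q)
reverse = np.sum(q * np.log(q / p))   # D_KL(Q || P)
print(forward, reverse)               # ~0.511 nats, ~0.368 nats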

What are "nats"?

Nats (natural units) measure information using ln (natural log, base e). Bits use log₂. To convert: 1 nat = 1/ln(2) ≈ 1.44 bits. Deep learning frameworks typically use nats because ln has simpler derivatives.

Forward vs Reverse KL

The choice between D_{KL}(P || Q) (Forward) and D_{KL}(Q || P) (Reverse) has major implications for how your model behaves.

Forward KL: DKL(P || Q)

Mean-Seeking / Moment-Matching

Q must cover all the mass of P. If P is multimodal, Q will spread out to cover all modes, potentially with density in between.

Used in: VAEs, Maximum Likelihood, Expectation Maximization

Reverse KL: DKL(Q || P)

Mode-Seeking / Zero-Forcing

Q avoids placing mass where P is zero. If P is multimodal, Q will collapse to cover only ONE mode, ignoring the rest.

Used in: Variational Inference, Some GANs, Policy Gradients

Multimodal Example

If P has two peaks (bimodal), Forward KL makes Q spread across both peaks (blurry, inclusive). Reverse KL makes Q collapse to just one peak (sharp but missing mass).
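The following sketch makes this concrete on a 1-D grid: a single Gaussian Q is fitted to a bimodal P by brute-force search, once minimizing forward KL and once minimizing reverse KL (the grid setup and all names are illustrative):

import numpy as np

# Discretized bimodal target P: two narrow Gaussians at -3 and +3
x = np.linspace(-8, 8, 2001)

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * gauss(x, -3, 0.6) + 0.5 * gauss(x, 3, 0.6)
p /= p.sum()                     # normalize on the grid

def kl(a, b, eps=1e-300):
    return np.sum(a * np.log((a + eps) / (b + eps)))

best_fwd, best_rev = None, None
for mu in np.linspace(-4, 4, 81):
    for sigma in np.linspace(0.3, 4, 75):
        q = gauss(x, mu, sigma)
        q /= q.sum()
        fwd, rev = kl(p, q), kl(q, p)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print("Forward KL optimum (mu, sigma):", best_fwd[1:])  # mu near 0, large sigma: spreads over both peaks
print("Reverse KL optimum (mu, sigma):", best_rev[1:])  # mu near -3 or +3, small sigma: locks onto one peak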

Interactive: Forward vs Reverse

See how the direction of KL Divergence dramatically changes model behavior for multimodal distributions.

Forward vs Reverse KL: two ways to fit a distribution. In the widget, P (true) is complex (bimodal) and Q (model) is simple (unimodal).

Forward KL is mean-seeking and zero-avoiding (mass covering): it penalizes Q if P(x) > 0 but Q(x) ≈ 0. To avoid an infinite penalty, Q stretches to cover ALL of P's support, often becoming blurry.

D_{KL}(P || Q) = \mathbb{E}_{x \sim P} \left[ \log \frac{P(x)}{Q(x)} \right]

Typical use cases: Variational Autoencoders (VAEs), Maximum Likelihood Estimation.

Relation to Entropy

KL Divergence connects beautifully to Entropy and Cross-Entropy:

D_{KL}(P || Q) = H(P, Q) - H(P)

KL Divergence = Cross-Entropy - Entropy

Since H(P) is constant for fixed data, minimizing Cross-Entropy is equivalent to minimizing KL Divergence. This is why Cross-Entropy loss makes models learn to match the data distribution.

Why This Connection Matters

In classification, we don't actually need to compute H(P) because it's constant. We just minimize H(P,Q), which implicitly minimizes the divergence between our model and the true distribution!

Interactive: Entropy Relationship

Explore the decomposition: Cross-Entropy = Entropy + KL Divergence.

The Information Theory Identity: visualizing H(P, Q) = H(P) + D_{KL}(P || Q).

In the widget, the true distribution is P = [0.70, 0.30] and the model distribution is Q = [0.50, 0.50]. The total cost (Cross-Entropy H(P, Q)) is 1.00 bit, the minimum cost (Entropy H(P)) is 0.88 bits, and the inefficiency (KL Divergence) is 0.12 wasted bits. The model is inefficient: Cross-Entropy is higher than the theoretical minimum (Entropy).

Key Insight: Since H(P) is fixed by the data, minimizing Cross-Entropy is mathematically identical to minimizing KL Divergence. This is why training a neural net with Cross-Entropy loss makes it learn the true distribution!
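A quick check of those widget numbers, with everything computed in bits (base-2 logs):

import numpy as np

p = np.array([0.70, 0.30])   # true distribution
q = np.array([0.50, 0.50])   # model distribution

entropy       = -np.sum(p * np.log2(p))      # H(P)          ~0.881 bits
cross_entropy = -np.sum(p * np.log2(q))      # H(P, Q)        1.000 bits
kl            = np.sum(p * np.log2(p / q))   # D_KL(P || Q)  ~0.119 bits

print(entropy, cross_entropy, kl)
print(np.isclose(cross_entropy, entropy + kl))   # True: H(P,Q) = H(P) + KL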

Key Properties of KL Divergence

1. Non-Negativity (Gibbs' Inequality)

D_{KL}(P || Q) \geq 0

KL Divergence is always non-negative, and equals zero if and only if P = Q everywhere.

Why? From Jensen's Inequality:

-D_{KL}(P || Q) = \sum_x P(x) \ln\left(\frac{Q(x)}{P(x)}\right)
\leq \ln\left(\sum_x P(x) \cdot \frac{Q(x)}{P(x)}\right) = \ln(1) = 0

Since -D_{KL} \leq 0, we have D_{KL} \geq 0.

ML implication: You can never have "negative divergence." If your loss is negative, there's a bug in your code.

2. Asymmetry (Not a Metric)

D_{KL}(P || Q) \neq D_{KL}(Q || P)

Unlike Euclidean distance, KL Divergence is not symmetric. The "distance" from P to Q differs from Q to P.

Forward KL

D_{KL}(P || Q)

"How well does Q explain P?"

Reverse KL

D_{KL}(Q || P)

"How well does P explain Q?"

Why it fails as a metric: A true metric requires symmetry (d(a, b) = d(b, a)) and the triangle inequality. KL satisfies neither.

3. Convexity

D_{KL}(P || Q) is convex in the pair (P, Q). For any λ ∈ [0, 1]:

D_{KL}(\lambda P_1 + (1-\lambda)P_2 \,||\, \lambda Q_1 + (1-\lambda)Q_2) \leq \lambda D_{KL}(P_1 || Q_1) + (1-\lambda) D_{KL}(P_2 || Q_2)

Mixing distributions reduces or maintains the divergence. This property ensures:

  • Gradient descent converges to the global minimum (for fixed P, when optimizing directly over Q)
  • No local minima traps in variational inference
  • Optimization is computationally tractable
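A small numeric illustration of the inequality above (arbitrary example distributions, λ = 0.3):

import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

p1, q1 = np.array([0.6, 0.4]), np.array([0.5, 0.5])
p2, q2 = np.array([0.2, 0.8]), np.array([0.3, 0.7])
lam = 0.3

lhs = kl(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)
rhs = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
print(lhs <= rhs)   # True: mixing never increases divergence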

4. Additive for Independent Variables

If X and Y are independent under both P and Q:

D_{KL}(P_{XY} || Q_{XY}) = D_{KL}(P_X || Q_X) + D_{KL}(P_Y || Q_Y)

Derivation:

D_{KL}(P_{XY} || Q_{XY}) = \sum_{x,y} P(x,y) \ln\frac{P(x,y)}{Q(x,y)}

Since the variables are independent, P(x, y) = P(x)P(y) and Q(x, y) = Q(x)Q(y), so

= \sum_{x,y} P(x)P(y) \ln\frac{P(x)P(y)}{Q(x)Q(y)}
= \sum_{x,y} P(x)P(y) \left[\ln\frac{P(x)}{Q(x)} + \ln\frac{P(y)}{Q(y)}\right]
= D_{KL}(P_X || Q_X) + D_{KL}(P_Y || Q_Y)

ML implication: For high-dimensional data with independent features, you can compute KL divergence feature-by-feature and sum them up.
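As a sketch, the property can be checked numerically by building a joint distribution from independent marginals (illustrative distributions):

import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Independent marginals under both P and Q
px, qx = np.array([0.7, 0.3]), np.array([0.6, 0.4])
py, qy = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])

# Joint distributions as outer products (independence)
pxy = np.outer(px, py)
qxy = np.outer(qx, qy)

print(kl(pxy, qxy))              # divergence of the joint
print(kl(px, qx) + kl(py, qy))   # sum of per-feature divergences: same value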

5. Invariance Under Reparameterization

For any invertible transformation y = f(x):

D_{KL}(P_X || Q_X) = D_{KL}(P_Y || Q_Y)

KL Divergence only depends on the probability values, not how the space is parameterized. Scaling, rotating, or nonlinearly transforming your features does not change the divergence.

ML Applications

Variational Autoencoders (VAEs)

The VAE loss has two parts: Reconstruction Loss + KL Divergence. The KL term forces the learned latent distribution q(z|x) to be close to a standard Normal prior N(0, I).

\mathcal{L} = -E_q[\log p(x|z)] + D_{KL}(q(z|x) || p(z))

This regularization ensures smooth, interpretable latent spaces for generation.
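For a diagonal-Gaussian encoder q(z|x) = N(μ, diag(σ²)) and prior p(z) = N(0, I), this KL term has a well-known closed form, 0.5 · Σ(σ² + μ² − 1 − ln σ²). A minimal NumPy sketch (function and variable names are illustrative; real VAEs compute this per batch element from the encoder's outputs):

import numpy as np

def kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ): the KL term of the VAE loss."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

mu = np.array([0.5, -0.2, 0.0])        # encoder means for one latent vector
log_var = np.array([0.1, -0.3, 0.0])   # encoder log-variances
print(kl_to_standard_normal(mu, log_var))   # 0 only when mu = 0 and log_var = 0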

t-SNE Visualization

t-SNE minimizes the KL Divergence between the distribution of pairwise distances in high-dimensional space and low-dimensional space. This preserves local structure while reducing dimensions.

Critical for visualizing embeddings, gene expression data, and neural network activations.

Knowledge Distillation

Training a small "student" model to mimic a large "teacher" model. The loss is the KL Divergence between the teacher's softmax outputs (with temperature) and the student's outputs.

Used to compress BERT → DistilBERT, reducing size by 40% with 97% performance.
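A minimal sketch of the distillation objective in plain NumPy (the logits and temperature value are illustrative; implementations commonly also scale this loss by T²):

import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_kl(teacher_logits, student_logits, T=2.0):
    """D_KL( teacher softened distribution || student softened distribution )."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return np.sum(p_teacher * np.log(p_teacher / p_student))

teacher = np.array([4.0, 1.5, 0.2])   # large "teacher" model's logits
student = np.array([3.0, 2.0, 0.5])   # small "student" model's logits
print(distillation_kl(teacher, student))  # the student is trained to drive this toward 0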

Reinforcement Learning (TRPO/PPO)

Trust Region Policy Optimization constrains policy updates so the new policy doesn't diverge too much from the old policy, measured by KL Divergence. PPO uses a clipped surrogate objective.

Powers OpenAI's robotic control, Dota 2 agents, and ChatGPT's RLHF training.
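For intuition, here is a sketch of the average KL between an old and a new categorical policy over a batch of states, the quantity a trust-region constraint or KL penalty keeps small (all names and numbers are illustrative):

import numpy as np

def mean_policy_kl(old_probs, new_probs):
    """Mean D_KL( pi_old(.|s) || pi_new(.|s) ) over a batch of states."""
    per_state = np.sum(old_probs * np.log(old_probs / new_probs), axis=1)
    return per_state.mean()

old = np.array([[0.7, 0.2, 0.1],
                [0.5, 0.3, 0.2]])    # action probabilities before the update
new = np.array([[0.6, 0.3, 0.1],
                [0.4, 0.4, 0.2]])    # action probabilities after the update
print(mean_policy_kl(old, new))      # small value: the update stayed close to the old policy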

Generative Adversarial Networks (GANs)

Some GAN formulations (f-GAN) use KL or reverse KL divergence. The choice affects whether the generator covers all modes (forward KL) or focuses on high-quality single modes (reverse KL).

Explains mode collapse: reverse KL encourages sharp, single-mode outputs.

Bayesian Model Selection

KL Divergence measures how well an approximate posterior matches the true posterior in variational Bayes. Also used in Akaike Information Criterion (AIC) for model comparison.

Theoretical foundation for choosing between competing statistical models.