Introduction
In geometry, we measure distance between points with Euclidean distance. But how do we measure the "distance" between two probability distributions?
For example, how different is one Gaussian from another? Or the distribution of words in Shakespeare versus a Reddit thread?
KL Divergence (Kullback-Leibler Divergence, also called Relative Entropy) quantifies this difference. It measures how much information is lost when we approximate one distribution with another.
Not a True Metric
While KL Divergence is often called a "distance," it doesn't satisfy all the properties of a mathematical distance metric (it's asymmetric and doesn't obey the triangle inequality). Think of it as a directed measure of divergence.
Intuition: Extra Bits
Imagine you're designing a compression code for English text. Optimal codes assign short representations to frequent letters ('e', 't') and long ones to rare letters ('z', 'q').
- P (True Distribution): The actual frequency of letters in English.
- Q (Model/Approximation): Your guess at the frequency (maybe based on French).
If you build your code based on Q but data comes from P, your code will be inefficient. You'll use long codes for letters that are actually common.
KL Divergence = Extra Bits
$D_{KL}(P \,\|\, Q)$ measures the extra bits you need to transmit because you used the wrong distribution Q instead of the true P. It quantifies the cost of your approximation.
The Formula
For discrete probability distributions P and Q:
$$D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
Critical Warning
If $P(x) > 0$ but $Q(x) = 0$, the KL divergence goes to infinity: the model Q declares an event impossible when it actually occurs. This is a fatal modeling error, and the infinite penalty is what drives the "zero-avoiding" behavior of forward KL: Q learns never to assign zero probability where P has mass.
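As a concrete reference, here is a minimal NumPy sketch of the discrete formula (the distributions are made-up examples), including the infinite-penalty case from the warning above:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions, in nats.

    Returns inf if Q assigns zero probability to an outcome that P does not.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # terms with P(x) = 0 contribute nothing
    if np.any(q[mask] == 0):
        return np.inf                  # Q calls a possible event impossible
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.1, 0.4, 0.5]                    # "true" distribution (illustrative)
q = [0.2, 0.3, 0.5]                    # model's approximation (illustrative)
print(kl_divergence(p, q))             # small positive value, ~0.046 nats
print(kl_divergence([0.5, 0.5, 0.0], [0.5, 0.0, 0.5]))  # inf: Q(x) = 0 where P(x) > 0
```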
For Continuous Distributions
The sum becomes an integral: $D_{KL}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$. For two Gaussians this integral has a closed-form solution,
$$D_{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2},$$
making KL Divergence computationally efficient.
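A small sketch of that closed form, cross-checked against direct numerical integration (the means and standard deviations below are illustrative, not the values from the interactive demo):

```python
import numpy as np
from scipy import integrate, stats

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)), in nats."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

# Illustrative parameters
mu1, sigma1 = 0.0, 1.0
mu2, sigma2 = 1.0, 1.5

# Cross-check against numerical integration of p(x) * log(p(x) / q(x))
p = stats.norm(mu1, sigma1)
q = stats.norm(mu2, sigma2)
integrand = lambda x: p.pdf(x) * (p.logpdf(x) - q.logpdf(x))
numeric, _ = integrate.quad(integrand, -20, 20)

print(kl_gaussians(mu1, sigma1, mu2, sigma2))  # ~0.3499 nats
print(numeric)                                 # matches the closed form closely
```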
Interactive: KL Between Gaussians
Explore how KL divergence changes as you move the model distribution Q away from the true distribution P.
Asymmetry: Not a True Distance
In geometry, distance is symmetric: Distance(A to B) = Distance(B to A). KL Divergence is NOT symmetric: in general, $D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$.
This asymmetry is not a bug. It reflects a real difference: "How much does Q fail to capture P?" is a different question from "How much does P fail to capture Q?"
Worked Example
Let P be a fair coin ($P(\text{heads}) = 0.5$) and Q a biased coin.
Forward KL: $D_{KL}(P \,\|\, Q)$ weights the log-ratio of the two distributions by the fair coin's probabilities.
Reverse KL: $D_{KL}(Q \,\|\, P)$ weights it by the biased coin's probabilities.
- The two values disagree: the direction matters! (See the numerical sketch below.)
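The example's exact bias isn't shown here, so the sketch below assumes a 90/10 biased coin purely for illustration and computes both directions numerically:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions, in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.5])   # fair coin
q = np.array([0.9, 0.1])   # biased coin (bias chosen for illustration)

forward = kl(p, q)          # D_KL(P || Q)
reverse = kl(q, p)          # D_KL(Q || P)
print(f"Forward KL: {forward:.4f} nats ({forward / np.log(2):.4f} bits)")
print(f"Reverse KL: {reverse:.4f} nats ({reverse / np.log(2):.4f} bits)")
# Forward ~0.5108 nats, reverse ~0.3681 nats: the two directions disagree.
```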
What are "nats"?
Nats (natural units) measure information using the natural logarithm $\ln$ (base $e$). Bits use $\log_2$. To convert: $1 \text{ nat} = 1/\ln 2 \approx 1.443$ bits. Deep learning frameworks typically use nats because $\ln$ has simpler derivatives.
Forward vs Reverse KL
The choice between $D_{KL}(P \,\|\, Q)$ (Forward) and $D_{KL}(Q \,\|\, P)$ (Reverse) has major implications for how your model behaves.
Forward KL: $D_{KL}(P \,\|\, Q)$
Mean-Seeking / Moment-Matching
Q must cover all the mass of P. If P is multimodal, Q will spread out to cover all modes, potentially with density in between.
Used in: VAEs, Maximum Likelihood, Expectation Maximization
Reverse KL: $D_{KL}(Q \,\|\, P)$
Mode-Seeking / Zero-Forcing
Q avoids placing mass where P is zero. If P is multimodal, Q will collapse to cover only ONE mode, ignoring the rest.
Used in: Variational Inference, Some GANs, Policy Gradients
Multimodal Example
If P has two peaks (bimodal), Forward KL makes Q spread across both peaks (blurry, inclusive). Reverse KL makes Q collapse to just one peak (sharp but missing mass).
Interactive: Forward vs Reverse
See how the direction of KL Divergence dramatically changes model behavior for multimodal distributions.
Zero-Avoiding (Mass Covering)
Forward KL penalizes Q if P(x) > 0 but Q(x) ≈ 0. To avoid infinite penalty, Q stretches to cover ALL of P's support, often becoming blurry.
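For a non-interactive version of the same idea, the sketch below fits a single Gaussian Q to an assumed bimodal mixture P (equal-weight modes at ±3) by brute-force grid search, once minimizing forward KL and once minimizing reverse KL on a discretized grid:

```python
import numpy as np

# Discretize the real line; P is an assumed equal-weight mixture of two
# Gaussians (modes at -3 and +3, std 0.7) standing in for a bimodal target.
x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * gaussian_pdf(x, -3, 0.7) + 0.5 * gaussian_pdf(x, 3, 0.7)

def kl(a, b):
    """Numerical D_KL(a || b) on the grid, in nats (inf if b misses a's support)."""
    mask = a > 0
    if np.any(b[mask] == 0):
        return np.inf
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

# Brute-force search over single Gaussians Q(mu, sigma) under each objective.
best = {"forward": (np.inf, None), "reverse": (np.inf, None)}
for mu in np.linspace(-4, 4, 81):
    for sigma in np.linspace(0.3, 4, 75):
        q = gaussian_pdf(x, mu, sigma)
        for name, d in (("forward", kl(p, q)), ("reverse", kl(q, p))):
            if d < best[name][0]:
                best[name] = (d, (mu, sigma))

for name, (d, (mu, sigma)) in best.items():
    print(f"{name:7s} KL: mu={mu:+.2f}, sigma={sigma:.2f}, KL={d:.3f} nats")
# Forward KL lands near mu = 0 with a wide sigma (covers both modes);
# reverse KL locks onto one mode (mu near +3 or -3) with a narrow sigma.
```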
Relation to Entropy
KL Divergence connects beautifully to Entropy and Cross-Entropy:
KL Divergence = Cross-Entropy - Entropy:
$$D_{KL}(P \,\|\, Q) = H(P, Q) - H(P)$$
Since H(P) is constant for fixed data, minimizing Cross-Entropy is equivalent to minimizing KL Divergence. This is why Cross-Entropy loss makes models learn to match the data distribution.
Why This Connection Matters
In classification, we don't actually need to compute H(P) because it's constant. We just minimize H(P,Q), which implicitly minimizes the divergence between our model and the true distribution!
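A quick numerical check of the decomposition with made-up distributions P (labels) and Q (model predictions):

```python
import numpy as np

# Assumed toy distributions over 4 classes
p = np.array([0.1, 0.2, 0.3, 0.4])    # "true" label distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # model's prediction

entropy       = -np.sum(p * np.log(p))      # H(P)
cross_entropy = -np.sum(p * np.log(q))      # H(P, Q)
kl            =  np.sum(p * np.log(p / q))  # D_KL(P || Q)

print(cross_entropy, entropy + kl)   # equal up to rounding: H(P, Q) = H(P) + D_KL(P || Q)
print(cross_entropy >= entropy)      # True: cross-entropy is lower-bounded by H(P)
```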
Interactive: Entropy Relationship
Explore the decomposition: Cross-Entropy = Entropy + KL Divergence.
The Information Theory Identity
Visualizing $H(P, Q) = H(P) + D_{KL}(P \,\|\, Q)$.
Key Insight: Since H(P) is fixed by the data, minimizing Cross-Entropy is mathematically identical to minimizing KL Divergence.
This is why training a neural net with Cross-Entropy loss makes it learn the true distribution!
Whenever Q differs from P, the model is inefficient: Cross-Entropy is higher than its theoretical minimum, the Entropy $H(P)$.
Key Properties of KL Divergence
1. Non-Negativity (Gibbs' Inequality)
KL Divergence is always non-negative, $D_{KL}(P \,\|\, Q) \ge 0$, and equals zero if and only if $P = Q$ everywhere.
Why? Since $\log$ is concave, Jensen's Inequality gives:
$$-D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{Q(x)}{P(x)} \le \log \sum_x P(x) \frac{Q(x)}{P(x)} = \log \sum_x Q(x) = \log 1$$
Since $\log 1 = 0$, we have $D_{KL}(P \,\|\, Q) \ge 0$.
ML implication: You can never have "negative divergence." If your loss is negative, there's a bug in your code.
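A quick empirical sanity check of Gibbs' inequality on random distribution pairs (the dimension and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Random 5-outcome distributions drawn from a Dirichlet
divs = [kl(rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5)))
        for _ in range(10_000)]

print(min(divs))            # positive for every random pair
p = rng.dirichlet(np.ones(5))
print(kl(p, p))             # exactly 0.0 when Q = P
```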
2. Asymmetry (Not a Metric)
Unlike Euclidean distance, KL Divergence is not symmetric. The "distance" from P to Q differs from Q to P.
Forward KL: $D_{KL}(P \,\|\, Q)$
"How well does Q explain P?"
Reverse KL: $D_{KL}(Q \,\|\, P)$
"How well does P explain Q?"
Why it fails as a metric: A true metric requires symmetry ($d(P, Q) = d(Q, P)$) and the triangle inequality ($d(P, R) \le d(P, Q) + d(Q, R)$). KL satisfies neither.
3. Convexity
$D_{KL}(P \,\|\, Q)$ is jointly convex in the pair $(P, Q)$. For any $\lambda \in [0, 1]$:
$$D_{KL}\big(\lambda P_1 + (1-\lambda) P_2 \,\big\|\, \lambda Q_1 + (1-\lambda) Q_2\big) \le \lambda D_{KL}(P_1 \,\|\, Q_1) + (1-\lambda) D_{KL}(P_2 \,\|\, Q_2)$$
Mixing distributions never increases the divergence. This property helps ensure (a numerical check follows the list below):
- Gradient descent on Q (for fixed P) can reach the global minimum
- No spurious local minima when optimizing directly over the distribution Q, although parameterizing Q can reintroduce non-convexity, e.g. in variational inference
- Optimization is computationally tractable
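A small numerical check of the convexity inequality on random distributions and mixing weights (dimensions and counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Check D_KL(lam*P1 + (1-lam)*P2 || lam*Q1 + (1-lam)*Q2)
#       <= lam*D_KL(P1 || Q1) + (1-lam)*D_KL(P2 || Q2)
violations = 0
for _ in range(10_000):
    p1, p2, q1, q2 = (rng.dirichlet(np.ones(4)) for _ in range(4))
    lam = rng.uniform()
    lhs = kl(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)
    rhs = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
    violations += lhs > rhs + 1e-12    # tolerance for floating-point error
print(violations)                       # 0
```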
4. Additive for Independent Variables
If $x_1$ and $x_2$ are independent under both P and Q:
$$D_{KL}\big(P(x_1, x_2) \,\|\, Q(x_1, x_2)\big) = D_{KL}\big(P(x_1) \,\|\, Q(x_1)\big) + D_{KL}\big(P(x_2) \,\|\, Q(x_2)\big)$$
Derivation: independence means the joints factorize, so the log-ratio splits into a sum, $\log \frac{P(x_1) P(x_2)}{Q(x_1) Q(x_2)} = \log \frac{P(x_1)}{Q(x_1)} + \log \frac{P(x_2)}{Q(x_2)}$, and taking the expectation under P gives one KL term per variable.
ML implication: For high-dimensional data with independent features, you can compute KL divergence feature-by-feature and sum them up.
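A short check with assumed marginals: building the joints as outer products (independence) and comparing the joint KL to the sum of marginal KLs:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Assumed marginals for two independent variables x1 (3 outcomes) and x2 (2 outcomes)
p1, q1 = np.array([0.2, 0.3, 0.5]), np.array([0.3, 0.3, 0.4])
p2, q2 = np.array([0.6, 0.4]),      np.array([0.5, 0.5])

# Independence means the joints are outer products of the marginals
p_joint = np.outer(p1, p2).ravel()
q_joint = np.outer(q1, q2).ravel()

print(kl(p_joint, q_joint))         # equals the sum below (up to rounding)
print(kl(p1, q1) + kl(p2, q2))
```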
5. Invariance Under Reparameterization
For any invertible transformation $y = f(x)$:
$$D_{KL}\big(P_X \,\|\, Q_X\big) = D_{KL}\big(P_Y \,\|\, Q_Y\big)$$
KL Divergence only depends on the probability values, not how the space is parameterized. Scaling, rotating, or nonlinearly transforming your features does not change the divergence.
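A grid-based sketch of why this holds for the nonlinear map $y = e^x$ (the Gaussian parameters are illustrative): the Jacobian factors that appear in the transformed densities cancel inside the log-ratio, so the two computations give the same number.

```python
import numpy as np

# P and Q are Gaussian densities on a grid; y = exp(x) is an invertible map.
x = np.linspace(-10, 10, 20_001)
dx = x[1] - x[0]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p_x = normal_pdf(x, 0.0, 1.0)
q_x = normal_pdf(x, 0.5, 1.2)

# Change of variables: densities of y = exp(x) pick up a Jacobian factor 1/y,
# which cancels inside log(p_y / q_y); the grid spacing transforms as dy = y * dx.
y = np.exp(x)
p_y, q_y = p_x / y, q_x / y
dy = y * dx

kl_x = np.sum(p_x * np.log(p_x / q_x)) * dx
kl_y = np.sum(p_y * np.log(p_y / q_y) * dy)
print(kl_x, kl_y)   # same value: the divergence ignores how the space is parameterized
```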
ML Applications
Variational Autoencoders (VAEs)
The VAE loss has two parts: Reconstruction Loss + KL Divergence. The KL term forces the learned latent distribution q(z|x) to be close to a standard Normal prior N(0, I).
This regularization ensures smooth, interpretable latent spaces for generation.
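A minimal sketch of the KL term alone, using the standard closed form for a diagonal-Gaussian posterior against a standard Normal prior (the encoder outputs below are made up):

```python
import numpy as np

def vae_kl(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims, in nats.

    Closed form per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2).
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# Illustrative encoder outputs for one sample with a 4-dimensional latent space
mu      = np.array([ 0.3, -0.1, 0.8,  0.0])
log_var = np.array([-0.2,  0.1, 0.0, -0.5])

print(vae_kl(mu, log_var))               # > 0: penalty for straying from N(0, I)
print(vae_kl(np.zeros(4), np.zeros(4)))  # 0.0: no penalty when q(z|x) = N(0, I)
```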
t-SNE Visualization
t-SNE minimizes the KL Divergence between pairwise similarity distributions computed in the high-dimensional space and in the low-dimensional embedding. This preserves local structure while reducing dimensions.
Critical for visualizing embeddings, gene expression data, and neural network activations.
Knowledge Distillation
Training a small "student" model to mimic a large "teacher" model. The loss is the KL Divergence between the teacher's softmax outputs (with temperature) and the student's outputs.
Used to compress BERT → DistilBERT, reducing size by 40% with 97% performance.
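A rough NumPy sketch of a distillation loss of this kind (the temperature, logits, and the T² rescaling convention are illustrative assumptions; real pipelines use their framework's built-in ops):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL( teacher's softened softmax || student's softened softmax ), in nats.

    The T^2 factor is a common rescaling so gradients stay comparable across temperatures.
    """
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return temperature**2 * np.sum(t * np.log(t / s))

# Made-up logits over 5 classes
teacher = np.array([4.0, 1.5, 0.2, -1.0, -2.0])
student = np.array([3.0, 2.5, 0.0, -0.5, -1.5])
print(distillation_loss(teacher, student))   # shrinks as the student mimics the teacher
```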
Reinforcement Learning (TRPO/PPO)
Trust Region Policy Optimization constrains policy updates so the new policy doesn't diverge too much from the old policy, measured by KL Divergence. PPO uses a clipped surrogate objective.
Powers OpenAI's robotic control, Dota 2 agents, and ChatGPT's RLHF training.
Generative Adversarial Networks (GANs)
Some GAN formulations (f-GAN) use KL or reverse KL divergence. The choice affects whether the generator covers all modes (forward KL) or focuses on high-quality single modes (reverse KL).
Explains mode collapse: reverse KL encourages sharp, single-mode outputs.
Bayesian Model Selection
KL Divergence measures how well an approximate posterior matches the true posterior in variational Bayes. Also used in Akaike Information Criterion (AIC) for model comparison.
Theoretical foundation for choosing between competing statistical models.