Introduction
In geometry, we measure distance between points with Euclidean distance. But how do we measure the "distance" between two probability distributions?
For example, how different is one Gaussian from another? Or the distribution of words in Shakespeare versus a Reddit thread?
KL Divergence (Kullback-Leibler Divergence, also called Relative Entropy) quantifies this difference. It measures how much information is lost when we approximate one distribution with another.
Not a True Metric
While KL Divergence is often called a "distance," it doesn't satisfy all the properties of a mathematical distance metric (it's asymmetric and doesn't obey the triangle inequality). Think of it as a directed measure of divergence.
Intuition: Extra Bits
Imagine you're designing a compression code for English text. Optimal codes assign short representations to frequent letters ('e', 't') and long ones to rare letters ('z', 'q').
- P (True Distribution): The actual frequency of letters in English.
- Q (Model/Approximation): Your guess at the frequency (maybe based on French).
If you build your code based on Q but data comes from P, your code will be inefficient. You'll use long codes for letters that are actually common.
KL Divergence = Extra Bits
$D_{KL}(P \,\|\, Q)$ measures the extra bits you need to transmit because you used the wrong distribution Q instead of the true P. It quantifies the cost of your approximation.
The Formula
For discrete probability distributions P and Q:
$$D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
Critical Warning
If $P(x) > 0$ but $Q(x) = 0$, the KL divergence goes to infinity: the model Q declares an event impossible when it actually occurs. This is a fatal modeling error, and the infinite penalty is what drives the "zero-avoiding" behavior of forward KL: Q learns never to assign zero probability where P has mass.
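As a concrete reference, here is a minimal NumPy sketch of the discrete formula (the distributions are made-up examples), including the infinite-penalty case from the warning above:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions, in nats.

    Returns inf if Q assigns zero probability to an outcome that P does not.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # terms with P(x) = 0 contribute nothing
    if np.any(q[mask] == 0):
        return np.inf                  # Q calls a possible event impossible
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.1, 0.4, 0.5]                    # "true" distribution (illustrative)
q = [0.2, 0.3, 0.5]                    # model's approximation (illustrative)
print(kl_divergence(p, q))             # small positive value, ~0.046 nats
print(kl_divergence([0.5, 0.5, 0.0], [0.5, 0.0, 0.5]))  # inf: Q(x) = 0 where P(x) > 0
```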
For Continuous Distributions
The sum becomes an integral: $D_{KL}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$. For two Gaussians this integral has a closed-form solution,
$$D_{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2},$$
making KL Divergence computationally efficient.
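A small sketch of that closed form, cross-checked against direct numerical integration (the means and standard deviations below are illustrative, not the values from the interactive demo):

```python
import numpy as np
from scipy import integrate, stats

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)), in nats."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

# Illustrative parameters
mu1, sigma1 = 0.0, 1.0
mu2, sigma2 = 1.0, 1.5

# Cross-check against numerical integration of p(x) * log(p(x) / q(x))
p = stats.norm(mu1, sigma1)
q = stats.norm(mu2, sigma2)
integrand = lambda x: p.pdf(x) * (p.logpdf(x) - q.logpdf(x))
numeric, _ = integrate.quad(integrand, -20, 20)

print(kl_gaussians(mu1, sigma1, mu2, sigma2))  # ~0.3499 nats
print(numeric)                                 # matches the closed form closely
```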
Interactive: KL Between Gaussians
Explore how KL divergence changes as you move the model distribution Q away from the true distribution P.
Asymmetry: Not a True Distance
In geometry, distance is symmetric: Distance(A to B) = Distance(B to A). KL Divergence is NOT symmetric: in general, $D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$.
This asymmetry is not a bug. It reflects a real difference: "How much does Q fail to capture P?" is a different question from "How much does P fail to capture Q?"
Worked Example
Let P be a fair coin ($P(\text{heads}) = 0.5$) and Q a biased coin.
Forward KL: $D_{KL}(P \,\|\, Q)$ weights the log-ratio of the two distributions by the fair coin's probabilities.
Reverse KL: $D_{KL}(Q \,\|\, P)$ weights it by the biased coin's probabilities.
- The two values disagree: the direction matters! (See the numerical sketch below.)
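The example's exact bias isn't shown here, so the sketch below assumes a 90/10 biased coin purely for illustration and computes both directions numerically:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions, in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.5])   # fair coin
q = np.array([0.9, 0.1])   # biased coin (bias chosen for illustration)

forward = kl(p, q)          # D_KL(P || Q)
reverse = kl(q, p)          # D_KL(Q || P)
print(f"Forward KL: {forward:.4f} nats ({forward / np.log(2):.4f} bits)")
print(f"Reverse KL: {reverse:.4f} nats ({reverse / np.log(2):.4f} bits)")
# Forward ~0.5108 nats, reverse ~0.3681 nats: the two directions disagree.
```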
What are "nats"?
Nats (natural units) measure information using the natural logarithm $\ln$ (base $e$). Bits use $\log_2$. To convert: $1 \text{ nat} = 1/\ln 2 \approx 1.443$ bits. Deep learning frameworks typically use nats because $\ln$ has simpler derivatives.
Forward vs Reverse KL
The choice between $D_{KL}(P \,\|\, Q)$ (Forward) and $D_{KL}(Q \,\|\, P)$ (Reverse) has major implications for how your model behaves.
Forward KL: $D_{KL}(P \,\|\, Q)$
Mean-Seeking / Moment-Matching
Q must cover all the mass of P. If P is multimodal, Q will spread out to cover all modes, potentially with density in between.
Used in: VAEs, Maximum Likelihood, Expectation Maximization
Reverse KL: $D_{KL}(Q \,\|\, P)$
Mode-Seeking / Zero-Forcing
Q avoids placing mass where P is zero. If P is multimodal, Q will collapse to cover only ONE mode, ignoring the rest.
Used in: Variational Inference, Some GANs, Policy Gradients
Multimodal Example
If P has two peaks (bimodal), Forward KL makes Q spread across both peaks (blurry, inclusive). Reverse KL makes Q collapse to just one peak (sharp but missing mass).
Interactive: Forward vs Reverse
See how the direction of KL Divergence dramatically changes model behavior for multimodal distributions.
Zero-Avoiding (Mass Covering)
Forward KL penalizes Q if P(x) > 0 but Q(x) ≈ 0. To avoid infinite penalty, Q stretches to cover ALL of P's support, often becoming blurry.
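For a non-interactive version of the same idea, the sketch below fits a single Gaussian Q to an assumed bimodal mixture P (equal-weight modes at ±3) by brute-force grid search, once minimizing forward KL and once minimizing reverse KL on a discretized grid:

```python
import numpy as np

# Discretize the real line; P is an assumed equal-weight mixture of two
# Gaussians (modes at -3 and +3, std 0.7) standing in for a bimodal target.
x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * gaussian_pdf(x, -3, 0.7) + 0.5 * gaussian_pdf(x, 3, 0.7)

def kl(a, b):
    """Numerical D_KL(a || b) on the grid, in nats (inf if b misses a's support)."""
    mask = a > 0
    if np.any(b[mask] == 0):
        return np.inf
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

# Brute-force search over single Gaussians Q(mu, sigma) under each objective.
best = {"forward": (np.inf, None), "reverse": (np.inf, None)}
for mu in np.linspace(-4, 4, 81):
    for sigma in np.linspace(0.3, 4, 75):
        q = gaussian_pdf(x, mu, sigma)
        for name, d in (("forward", kl(p, q)), ("reverse", kl(q, p))):
            if d < best[name][0]:
                best[name] = (d, (mu, sigma))

for name, (d, (mu, sigma)) in best.items():
    print(f"{name:7s} KL: mu={mu:+.2f}, sigma={sigma:.2f}, KL={d:.3f} nats")
# Forward KL lands near mu = 0 with a wide sigma (covers both modes);
# reverse KL locks onto one mode (mu near +3 or -3) with a narrow sigma.
```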
Relation to Entropy
KL Divergence connects beautifully to Entropy and Cross-Entropy:
KL Divergence = Cross-Entropy - Entropy:
$$D_{KL}(P \,\|\, Q) = H(P, Q) - H(P)$$
Since H(P) is constant for fixed data, minimizing Cross-Entropy is equivalent to minimizing KL Divergence. This is why Cross-Entropy loss makes models learn to match the data distribution.
Why This Connection Matters
In classification, we don't actually need to compute H(P) because it's constant. We just minimize H(P,Q), which implicitly minimizes the divergence between our model and the true distribution!
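A quick numerical check of the decomposition with made-up distributions P (labels) and Q (model predictions):

```python
import numpy as np

# Assumed toy distributions over 4 classes
p = np.array([0.1, 0.2, 0.3, 0.4])    # "true" label distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # model's prediction

entropy       = -np.sum(p * np.log(p))      # H(P)
cross_entropy = -np.sum(p * np.log(q))      # H(P, Q)
kl            =  np.sum(p * np.log(p / q))  # D_KL(P || Q)

print(cross_entropy, entropy + kl)   # equal up to rounding: H(P, Q) = H(P) + D_KL(P || Q)
print(cross_entropy >= entropy)      # True: cross-entropy is lower-bounded by H(P)
```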
Interactive: Entropy Relationship
Explore the decomposition: Cross-Entropy = Entropy + KL Divergence.
The Information Theory Identity
Visualizing $H(P, Q) = H(P) + D_{KL}(P \,\|\, Q)$.
Key Insight: Since H(P) is fixed by the data, minimizing Cross-Entropy is mathematically identical to minimizing KL Divergence.
This is why training a neural net with Cross-Entropy loss makes it learn the true distribution!
Whenever Q differs from P, the model is inefficient: Cross-Entropy is higher than its theoretical minimum, the Entropy $H(P)$.
Key Properties of KL Divergence
1. Non-Negativity (Gibbs' Inequality)
KL Divergence is always non-negative, $D_{KL}(P \,\|\, Q) \ge 0$, and equals zero if and only if $P = Q$ everywhere.
Why? Since $\log$ is concave, Jensen's Inequality gives:
$$-D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{Q(x)}{P(x)} \le \log \sum_x P(x) \frac{Q(x)}{P(x)} = \log \sum_x Q(x) = \log 1$$
Since $\log 1 = 0$, we have $D_{KL}(P \,\|\, Q) \ge 0$.
ML implication: You can never have "negative divergence." If your loss is negative, there's a bug in your code.
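A quick empirical sanity check of Gibbs' inequality on random distribution pairs (the dimension and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Random 5-outcome distributions drawn from a Dirichlet
divs = [kl(rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5)))
        for _ in range(10_000)]

print(min(divs))            # positive for every random pair
p = rng.dirichlet(np.ones(5))
print(kl(p, p))             # exactly 0.0 when Q = P
```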
2. Asymmetry (Not a Metric)
Unlike Euclidean distance, KL Divergence is not symmetric. The "distance" from P to Q differs from Q to P.
Forward KL: $D_{KL}(P \,\|\, Q)$
"How well does Q explain P?"
Reverse KL: $D_{KL}(Q \,\|\, P)$
"How well does P explain Q?"
Why it fails as a metric: A true metric requires symmetry ($d(P, Q) = d(Q, P)$) and the triangle inequality ($d(P, R) \le d(P, Q) + d(Q, R)$). KL satisfies neither.
3. Convexity
$D_{KL}(P \,\|\, Q)$ is jointly convex in the pair $(P, Q)$. For any $\lambda \in [0, 1]$:
$$D_{KL}\big(\lambda P_1 + (1-\lambda) P_2 \,\big\|\, \lambda Q_1 + (1-\lambda) Q_2\big) \le \lambda D_{KL}(P_1 \,\|\, Q_1) + (1-\lambda) D_{KL}(P_2 \,\|\, Q_2)$$
Mixing distributions never increases the divergence. This property helps ensure (a numerical check follows the list below):
- Gradient descent on Q (for fixed P) can reach the global minimum
- No spurious local minima when optimizing directly over the distribution Q, although parameterizing Q can reintroduce non-convexity, e.g. in variational inference
- Optimization is computationally tractable
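A small numerical check of the convexity inequality on random distributions and mixing weights (dimensions and counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Check D_KL(lam*P1 + (1-lam)*P2 || lam*Q1 + (1-lam)*Q2)
#       <= lam*D_KL(P1 || Q1) + (1-lam)*D_KL(P2 || Q2)
violations = 0
for _ in range(10_000):
    p1, p2, q1, q2 = (rng.dirichlet(np.ones(4)) for _ in range(4))
    lam = rng.uniform()
    lhs = kl(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)
    rhs = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
    violations += lhs > rhs + 1e-12    # tolerance for floating-point error
print(violations)                       # 0
```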
4. Additive for Independent Variables
If $x_1$ and $x_2$ are independent under both P and Q:
$$D_{KL}\big(P(x_1, x_2) \,\|\, Q(x_1, x_2)\big) = D_{KL}\big(P(x_1) \,\|\, Q(x_1)\big) + D_{KL}\big(P(x_2) \,\|\, Q(x_2)\big)$$
Derivation: independence means the joints factorize, so the log-ratio splits into a sum, $\log \frac{P(x_1) P(x_2)}{Q(x_1) Q(x_2)} = \log \frac{P(x_1)}{Q(x_1)} + \log \frac{P(x_2)}{Q(x_2)}$, and taking the expectation under P gives one KL term per variable.
ML implication: For high-dimensional data with independent features, you can compute KL divergence feature-by-feature and sum them up.
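A short check with assumed marginals: building the joints as outer products (independence) and comparing the joint KL to the sum of marginal KLs:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Assumed marginals for two independent variables x1 (3 outcomes) and x2 (2 outcomes)
p1, q1 = np.array([0.2, 0.3, 0.5]), np.array([0.3, 0.3, 0.4])
p2, q2 = np.array([0.6, 0.4]),      np.array([0.5, 0.5])

# Independence means the joints are outer products of the marginals
p_joint = np.outer(p1, p2).ravel()
q_joint = np.outer(q1, q2).ravel()

print(kl(p_joint, q_joint))         # equals the sum below (up to rounding)
print(kl(p1, q1) + kl(p2, q2))
```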
5. Invariance Under Reparameterization
For any invertible transformation $y = f(x)$:
$$D_{KL}\big(P_X \,\|\, Q_X\big) = D_{KL}\big(P_Y \,\|\, Q_Y\big)$$
KL Divergence only depends on the probability values, not how the space is parameterized. Scaling, rotating, or nonlinearly transforming your features does not change the divergence.
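A grid-based sketch of why this holds for the nonlinear map $y = e^x$ (the Gaussian parameters are illustrative): the Jacobian factors that appear in the transformed densities cancel inside the log-ratio, so the two computations give the same number.

```python
import numpy as np

# P and Q are Gaussian densities on a grid; y = exp(x) is an invertible map.
x = np.linspace(-10, 10, 20_001)
dx = x[1] - x[0]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p_x = normal_pdf(x, 0.0, 1.0)
q_x = normal_pdf(x, 0.5, 1.2)

# Change of variables: densities of y = exp(x) pick up a Jacobian factor 1/y,
# which cancels inside log(p_y / q_y); the grid spacing transforms as dy = y * dx.
y = np.exp(x)
p_y, q_y = p_x / y, q_x / y
dy = y * dx

kl_x = np.sum(p_x * np.log(p_x / q_x)) * dx
kl_y = np.sum(p_y * np.log(p_y / q_y) * dy)
print(kl_x, kl_y)   # same value: the divergence ignores how the space is parameterized
```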
ML Applications
Variational Autoencoders (VAEs)
The VAE loss has two parts: Reconstruction Loss + KL Divergence. The KL term forces the learned latent distribution q(z|x) to be close to a standard Normal prior N(0, I).
This regularization ensures smooth, interpretable latent spaces for generation.
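A minimal sketch of the KL term alone, using the standard closed form for a diagonal-Gaussian posterior against a standard Normal prior (the encoder outputs below are made up):

```python
import numpy as np

def vae_kl(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims, in nats.

    Closed form per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2).
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# Illustrative encoder outputs for one sample with a 4-dimensional latent space
mu      = np.array([ 0.3, -0.1, 0.8,  0.0])
log_var = np.array([-0.2,  0.1, 0.0, -0.5])

print(vae_kl(mu, log_var))               # > 0: penalty for straying from N(0, I)
print(vae_kl(np.zeros(4), np.zeros(4)))  # 0.0: no penalty when q(z|x) = N(0, I)
```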
t-SNE Visualization
t-SNE minimizes the KL Divergence between pairwise similarity distributions computed in the high-dimensional space and in the low-dimensional embedding. This preserves local structure while reducing dimensions.
Critical for visualizing embeddings, gene expression data, and neural network activations.
Knowledge Distillation
Training a small "student" model to mimic a large "teacher" model. The loss is the KL Divergence between the teacher's softmax outputs (with temperature) and the student's outputs.
Used to compress BERT → DistilBERT, reducing size by 40% with 97% performance.
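A rough NumPy sketch of a distillation loss of this kind (the temperature, logits, and the T² rescaling convention are illustrative assumptions; real pipelines use their framework's built-in ops):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL( teacher's softened softmax || student's softened softmax ), in nats.

    The T^2 factor is a common rescaling so gradients stay comparable across temperatures.
    """
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return temperature**2 * np.sum(t * np.log(t / s))

# Made-up logits over 5 classes
teacher = np.array([4.0, 1.5, 0.2, -1.0, -2.0])
student = np.array([3.0, 2.5, 0.0, -0.5, -1.5])
print(distillation_loss(teacher, student))   # shrinks as the student mimics the teacher
```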
Reinforcement Learning (TRPO/PPO)
Trust Region Policy Optimization constrains policy updates so the new policy doesn't diverge too much from the old policy, measured by KL Divergence. PPO uses a clipped surrogate objective.
Powers OpenAI's robotic control, Dota 2 agents, and ChatGPT's RLHF training.
Generative Adversarial Networks (GANs)
Some GAN formulations (f-GAN) use KL or reverse KL divergence. The choice affects whether the generator covers all modes (forward KL) or focuses on high-quality single modes (reverse KL).
Explains mode collapse: reverse KL encourages sharp, single-mode outputs.
Bayesian Model Selection
KL Divergence measures how well an approximate posterior matches the true posterior in variational Bayes. Also used in Akaike Information Criterion (AIC) for model comparison.
Theoretical foundation for choosing between competing statistical models.