Jensen-Shannon Divergence

A symmetric, smoothed version of KL Divergence. The measure that powers the original GAN objective.

Introduction

KL Divergence is a powerful tool for measuring the difference between probability distributions. However, it has two fundamental limitations:

  • Asymmetry: KL(P||Q) is not equal to KL(Q||P). The "direction" matters.
  • Unboundedness: If P(x) > 0 but Q(x) = 0 for any x, KL divergence becomes infinite.

The Jensen-Shannon Divergence (JSD), introduced by Jianhua Lin in 1991, elegantly solves both problems. It measures how much each distribution diverges from their average mixture, creating a symmetric and bounded measure.

Historical Note

JSD is named after Johan Jensen (of Jensen's Inequality) and Claude Shannon (the father of information theory). It was designed to create a "smoothed" and symmetric version of KL divergence for practical applications.

Definition & Formula

First, define the mixture distribution M:

M = \frac{1}{2}(P + Q)

JSD is defined as the average KL divergence of P and Q from this mixture M:

JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)

Intuition: Instead of directly comparing P to Q (which is asymmetric), we compare both P and Q to their "meeting point" M. This naturally creates symmetry.

Alternative Formulations

JSD can be expressed in several equivalent ways, each providing different intuitions:

1. Entropy-Based Form

JSD(P \| Q) = H(M) - \frac{1}{2}H(P) - \frac{1}{2}H(Q)

JSD equals the entropy of the mixture minus the average entropy of P and Q. This shows JSD measures the "extra uncertainty" introduced by mixing.

2. Mutual Information Form

JSD(P \| Q) = I(X; Z)

Where Z is a binary variable indicating whether a sample comes from P or Q. JSD equals the mutual information between the sample and its source.

3. Expanded Sum Form

JSD(P \| Q) = \frac{1}{2} \sum_x P(x) \log \frac{2P(x)}{P(x)+Q(x)} + \frac{1}{2} \sum_x Q(x) \log \frac{2Q(x)}{P(x)+Q(x)}
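These forms are easy to check numerically. Below is a minimal sketch in Python (assuming NumPy; the helper names entropy, kl, and jsd are illustrative) that computes JSD both from the averaged-KL definition and from the entropy-based form:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    """KL divergence in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence via the averaged-KL definition."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

# The two formulations agree up to floating-point error.
print(jsd(p, q))                                                      # averaged-KL form
print(entropy(0.5 * (p + q)) - 0.5 * entropy(p) - 0.5 * entropy(q))   # entropy-based form
```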

Interactive Simulator

Drag the distributions apart. Notice that even when they barely overlap (where KL would approach infinity), JSD remains bounded between 0 and 1. The dashed purple line shows the mixture distribution M.

[JSD Simulator: two adjustable Gaussians P(x) and Q(x) (mean μ and standard deviation σ sliders) plotted with their mixture M(x); live readouts show KL(P || M), KL(Q || M), and the resulting JSD, with a maximum of 1.0 bit.]

Step-by-Step Calculation

Let's calculate JSD for two simple discrete distributions:

Distribution P

P = [0.9, 0.1]

Distribution Q

Q = [0.1, 0.9]

Step 1: Compute Mixture M

M = 0.5 × P + 0.5 × Q = [0.5, 0.5]

Step 2: Compute KL(P||M)

= 0.9 × log₂(0.9/0.5) + 0.1 × log₂(0.1/0.5)

= 0.9 × 0.848 + 0.1 × (-2.322) ≈ 0.531 bits

Step 3: Compute KL(Q||M)

= 0.1 × log₂(0.1/0.5) + 0.9 × log₂(0.9/0.5)

= 0.1 × (-2.322) + 0.9 × 0.848 ≈ 0.531 bits

Step 4: Final JSD

JSD = 0.5 × 0.531 + 0.5 × 0.531 = 0.531 bits

Notice: KL(P||M) = KL(Q||M) here only because P and Q are "mirror images" of each other. In general the two KL terms differ, but their average, the JSD, is always symmetric.
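As a cross-check, SciPy's jensenshannon function returns the Jensen-Shannon distance, i.e. the square root of JSD, so squaring it reproduces the value above. A small sketch:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

P = np.array([0.9, 0.1])
Q = np.array([0.1, 0.9])

# jensenshannon returns the JS *distance* = sqrt(JSD); square it for the divergence.
js_distance = jensenshannon(P, Q, base=2)
print(js_distance ** 2)   # ≈ 0.531 bits, matching the hand calculation
```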

Key Properties

1. Symmetry

JSD(P \| Q) = JSD(Q \| P)

Unlike KL Divergence, JSD treats both distributions equally. The order does not matter.

Why is this true? The mixture M = \frac{1}{2}(P + Q) is symmetric in P and Q. Since JSD measures the average divergence of P and Q from this same mixture, swapping P and Q changes nothing.

Practical implication: This makes JSD ideal for similarity/distance measures where neither distribution is privileged as "ground truth." For example, comparing two documents or two clusters where both are equally valid.

2. Bounded Range

0 \leq JSD(P \| Q) \leq \log(2)

JSD is always bounded. Using base-2 logarithm, the range is [0, 1] bits. Using natural log (common in ML), the range is [0, ln(2)] ≈ [0, 0.693] nats.

Why bounded? Even when P and Q have completely disjoint support (no overlap at all), JSD remains finite. Wherever P(x) > 0, the mixture M(x) = \frac{P(x) + Q(x)}{2} \geq \frac{P(x)}{2} is also positive, so the ratio inside the logarithm is at most 2 and never blows up.

Minimum (JSD = 0): When P = Q everywhere. The distributions are identical.

Maximum (JSD = log 2): When P and Q have completely disjoint support. Wherever P is positive, Q is zero, and vice versa. They share no probability mass.

3. Zero If and Only If Equal

JSD(P \| Q) = 0 \iff P = Q \text{ everywhere}

JSD equals zero only when the two distributions are identical at every point. This is a strict equality condition.

Proof sketch: JSD is a sum of two KL divergences, both of which are non-negative. For the sum to be zero, both must be zero. D_{KL}(P \| M) = 0 implies P = M, and D_{KL}(Q \| M) = 0 implies Q = M. Therefore P = Q = M.

Contrast with KL: KL Divergence also vanishes only when P = Q, but when the supports differ it can be finite in one direction and infinite in the other. JSD gives a single, symmetric, always-finite answer with the same clean zero condition.

4. Convexity

JSD(\lambda P_1 + (1-\lambda)P_2 \| Q) \leq \lambda \cdot JSD(P_1 \| Q) + (1-\lambda) \cdot JSD(P_2 \| Q)

JSD is convex in both of its arguments. Mixing two distributions and then computing JSD gives a value no larger than the weighted average of individual JSDs.

Why this matters for optimization: When minimizing JSD directly over the space of distributions, convexity guarantees that any local minimum is also a global minimum. (This guarantee does not automatically carry over once the distribution is parameterized by a non-convex model such as a neural network.)

GAN training: This convexity is partly why the original GAN objective (which minimizes JSD) has nice theoretical properties, even though practical training can still be unstable due to the discriminator's role.

5. Relationship to Mutual Information

JSD(P \| Q) = I(X; Z)

JSD equals the mutual information between a sample X and a binary indicator Z that tells you which distribution (P or Q) the sample came from.

Interpretation: JSD measures how much information the value of a sample provides about its source distribution. If JSD is high, you can easily tell whether a sample came from P or Q. If JSD is zero, the distributions are identical and the source is unknowable.

Upper bound: Since I(X; Z) \leq H(Z) and Z is binary with equal weights, H(Z) = \log(2). This proves the upper bound of JSD.
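This identity can be checked numerically by building the joint distribution of (X, Z), with Z uniform over {P, Q}. A small sketch (assuming NumPy) using the P and Q from the worked example above:

```python
import numpy as np

P = np.array([0.9, 0.1])
Q = np.array([0.1, 0.9])

# Joint p(x, z): Z = 0 -> sample from P, Z = 1 -> sample from Q, each with probability 1/2.
joint = np.stack([0.5 * P, 0.5 * Q])   # shape (2, |X|), rows indexed by z
p_x = joint.sum(axis=0)                # marginal over X (this is exactly the mixture M)
p_z = joint.sum(axis=1)                # marginal over Z = [0.5, 0.5]

# Mutual information I(X; Z) in bits.
mi = sum(
    joint[z, x] * np.log2(joint[z, x] / (p_z[z] * p_x[x]))
    for z in range(2) for x in range(len(P))
    if joint[z, x] > 0
)
print(mi)   # ≈ 0.531 bits = JSD(P || Q) from the worked example
```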

6. Smoothness (No Singularities)

P(x) = 0 \text{ or } Q(x) = 0 \Rightarrow \text{JSD still finite}

Unlike KL Divergence, JSD never becomes infinite, even when distributions have non-overlapping support.

Why KL fails: In D_{KL}(P \| Q) = \sum P(x) \log \frac{P(x)}{Q(x)}, if Q(x) = 0 but P(x) > 0, you get \log\left(\frac{\text{something}}{0}\right) = \infty.

Why JSD works: JSD compares to the mixture M, not directly to Q. Since M(x) = \frac{P(x) + Q(x)}{2}, if P(x) > 0 then M(x) > 0. No division by zero ever occurs.

Practical benefit: You can safely compute JSD for any pair of distributions without worrying about numerical instabilities or needing to add smoothing constants.
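A quick sketch of this contrast (assuming SciPy) for two distributions with completely disjoint support:

```python
import numpy as np
from scipy.special import rel_entr
from scipy.spatial.distance import jensenshannon

# Completely disjoint support: wherever P is positive, Q is zero, and vice versa.
P = np.array([0.5, 0.5, 0.0, 0.0])
Q = np.array([0.0, 0.0, 0.5, 0.5])

print(rel_entr(P, Q).sum())               # inf -- KL(P || Q) blows up
print(jensenshannon(P, Q, base=2) ** 2)   # 1.0 -- JSD hits its maximum of 1 bit (= log 2)
```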

JSD vs KL Divergence

| Property | KL Divergence | JSD |
| --- | --- | --- |
| Symmetry | ✗ Asymmetric | ✓ Symmetric |
| Range | [0, ∞) | [0, log 2] (= [0, 1] in bits) |
| Disjoint Support | Undefined (∞) | Defined (= log 2, i.e. 1 bit) |
| Triangle Inequality | ✗ No | ✓ √JSD satisfies it |
| Computation | Faster (one KL term) | Slower (two KL terms plus the mixture) |

Generalized JSD

The standard JSD uses equal weights (0.5, 0.5). The generalized JSD allows arbitrary weights:

JSD_{\pi}(P_1, ..., P_n) = H\left(\sum_i \pi_i P_i\right) - \sum_i \pi_i H(P_i)

Use Case: Unequal Importance

If you have a "ground truth" distribution P that matters more than an "approximation" Q, you can weight P higher (e.g., π = [0.8, 0.2]).

The generalized form can also compare multiple distributions at once, not just two. This is useful for comparing multiple models or clusters.
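A sketch of the generalized form (the names generalized_jsd and shannon_entropy are illustrative; assumes NumPy):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits, skipping zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def generalized_jsd(dists, weights):
    """H(sum_i pi_i P_i) - sum_i pi_i H(P_i) for any number of distributions."""
    dists = np.asarray(dists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mixture = weights @ dists
    return shannon_entropy(mixture) - np.sum([w * shannon_entropy(d) for w, d in zip(weights, dists)])

P = np.array([0.9, 0.1])
Q = np.array([0.1, 0.9])

print(generalized_jsd([P, Q], [0.5, 0.5]))   # standard JSD ≈ 0.531 bits
print(generalized_jsd([P, Q], [0.8, 0.2]))   # variant that weights P as "ground truth"
```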

The GAN Connection

Generative Adversarial Networks (GANs) have a deep connection to JSD that explains both their power and their famous training instability.

Original GAN Objective

The minimax game between Generator (G) and Discriminator (D):

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

When D is optimal, this objective equals 2 × JSD(p_data || p_G) - log(4).
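A sketch of the standard derivation behind this identity (natural logarithms): for a fixed generator, the optimal discriminator is D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}. Substituting it back into V and writing M = \frac{1}{2}(p_{data} + p_G):

V(D^*, G) = \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}\right] + \mathbb{E}_{x \sim p_G}\left[\log \frac{p_G(x)}{p_{data}(x) + p_G(x)}\right]

= D_{KL}(p_{data} \| M) + D_{KL}(p_G \| M) - \log 4 = 2 \cdot JSD(p_{data} \| p_G) - \log 4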

The Vanishing Gradient Problem

When p_data and p_G have non-overlapping support (very common early in training):

  • JSD saturates at exactly log(2) ≈ 0.693
  • The gradient of JSD with respect to G becomes zero
  • The Generator receives no learning signal
  • Training stalls or fails entirely

The Solution: Wasserstein GAN

WGAN replaces JSD with the Wasserstein (Earth Mover's) distance. Unlike JSD, Wasserstein distance provides meaningful gradients even when distributions don't overlap. This led to more stable GAN training.
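The contrast is easy to see numerically by discretizing two unit-variance Gaussians on a grid and tracking both quantities as they are pulled apart. A sketch assuming NumPy and SciPy:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import norm, wasserstein_distance

x = np.linspace(-10, 10, 2001)           # common support grid

for shift in [0.0, 1.0, 3.0, 6.0]:
    p = norm.pdf(x, loc=0.0, scale=1.0)
    q = norm.pdf(x, loc=shift, scale=1.0)
    p, q = p / p.sum(), q / q.sum()       # discretize to probability vectors

    jsd = jensenshannon(p, q) ** 2        # natural-log JSD, saturates at log(2) ≈ 0.693
    wd = wasserstein_distance(x, x, u_weights=p, v_weights=q)  # keeps growing with the shift
    print(f"shift={shift:.1f}  JSD={jsd:.3f}  Wasserstein={wd:.2f}")
```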

Interactive: GAN Training Dynamics

Drag the slider to separate the generator distribution from real data. Watch how JSD saturates and gradients vanish when distributions don't overlap, while Wasserstein distance maintains useful gradients.

[Interactive widget: real data P_{data} and generator P_G plotted with their mixture M; a separation slider runs from "Identical" to "Completely Separate", with live readouts of JSD (max = log 2), its gradient, and the Wasserstein distance.]

Once the distributions are fully separated, JSD saturates at its maximum of roughly log 2. The generator receives no learning signal because \nabla_G JSD \approx 0, and training stalls. This is why original GANs were unstable.

A True Metric

JSD itself is NOT a metric because it doesn't satisfy the triangle inequality. However, its square root does!

D_{JS}(P, Q) = \sqrt{JSD(P \| Q)}

This is called the Jensen-Shannon Distance.

Metric Axioms Satisfied

  • 1. Non-negativity: D(P, Q) ≥ 0
  • 2. Identity: D(P, Q) = 0 iff P = Q
  • 3. Symmetry: D(P, Q) = D(Q, P)
  • 4. Triangle Inequality: D(P, R) ≤ D(P, Q) + D(Q, R)
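A quick numerical spot-check of the triangle inequality on random distributions (a sketch assuming SciPy; jensenshannon already returns the square root, i.e. the distance):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)

for _ in range(1000):
    # Three random distributions over 5 outcomes.
    P, Q, R = rng.dirichlet(np.ones(5), size=3)
    d_pq, d_qr, d_pr = jensenshannon(P, Q), jensenshannon(Q, R), jensenshannon(P, R)
    assert d_pr <= d_pq + d_qr + 1e-12   # triangle inequality holds for the *distance*
```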

ML Applications

Document Similarity

Compare word frequency distributions or topic distributions from LDA. JSD's symmetry makes it ideal for document clustering.

Clustering Evaluation

Compare cluster assignment distributions between two clusterings. How similar are two different k-means solutions?

Distribution Shift Detection

Track JSD between training and serving data distributions. High JSD indicates data drift requiring model retraining.
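A minimal drift-check sketch, assuming NumPy and SciPy; the function name, bin count, and 0.1-bit threshold are illustrative choices, not a standard. It histograms both samples on shared bins and compares the resulting distributions with JSD:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd_drift(train_values, serving_values, n_bins=30, threshold=0.1):
    """Return (jsd_in_bits, drift_flag) for two 1-D feature samples."""
    lo = min(train_values.min(), serving_values.min())
    hi = max(train_values.max(), serving_values.max())
    bins = np.linspace(lo, hi, n_bins + 1)

    # Histogram both samples on the same bins, then normalize to probabilities.
    p, _ = np.histogram(train_values, bins=bins)
    q, _ = np.histogram(serving_values, bins=bins)
    p = p / p.sum()
    q = q / q.sum()

    jsd = jensenshannon(p, q, base=2) ** 2   # bounded in [0, 1] bits, no smoothing needed
    return jsd, jsd > threshold

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=10_000)
serving = rng.normal(loc=1.0, scale=1.5, size=10_000)   # shifted and widened distribution
print(jsd_drift(train, serving))
```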

Bioinformatics

Compare gene expression profiles, protein sequence distributions, or microbiome compositions across samples.