Introduction
KL Divergence is a powerful tool for measuring the difference between probability distributions. However, it has two fundamental limitations:
- Asymmetry: KL(P||Q) is not equal to KL(Q||P). The "direction" matters.
- Unboundedness: If P(x) > 0 but Q(x) = 0 for any x, KL divergence becomes infinite.
The Jensen-Shannon Divergence (JSD), introduced by Jianhua Lin in 1991, elegantly solves both problems. It measures how much each distribution diverges from their average mixture, creating a symmetric and bounded measure.
Historical Note
JSD is named after Johan Jensen (of Jensen's Inequality) and Claude Shannon (the father of information theory). It was designed to create a "smoothed" and symmetric version of KL divergence for practical applications.
Definition & Formula
First, define the mixture distribution M as the equal-weight average of P and Q:

M = 0.5 × (P + Q)
JSD is defined as the average KL divergence of P and Q from this mixture M:

JSD(P || Q) = 0.5 × KL(P || M) + 0.5 × KL(Q || M)
Intuition: Instead of directly comparing P to Q (which is asymmetric), we compare both P and Q to their "meeting point" M. This naturally creates symmetry.
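Here is a minimal NumPy sketch of this definition; the helper names `kl_bits` and `jsd_bits` are our own, not from any particular library:

```python
import numpy as np

def kl_bits(p, q):
    """KL(p || q) in bits; terms where p(x) = 0 contribute nothing."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jsd_bits(p, q):
    """JSD(p || q): average KL divergence of p and q from their mixture m."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)                      # the mixture M = 0.5 * (P + Q)
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

print(jsd_bits([0.9, 0.1], [0.1, 0.9]))   # ≈ 0.531 bits (worked through below)
```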
Alternative Formulations
JSD can be expressed in several equivalent ways, each providing different intuitions:
1. Entropy-Based Form

JSD(P || Q) = H(M) − 0.5 × (H(P) + H(Q))

JSD equals the entropy of the mixture minus the average entropy of P and Q. This shows that JSD measures the "extra uncertainty" introduced by mixing.
2. Mutual Information Form

JSD(P || Q) = I(X; Z)

where Z is a binary variable indicating whether a sample X comes from P or Q. JSD equals the mutual information between the sample and its source.
3. Expanded Sum Form

JSD(P || Q) = 0.5 × Σₓ [ P(x) × log(2P(x) / (P(x) + Q(x))) + Q(x) × log(2Q(x) / (P(x) + Q(x))) ]

Substituting M = 0.5 × (P + Q) into the definition and expanding gives a single sum over x, which is convenient for direct implementation.
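All three formulations give the same number. A quick numerical check in NumPy (the distributions and the `entropy_bits` helper are illustrative choices of ours):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
m = 0.5 * (p + q)

# 1. Definition: average KL divergence from the mixture
kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
jsd_kl = 0.5 * kl(p, m) + 0.5 * kl(q, m)

# 2. Entropy-based form: H(M) - 0.5 * (H(P) + H(Q))
jsd_entropy = entropy_bits(m) - 0.5 * (entropy_bits(p) + entropy_bits(q))

# 3. Expanded sum form
jsd_sum = 0.5 * np.sum(p * np.log2(2 * p / (p + q)) + q * np.log2(2 * q / (p + q)))

print(jsd_kl, jsd_entropy, jsd_sum)   # all three agree
```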
Interactive Simulator
Drag the distributions apart. Notice that even when they barely overlap (where KL would approach infinity), JSD remains bounded between 0 and 1. The dashed purple line shows the mixture distribution M.
JSD Simulator
Step-by-Step Calculation
Let's calculate JSD for two simple discrete distributions:
Distribution P
P = [0.9, 0.1]
Distribution Q
Q = [0.1, 0.9]
Step 1: Compute Mixture M
M = 0.5 × P + 0.5 × Q = [0.5, 0.5]
Step 2: Compute KL(P||M)
KL(P||M) = 0.9 × log₂(0.9/0.5) + 0.1 × log₂(0.1/0.5)
= 0.9 × 0.848 + 0.1 × (-2.322) ≈ 0.531 bits
Step 3: Compute KL(Q||M)
KL(Q||M) = 0.1 × log₂(0.1/0.5) + 0.9 × log₂(0.9/0.5)
= 0.1 × (-2.322) + 0.9 × 0.848 ≈ 0.531 bits
Step 4: Final JSD
JSD = 0.5 × 0.531 + 0.5 × 0.531 = 0.531 bits
Notice: KL(P||M) = KL(Q||M) because P and Q are "mirror images." This symmetry is a special case.
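The same arithmetic in code. As a cross-check, `scipy.spatial.distance.jensenshannon` (available in recent SciPy versions) returns the square root of JSD, so squaring it should recover the same value:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.9, 0.1])
q = np.array([0.1, 0.9])
m = 0.5 * (p + q)                                    # [0.5, 0.5]

kl_pm = np.sum(p * np.log2(p / m))                   # ≈ 0.531 bits
kl_qm = np.sum(q * np.log2(q / m))                   # ≈ 0.531 bits
jsd = 0.5 * kl_pm + 0.5 * kl_qm                      # ≈ 0.531 bits

print(jsd, jensenshannon(p, q, base=2) ** 2)         # both ≈ 0.531
```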
Key Properties
1. Symmetry
Unlike KL Divergence, JSD treats both distributions equally. The order does not matter.
Why is this true? The mixture is symmetric in P and Q. Since JSD measures the average divergence of P and Q from this same mixture, swapping P and Q changes nothing.
Practical implication: This makes JSD ideal for similarity/distance measures where neither distribution is privileged as "ground truth." For example, comparing two documents or two clusters where both are equally valid.
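A small illustration of the contrast, using arbitrary toy distributions:

```python
import numpy as np

def kl(p, q):   # bits; assumes strictly positive entries
    return float(np.sum(p * np.log2(p / q)))

def jsd(p, q):  # bits
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.8, 0.15, 0.05])
q = np.array([0.3, 0.4, 0.3])

print(kl(p, q), kl(q, p))    # two different numbers: the direction matters
print(jsd(p, q), jsd(q, p))  # identical: the order does not matter
```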
2. Bounded Range
JSD is always bounded. Using base-2 logarithm, the range is [0, 1] bits. Using natural log (common in ML), the range is [0, ln(2)] ≈ [0, 0.693] nats.
Why bounded? Even when P and Q have completely disjoint support (no overlap at all), JSD remains finite. Wherever P(x) > 0 or Q(x) > 0, the mixture M(x) = 0.5 × (P(x) + Q(x)) is also positive, so no log term ever divides by zero.
Minimum (JSD = 0): When P = Q everywhere. The distributions are identical.
Maximum (JSD = log 2): When P and Q have completely disjoint support. Wherever P is positive, Q is zero, and vice versa. They share no probability mass.
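Both extremes can be checked numerically; the disjoint-support example below is a toy construction of ours:

```python
import numpy as np

def kl(p, q):   # bits; terms with p(x) = 0 contribute nothing
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jsd(p, q):  # bits
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0, 0.0])   # support {1, 2}
q = np.array([0.0, 0.0, 0.5, 0.5])   # support {3, 4} -- no overlap with p

print(jsd(p, p))   # 0.0 -> minimum: identical distributions
print(jsd(p, q))   # 1.0 -> maximum: completely disjoint support (1 bit)
print(kl(p, q))    # inf -> KL blows up on the same pair
```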
3. Zero If and Only If Equal
JSD equals zero only when the two distributions are identical at every point. This is a strict equality condition.
Proof sketch: JSD is a weighted sum of two KL divergences, both of which are non-negative. For the sum to be zero, both must be zero. KL(P || M) = 0 implies P = M, and KL(Q || M) = 0 implies Q = M. Therefore P = Q = M.
Contrast with KL: when the supports differ, KL divergence can be finite in one direction yet infinite in the other. JSD's symmetric, always-finite behavior makes its zero condition cleaner and more intuitive.
4. Convexity
JSD is convex in both of its arguments. Mixing two distributions and then computing JSD gives a value no larger than the weighted average of individual JSDs.
Why this matters for optimization: Convexity guarantees that gradient-based methods won't get stuck in local minima when minimizing JSD. Any local minimum is also a global minimum.
GAN training: This convexity is partly why the original GAN objective (which minimizes JSD) has nice theoretical properties, even though practical training can still be unstable due to the discriminator's role.
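A brute-force numerical check of joint convexity on random distribution pairs (the trial count and tolerance are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):   # bits; Dirichlet samples are strictly positive
    return float(np.sum(p * np.log2(p / q)))

def jsd(p, q):  # bits
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Joint convexity: JSD(lam*P1 + (1-lam)*P2 || lam*Q1 + (1-lam)*Q2)
#                  <= lam*JSD(P1 || Q1) + (1-lam)*JSD(P2 || Q2)
for _ in range(1000):
    p1, p2, q1, q2 = rng.dirichlet(np.ones(5), size=4)
    lam = rng.uniform()
    lhs = jsd(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)
    rhs = lam * jsd(p1, q1) + (1 - lam) * jsd(p2, q2)
    assert lhs <= rhs + 1e-12

print("joint convexity held on 1000 random trials")
```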
5. Relationship to Mutual Information
JSD equals the mutual information between a sample X and a binary indicator Z that tells you which distribution (P or Q) the sample came from.
Interpretation: JSD measures how much information the value of a sample provides about its source distribution. If JSD is high, you can easily tell whether a sample came from P or Q. If JSD is zero, the distributions are identical and the source is unknowable.
Upper bound: Since JSD(P || Q) = I(X; Z) and Z is binary with equal weights, I(X; Z) ≤ H(Z) = log 2. This proves the upper bound of JSD.
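A sketch that builds the joint distribution of (X, Z) explicitly and compares I(X; Z) with JSD (the distributions are arbitrary examples):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])          # X | Z = 0  ~  P
q = np.array([0.1, 0.3, 0.6])          # X | Z = 1  ~  Q

# Joint distribution of (X, Z) with equal source weights
joint = 0.5 * np.vstack([p, q])        # rows: Z, columns: X
px = joint.sum(axis=0)                 # marginal of X (this is the mixture M)
pz = joint.sum(axis=1)                 # marginal of Z ([0.5, 0.5])

# Mutual information I(X; Z) in bits
mi = np.sum(joint * np.log2(joint / np.outer(pz, px)))

# JSD from its definition
kl = lambda a, b: np.sum(a * np.log2(a / b))
jsd = 0.5 * kl(p, px) + 0.5 * kl(q, px)

print(mi, jsd)                         # the two values coincide
```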
6. Smoothness (No Singularities)
Unlike KL Divergence, JSD never becomes infinite, even when distributions have non-overlapping support.
Why KL fails: In KL(P || Q) = Σₓ P(x) × log(P(x) / Q(x)), if Q(x) = 0 but P(x) > 0, the term P(x) × log(P(x) / 0) is infinite.
Why JSD works: JSD compares P and Q to the mixture M, not directly to each other. Since M(x) = 0.5 × (P(x) + Q(x)), if P(x) > 0 then M(x) ≥ 0.5 × P(x) > 0. No division by zero ever occurs.
Practical benefit: You can safely compute JSD for any pair of distributions without worrying about numerical instabilities or needing to add smoothing constants.
JSD vs KL Divergence
| Property | KL Divergence | JSD |
|---|---|---|
| Symmetry | ✗ Asymmetric | ✓ Symmetric |
| Range | [0, ∞) | [0, 1] bits (base 2) |
| Disjoint Support | Infinite | Defined (= 1 bit) |
| Triangle Inequality | ✗ No | ✓ √JSD satisfies it |
| Computation | Faster (one direction) | Slower (2 KL + mixture) |
Generalized JSD
The standard JSD uses equal weights (0.5, 0.5). The generalized JSD allows arbitrary weights:

JSD_π(P₁, ..., Pₙ) = H(π₁P₁ + ... + πₙPₙ) − (π₁H(P₁) + ... + πₙH(Pₙ))

where the weights πᵢ are non-negative and sum to 1.
Use Case: Unequal Importance
If you have a "ground truth" distribution P that matters more than an "approximation" Q, you can weight P higher (e.g., π = [0.8, 0.2]).
The generalized form can also compare multiple distributions at once, not just two. This is useful for comparing multiple models or clusters.
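A sketch of the generalized form using the entropy-based formula; `generalized_jsd` and `entropy_bits` are hypothetical helpers, not library functions:

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def generalized_jsd(dists, weights):
    """JSD_pi(P_1, ..., P_n) = H(sum_i pi_i P_i) - sum_i pi_i H(P_i), in bits."""
    dists = np.asarray(dists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mixture = weights @ dists
    return entropy_bits(mixture) - float(np.sum(weights * [entropy_bits(d) for d in dists]))

p = [0.9, 0.1]
q = [0.1, 0.9]

print(generalized_jsd([p, q], [0.5, 0.5]))                   # ≈ 0.531, the standard JSD
print(generalized_jsd([p, q], [0.8, 0.2]))                   # P weighted as more important
print(generalized_jsd([p, q, [0.5, 0.5]], [1/3, 1/3, 1/3]))  # three distributions at once
```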
The GAN Connection
Generative Adversarial Networks (GANs) have a deep connection to JSD that explains both their power and their famous training instability.
Original GAN Objective
The minimax game between Generator (G) and Discriminator (D):

min_G max_D V(D, G) = E_{x ∼ p_data}[log D(x)] + E_{z ∼ p_z}[log(1 − D(G(z)))]
When D is optimal, this objective equals 2 × JSD(p_data || p_G) - log(4).
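For discrete distributions this identity can be verified directly, using the optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_G(x)) from the original GAN analysis. The example distributions below are arbitrary; logs are natural, so the result is in nats:

```python
import numpy as np

p_data = np.array([0.7, 0.2, 0.1])
p_g    = np.array([0.2, 0.3, 0.5])

# Optimal discriminator for fixed G
d_star = p_data / (p_data + p_g)

# Value of the minimax objective at D = D*
value = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

# 2 * JSD(p_data || p_g) - log 4, with JSD measured in nats
m = 0.5 * (p_data + p_g)
kl = lambda a, b: np.sum(a * np.log(a / b))
jsd_nats = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

print(value, 2 * jsd_nats - np.log(4))   # the two quantities match
```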
The Vanishing Gradient Problem
When p_data and p_G have non-overlapping support (very common early in training):
- JSD saturates at exactly log(2) ≈ 0.693
- The gradient of JSD with respect to G becomes zero
- The Generator receives no learning signal
- Training stalls or fails entirely
The Solution: Wasserstein GAN
WGAN replaces JSD with the Wasserstein (Earth Mover's) distance. Unlike JSD, Wasserstein distance provides meaningful gradients even when distributions don't overlap. This led to more stable GAN training.
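A toy 1-D comparison of the two signals: JSD flattens out as a generator distribution slides away from the data, while the Wasserstein distance keeps growing. The grid, shapes, and shift values are arbitrary choices for illustration:

```python
import numpy as np

def kl_bits(a, b):
    mask = a > 0
    return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

def jsd_bits(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

def wasserstein_1d(p, q, dx=1.0):
    # W1 between distributions on a common 1-D grid: integral of |CDF_P - CDF_Q|
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dx)

n = 200
grid = np.arange(n)
real = np.exp(-0.5 * ((grid - 40) / 5.0) ** 2)   # "data" distribution centered at 40
real /= real.sum()

for shift in [0, 10, 30, 60, 100]:               # generator distribution sliding away
    fake = np.roll(real, shift)
    print(shift, round(jsd_bits(real, fake), 4), round(wasserstein_1d(real, fake), 2))
# JSD quickly saturates near 1 bit once the bulk of the mass stops overlapping,
# while the Wasserstein distance keeps growing linearly with the shift.
```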
Interactive: GAN Training Dynamics
Drag the slider to separate the generator distribution from real data. Watch how JSD saturates and gradients vanish when distributions don't overlap, while Wasserstein distance maintains useful gradients.
Generator vs Real Data Distribution
Vanishing Gradient Problem
JSD has saturated at its maximum (≈ log 2). The generator receives no learning signal because the gradient of the objective with respect to its parameters is essentially zero. Training stalls. This is why original GANs were unstable.
A True Metric
JSD itself is NOT a metric because it does not satisfy the triangle inequality. However, its square root, known as the Jensen-Shannon distance, does; a numerical check follows the list below.
Metric Axioms Satisfied (for D(P, Q) = √JSD(P || Q))
- 1. Non-negativity: D(P, Q) ≥ 0
- 2. Identity: D(P, Q) = 0 iff P = Q
- 3. Symmetry: D(P, Q) = D(Q, P)
- 4. Triangle Inequality: D(P, R) ≤ D(P, Q) + D(Q, R)
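A concrete check: on the triple below, raw JSD violates the triangle inequality while its square root satisfies it. `scipy.spatial.distance.jensenshannon` already returns the square root (the Jensen-Shannon distance):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon   # returns sqrt(JSD), i.e. the JS distance

def jsd_bits(p, q):
    return jensenshannon(p, q, base=2) ** 2

P = np.array([1.0, 0.0])
Q = np.array([0.5, 0.5])
R = np.array([0.0, 1.0])

# Raw JSD violates the triangle inequality on this triple:
print(jsd_bits(P, R), jsd_bits(P, Q) + jsd_bits(Q, R))            # 1.0  >  ~0.62

# Its square root (the Jensen-Shannon distance) satisfies it:
print(jensenshannon(P, R, base=2),
      jensenshannon(P, Q, base=2) + jensenshannon(Q, R, base=2))  # 1.0  <=  ~1.11
```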
ML Applications
Document Similarity
Compare word frequency distributions or topic distributions from LDA. JSD's symmetry makes it ideal for document clustering.
Clustering Evaluation
Compare cluster assignment distributions between two clusterings. How similar are two different k-means solutions?
Distribution Shift Detection
Track JSD between training and serving data distributions. High JSD indicates data drift requiring model retraining.
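A minimal monitoring sketch, assuming scalar features binned onto a shared histogram; the bin count and alert threshold are arbitrary choices for illustration:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def drift_score(train_values, serve_values, bins=20):
    """JSD (in bits) between binned training and serving feature distributions."""
    lo = min(train_values.min(), serve_values.min())
    hi = max(train_values.max(), serve_values.max())
    p, _ = np.histogram(train_values, bins=bins, range=(lo, hi))
    q, _ = np.histogram(serve_values, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return jensenshannon(p, q, base=2) ** 2

rng = np.random.default_rng(42)
train = rng.normal(loc=0.0, scale=1.0, size=10_000)   # training-time feature
serve = rng.normal(loc=0.8, scale=1.2, size=10_000)   # serving traffic has drifted

score = drift_score(train, serve)
print(score)
if score > 0.1:   # alert threshold: an arbitrary choice for this sketch
    print("possible data drift: consider retraining")
```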
Bioinformatics
Compare gene expression profiles, protein sequence distributions, or microbiome compositions across samples.