Jensen-Shannon Divergence

A symmetric, smoothed version of KL Divergence. The measure that powers the original GAN objective.

Introduction

KL Divergence is a powerful tool for measuring the difference between probability distributions. However, it has two fundamental limitations:

  • Asymmetry: KL(P||Q) is not equal to KL(Q||P). The "direction" matters.
  • Unboundedness: If P(x) > 0 but Q(x) = 0 for any x, KL divergence becomes infinite.

The Jensen-Shannon Divergence (JSD), introduced by Jianhua Lin in 1991, elegantly solves both problems. It measures how much each distribution diverges from their average mixture, creating a symmetric and bounded measure.

Historical Note

JSD is named after Johan Jensen (of Jensen's Inequality) and Claude Shannon (the father of information theory). It was designed to create a "smoothed" and symmetric version of KL divergence for practical applications.

Definition & Formula

First, define the mixture distribution M:

M = \frac{1}{2}(P + Q)

JSD is defined as the average KL divergence of P and Q from this mixture M:

JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)

Intuition: Instead of directly comparing P to Q (which is asymmetric), we compare both P and Q to their "meeting point" M. This naturally creates symmetry.

Alternative Formulations

JSD can be expressed in several equivalent ways, each providing different intuitions:

1. Entropy-Based Form

JSD(P \| Q) = H(M) - \frac{1}{2}H(P) - \frac{1}{2}H(Q)

JSD equals the entropy of the mixture minus the average entropy of P and Q. This shows JSD measures the "extra uncertainty" introduced by mixing.

2. Mutual Information Form

JSD(P \| Q) = I(X; Z)

Where Z is a binary variable indicating whether a sample comes from P or Q. JSD equals the mutual information between the sample and its source.

3. Expanded Sum Form

JSD(P \| Q) = \frac{1}{2} \sum_x P(x) \log \frac{2P(x)}{P(x)+Q(x)} + \frac{1}{2} \sum_x Q(x) \log \frac{2Q(x)}{P(x)+Q(x)}
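These forms are easy to check numerically. Below is a minimal sketch in Python (assuming NumPy; the helper names entropy, kl, and jsd are illustrative) that computes JSD both from the averaged-KL definition and from the entropy-based form:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    """KL divergence in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence via the averaged-KL definition."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

# The two formulations agree up to floating-point error.
print(jsd(p, q))                                                      # averaged-KL form
print(entropy(0.5 * (p + q)) - 0.5 * entropy(p) - 0.5 * entropy(q))   # entropy-based form
```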

Interactive Simulator

Drag the distributions apart. Notice that even when they barely overlap (where KL would approach infinity), JSD remains bounded between 0 and 1. The dashed purple line shows the mixture distribution M.

[JSD Simulator: two adjustable Gaussians P(x) and Q(x) (mean μ and standard deviation σ sliders) plotted with their mixture M(x); live readouts show KL(P || M), KL(Q || M), and the resulting JSD, with a maximum of 1.0 bit.]

Step-by-Step Calculation

Let's calculate JSD for two simple discrete distributions:

Distribution P

P = [0.9, 0.1]

Distribution Q

Q = [0.1, 0.9]

Step 1: Compute Mixture M

M = 0.5 × P + 0.5 × Q = [0.5, 0.5]

Step 2: Compute KL(P||M)

= 0.9 × log₂(0.9/0.5) + 0.1 × log₂(0.1/0.5)

= 0.9 × 0.848 + 0.1 × (-2.322) ≈ 0.531 bits

Step 3: Compute KL(Q||M)

= 0.1 × log₂(0.1/0.5) + 0.9 × log₂(0.9/0.5)

= 0.1 × (-2.322) + 0.9 × 0.848 ≈ 0.531 bits

Step 4: Final JSD

JSD = 0.5 × 0.531 + 0.5 × 0.531 = 0.531 bits

Notice: KL(P||M) = KL(Q||M) here only because P and Q are "mirror images" of each other. In general the two KL terms differ, but their average, the JSD, is always symmetric.
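As a cross-check, SciPy's jensenshannon function returns the Jensen-Shannon distance, i.e. the square root of JSD, so squaring it reproduces the value above. A small sketch:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

P = np.array([0.9, 0.1])
Q = np.array([0.1, 0.9])

# jensenshannon returns the JS *distance* = sqrt(JSD); square it for the divergence.
js_distance = jensenshannon(P, Q, base=2)
print(js_distance ** 2)   # ≈ 0.531 bits, matching the hand calculation
```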

Key Properties

1. Symmetry

JSD(P \| Q) = JSD(Q \| P)

Unlike KL Divergence, JSD treats both distributions equally. The order does not matter.

Why is this true? The mixture M = \frac{1}{2}(P + Q) is symmetric in P and Q. Since JSD measures the average divergence of P and Q from this same mixture, swapping P and Q changes nothing.

Practical implication: This makes JSD ideal for similarity/distance measures where neither distribution is privileged as "ground truth." For example, comparing two documents or two clusters where both are equally valid.

2. Bounded Range

0 \leq JSD(P \| Q) \leq \log(2)

JSD is always bounded. Using base-2 logarithm, the range is [0, 1] bits. Using natural log (common in ML), the range is [0, ln(2)] ≈ [0, 0.693] nats.

Why bounded? Even when P and Q have completely disjoint support (no overlap at all), JSD remains finite. Wherever P(x) > 0, the mixture M(x) = \frac{P(x) + Q(x)}{2} \geq \frac{P(x)}{2} is also positive, so the ratio inside the logarithm is at most 2 and never blows up.

Minimum (JSD = 0): When P = Q everywhere. The distributions are identical.

Maximum (JSD = log 2): When P and Q have completely disjoint support. Wherever P is positive, Q is zero, and vice versa. They share no probability mass.

3. Zero If and Only If Equal

JSD(P \| Q) = 0 \iff P = Q \text{ everywhere}

JSD equals zero only when the two distributions are identical at every point. This is a strict equality condition.

Proof sketch: JSD is a sum of two KL divergences, both of which are non-negative. For the sum to be zero, both must be zero. D_{KL}(P \| M) = 0 implies P = M, and D_{KL}(Q \| M) = 0 implies Q = M. Therefore P = Q = M.

Contrast with KL: KL Divergence also vanishes only when P = Q, but when the supports differ it can be finite in one direction and infinite in the other. JSD gives a single, symmetric, always-finite answer with the same clean zero condition.

4. Convexity

JSD(\lambda P_1 + (1-\lambda)P_2 \| Q) \leq \lambda \cdot JSD(P_1 \| Q) + (1-\lambda) \cdot JSD(P_2 \| Q)

JSD is convex in both of its arguments. Mixing two distributions and then computing JSD gives a value no larger than the weighted average of individual JSDs.

Why this matters for optimization: When minimizing JSD directly over the space of distributions, convexity guarantees that any local minimum is also a global minimum. (This guarantee does not automatically carry over once the distribution is parameterized by a non-convex model such as a neural network.)

GAN training: This convexity is partly why the original GAN objective (which minimizes JSD) has nice theoretical properties, even though practical training can still be unstable due to the discriminator's role.

5. Relationship to Mutual Information

JSD(P \| Q) = I(X; Z)

JSD equals the mutual information between a sample X and a binary indicator Z that tells you which distribution (P or Q) the sample came from.

Interpretation: JSD measures how much information the value of a sample provides about its source distribution. If JSD is high, you can easily tell whether a sample came from P or Q. If JSD is zero, the distributions are identical and the source is unknowable.

Upper bound: Since I(X; Z) \leq H(Z) and Z is binary with equal weights, H(Z) = \log(2). This proves the upper bound of JSD.
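This identity can be checked numerically by building the joint distribution of (X, Z), with Z uniform over {P, Q}. A small sketch (assuming NumPy) using the P and Q from the worked example above:

```python
import numpy as np

P = np.array([0.9, 0.1])
Q = np.array([0.1, 0.9])

# Joint p(x, z): Z = 0 -> sample from P, Z = 1 -> sample from Q, each with probability 1/2.
joint = np.stack([0.5 * P, 0.5 * Q])   # shape (2, |X|), rows indexed by z
p_x = joint.sum(axis=0)                # marginal over X (this is exactly the mixture M)
p_z = joint.sum(axis=1)                # marginal over Z = [0.5, 0.5]

# Mutual information I(X; Z) in bits.
mi = sum(
    joint[z, x] * np.log2(joint[z, x] / (p_z[z] * p_x[x]))
    for z in range(2) for x in range(len(P))
    if joint[z, x] > 0
)
print(mi)   # ≈ 0.531 bits = JSD(P || Q) from the worked example
```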

6. Smoothness (No Singularities)

P(x) = 0 \text{ or } Q(x) = 0 \Rightarrow \text{JSD still finite}

Unlike KL Divergence, JSD never becomes infinite, even when distributions have non-overlapping support.

Why KL fails: In D_{KL}(P \| Q) = \sum P(x) \log \frac{P(x)}{Q(x)}, if Q(x) = 0 but P(x) > 0, you get \log\left(\frac{\text{something}}{0}\right) = \infty.

Why JSD works: JSD compares to the mixture M, not directly to Q. Since M(x) = \frac{P(x) + Q(x)}{2}, if P(x) > 0 then M(x) > 0. No division by zero ever occurs.

Practical benefit: You can safely compute JSD for any pair of distributions without worrying about numerical instabilities or needing to add smoothing constants.
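A quick sketch of this contrast (assuming SciPy) for two distributions with completely disjoint support:

```python
import numpy as np
from scipy.special import rel_entr
from scipy.spatial.distance import jensenshannon

# Completely disjoint support: wherever P is positive, Q is zero, and vice versa.
P = np.array([0.5, 0.5, 0.0, 0.0])
Q = np.array([0.0, 0.0, 0.5, 0.5])

print(rel_entr(P, Q).sum())               # inf -- KL(P || Q) blows up
print(jensenshannon(P, Q, base=2) ** 2)   # 1.0 -- JSD hits its maximum of 1 bit (= log 2)
```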

JSD vs KL Divergence

| Property | KL Divergence | JSD |
| --- | --- | --- |
| Symmetry | ✗ Asymmetric | ✓ Symmetric |
| Range | [0, ∞) | [0, log 2] (= [0, 1] in bits) |
| Disjoint Support | Undefined (∞) | Defined (= log 2, i.e. 1 bit) |
| Triangle Inequality | ✗ No | ✓ √JSD satisfies it |
| Computation | Faster (one KL term) | Slower (two KL terms plus the mixture) |

Generalized JSD

The standard JSD uses equal weights (0.5, 0.5). The generalized JSD allows arbitrary weights:

JSD_{\pi}(P_1, ..., P_n) = H\left(\sum_i \pi_i P_i\right) - \sum_i \pi_i H(P_i)

Use Case: Unequal Importance

If you have a "ground truth" distribution P that matters more than an "approximation" Q, you can weight P higher (e.g., π = [0.8, 0.2]).

The generalized form can also compare multiple distributions at once, not just two. This is useful for comparing multiple models or clusters.
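A sketch of the generalized form (the names generalized_jsd and shannon_entropy are illustrative; assumes NumPy):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits, skipping zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def generalized_jsd(dists, weights):
    """H(sum_i pi_i P_i) - sum_i pi_i H(P_i) for any number of distributions."""
    dists = np.asarray(dists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mixture = weights @ dists
    return shannon_entropy(mixture) - np.sum([w * shannon_entropy(d) for w, d in zip(weights, dists)])

P = np.array([0.9, 0.1])
Q = np.array([0.1, 0.9])

print(generalized_jsd([P, Q], [0.5, 0.5]))   # standard JSD ≈ 0.531 bits
print(generalized_jsd([P, Q], [0.8, 0.2]))   # variant that weights P as "ground truth"
```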

The GAN Connection

Generative Adversarial Networks (GANs) have a deep connection to JSD that explains both their power and their famous training instability.

Original GAN Objective

The minimax game between Generator (G) and Discriminator (D):

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

When D is optimal, this objective equals 2 × JSD(p_data || p_G) - log(4).
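A sketch of the standard derivation behind this identity (natural logarithms): for a fixed generator, the optimal discriminator is D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}. Substituting it back into V and writing M = \frac{1}{2}(p_{data} + p_G):

V(D^*, G) = \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}\right] + \mathbb{E}_{x \sim p_G}\left[\log \frac{p_G(x)}{p_{data}(x) + p_G(x)}\right]

= D_{KL}(p_{data} \| M) + D_{KL}(p_G \| M) - \log 4 = 2 \cdot JSD(p_{data} \| p_G) - \log 4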

The Vanishing Gradient Problem

When p_data and p_G have non-overlapping support (very common early in training):

  • JSD saturates at exactly log(2) ≈ 0.693
  • The gradient of JSD with respect to G becomes zero
  • The Generator receives no learning signal
  • Training stalls or fails entirely

The Solution: Wasserstein GAN

WGAN replaces JSD with the Wasserstein (Earth Mover's) distance. Unlike JSD, Wasserstein distance provides meaningful gradients even when distributions don't overlap. This led to more stable GAN training.
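The contrast is easy to see numerically by discretizing two unit-variance Gaussians on a grid and tracking both quantities as they are pulled apart. A sketch assuming NumPy and SciPy:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import norm, wasserstein_distance

x = np.linspace(-10, 10, 2001)           # common support grid

for shift in [0.0, 1.0, 3.0, 6.0]:
    p = norm.pdf(x, loc=0.0, scale=1.0)
    q = norm.pdf(x, loc=shift, scale=1.0)
    p, q = p / p.sum(), q / q.sum()       # discretize to probability vectors

    jsd = jensenshannon(p, q) ** 2        # natural-log JSD, saturates at log(2) ≈ 0.693
    wd = wasserstein_distance(x, x, u_weights=p, v_weights=q)  # keeps growing with the shift
    print(f"shift={shift:.1f}  JSD={jsd:.3f}  Wasserstein={wd:.2f}")
```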

Interactive: GAN Training Dynamics

Drag the slider to separate the generator distribution from real data. Watch how JSD saturates and gradients vanish when distributions don't overlap, while Wasserstein distance maintains useful gradients.

[Interactive widget: real data P_{data} and generator P_G plotted with their mixture M; a separation slider runs from "Identical" to "Completely Separate", with live readouts of JSD (max = log 2), its gradient, and the Wasserstein distance.]

Once the distributions are fully separated, JSD saturates at its maximum of roughly log 2. The generator receives no learning signal because \nabla_G JSD \approx 0, and training stalls. This is why original GANs were unstable.

A True Metric

JSD itself is NOT a metric because it doesn't satisfy the triangle inequality. However, its square root does!

D_{JS}(P, Q) = \sqrt{JSD(P \| Q)}

This is called the Jensen-Shannon Distance.

Metric Axioms Satisfied

  • 1. Non-negativity: D(P, Q) ≥ 0
  • 2. Identity: D(P, Q) = 0 iff P = Q
  • 3. Symmetry: D(P, Q) = D(Q, P)
  • 4. Triangle Inequality: D(P, R) ≤ D(P, Q) + D(Q, R)
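A quick numerical spot-check of the triangle inequality on random distributions (a sketch assuming SciPy; jensenshannon already returns the square root, i.e. the distance):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)

for _ in range(1000):
    # Three random distributions over 5 outcomes.
    P, Q, R = rng.dirichlet(np.ones(5), size=3)
    d_pq, d_qr, d_pr = jensenshannon(P, Q), jensenshannon(Q, R), jensenshannon(P, R)
    assert d_pr <= d_pq + d_qr + 1e-12   # triangle inequality holds for the *distance*
```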

ML Applications

Document Similarity

Compare word frequency distributions or topic distributions from LDA. JSD's symmetry makes it ideal for document clustering.

Clustering Evaluation

Compare cluster assignment distributions between two clusterings. How similar are two different k-means solutions?

Distribution Shift Detection

Track JSD between training and serving data distributions. High JSD indicates data drift requiring model retraining.
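A minimal drift-check sketch, assuming NumPy and SciPy; the function name, bin count, and 0.1-bit threshold are illustrative choices, not a standard. It histograms both samples on shared bins and compares the resulting distributions with JSD:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd_drift(train_values, serving_values, n_bins=30, threshold=0.1):
    """Return (jsd_in_bits, drift_flag) for two 1-D feature samples."""
    lo = min(train_values.min(), serving_values.min())
    hi = max(train_values.max(), serving_values.max())
    bins = np.linspace(lo, hi, n_bins + 1)

    # Histogram both samples on the same bins, then normalize to probabilities.
    p, _ = np.histogram(train_values, bins=bins)
    q, _ = np.histogram(serving_values, bins=bins)
    p = p / p.sum()
    q = q / q.sum()

    jsd = jensenshannon(p, q, base=2) ** 2   # bounded in [0, 1] bits, no smoothing needed
    return jsd, jsd > threshold

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=10_000)
serving = rng.normal(loc=1.0, scale=1.5, size=10_000)   # shifted and widened distribution
print(jsd_drift(train, serving))
```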

Bioinformatics

Compare gene expression profiles, protein sequence distributions, or microbiome compositions across samples.