Information Theory

Shannon Entropy

The mathematical measure of uncertainty, surprise, and information. The foundation of all information theory.

Introduction

In 1948, Claude Shannon published "A Mathematical Theory of Communication," creating the field of Information Theory. He asked a deceptively simple question: "How do we measure information?"

Before Shannon, "information" was vague and philosophical. After Shannon, information became a precise, measurable quantity, as fundamental as mass or energy. This single insight underlies everything from data compression (ZIP files) to neural networks (Cross-Entropy loss).

Shannon's Key Insight

Information is the resolution of uncertainty. If you already knew something with certainty, telling you provides zero information. If you didn't know it at all, telling you provides maximum information.

Intuition: Measuring Surprise

Think of information as surprise. Unlikely events are surprising; likely events are not.

Low Information (No Surprise)

"The sun rose this morning."

P(sun rises) = 1.0. You knew this would happen. The message tells you nothing you didn't already know.

High Information (Surprising)

"It snowed in the Sahara Desert today."

P(Sahara snow) = 0.0001. This is shocking! The message carries enormous information.

"Information is inversely proportional to probability."

Bits: The Unit of Information

Shannon defined the unit of information as the bit (binary digit). One bit is the amount of information gained from learning the outcome of a fair coin flip.

How Many Bits?

  • 1 bit: 2 equally likely outcomes (coin flip)
  • 2 bits: 4 outcomes (00, 01, 10, 11)
  • 3 bits: 8 outcomes (2^3)

\text{Bits needed} = \log_2(\text{Number of equally likely outcomes})

To identify one person among 8 billion, you need \log_2(8 \times 10^9) \approx 33 bits. That's 33 yes/no questions.
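
A minimal sketch of this calculation in Python, using only the standard library (the helper name bits_needed is mine, chosen for illustration):

```python
import math

def bits_needed(num_outcomes: float) -> float:
    """Bits required to identify one of `num_outcomes` equally likely outcomes."""
    return math.log2(num_outcomes)

print(bits_needed(2))      # 1.0   -> coin flip
print(bits_needed(8))      # 3.0   -> 8 outcomes
print(bits_needed(8e9))    # ~32.9 -> about 33 yes/no questions to single out one person
```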

Information Content (Surprisal)

If an event x has probability P(x), its information content (also called "surprisal" or "self-information") is:

I(x) = -\log_2 P(x) = \log_2 \frac{1}{P(x)}

Measured in bits when using \log_2. Use \ln for "nats" (natural units).

Why Logarithm?

Information is additive. Two independent coin flips provide 1 + 1 = 2 bits of information.

Probabilities multiply (0.5 \times 0.5 = 0.25). Logarithms turn multiplication into addition:

\log(P_1 \cdot P_2) = \log(P_1) + \log(P_2)

  • P(x) = 1.0 → I(x) = 0 bits (certainty: no surprise)
  • P(x) = 0.5 → I(x) = 1 bit (fair coin flip)
  • P(x) = 0.001 → I(x) ≈ 9.97 bits (rare event: very surprising)
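
A small sketch of the surprisal formula reproduces these values (the function name surprisal is mine, not from any particular library):

```python
import math

def surprisal(p: float) -> float:
    """Information content I(x) = -log2(P(x)), in bits."""
    if p <= 0 or p > 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

for p in (1.0, 0.5, 0.001):
    print(f"P(x) = {p:<6} -> I(x) = {surprisal(p):.2f} bits")
# P(x) = 1.0    -> I(x) = 0.00 bits
# P(x) = 0.5    -> I(x) = 1.00 bits
# P(x) = 0.001  -> I(x) = 9.97 bits
```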

Interactive: Information Content

Explore the inverse relationship between probability and information. Drag the slider to see how rare events carry more information!

[Interactive chart: probability P(x) on the horizontal axis, information content I(x) = -log₂(P(x)) in bits on the vertical axis. The readout at P(x) = 0.5 shows 1.00 bits.]

Key Observations

  • As probability approaches 0, information content goes to infinity (very surprising events)
  • When P(x) = 1.0, I(x) = 0 (certain events carry no information)
  • When P(x) = 0.5, I(x) = 1 bit (a fair coin flip)
  • The relationship is logarithmic: information adds when probabilities multiply

Entropy: The Expected Surprise

I(x) is the information for a single specific event. Entropy H is the average information content of the entire probability distribution. It tells us how "unpredictable" the source is on average.

H(X) = E[I(X)] = -\sum_{x} P(x) \log_2 P(x)

Entropy is the expected value of information content.

Low Entropy (Predictable)

Biased coin: P(H) = 0.99, P(T) = 0.01

H = 0.08 bits. The outcome is almost certain.

High Entropy (Uncertain)

Fair coin: P(H) = 0.5, P(T) = 0.5

H = 1.0 bit. Maximum uncertainty for 2 outcomes.
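
A minimal Python sketch of the entropy formula reproduces both numbers (the entropy helper is my own, not a library function):

```python
import math

def entropy(probs) -> float:
    """Shannon entropy H(X) = -sum p*log2(p), in bits. Terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(f"Biased coin (0.99, 0.01): H = {entropy([0.99, 0.01]):.2f} bits")  # 0.08
print(f"Fair coin   (0.50, 0.50): H = {entropy([0.50, 0.50]):.2f} bits")  # 1.00
```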

Interactive: Binary Entropy

Explore how entropy changes as you adjust the probability. Notice that entropy is maximized when the outcome is most uncertain.

[Interactive chart: the binary entropy function H(X) versus P(Heads), with per-outcome surprise readouts; the curve peaks at 1 bit when P(Heads) = 0.5.]

Maximum Uncertainty

When P(Heads) = 0.5, the outcome is completely unpredictable. You need exactly 1 bit of information to know the result.

Interactive: Entropy Comparison

Compare entropy across different probability distributions. See how uniformity maximizes entropy!

[Interactive chart: bar chart of a four-outcome distribution, here uniform with P(xᵢ) = 0.25, with an entropy readout of 2.00 bits (100% of the 2-bit maximum).]

Entropy Calculation

H(X) = -\sum_{i=1}^{4} P(x_i) \log_2 P(x_i)
-0.25 × log₂(0.25) = 0.500
-0.25 × log₂(0.25) = 0.500
-0.25 × log₂(0.25) = 0.500
-0.25 × log₂(0.25) = 0.500
H(X) = 2.000 bits

Uniform: All outcomes equally likely - maximum entropy

Key Properties of Entropy

1. Non-Negativity

H(X) \ge 0. Entropy is never negative. The minimum is 0 (complete certainty).

2. Maximum for Uniform Distribution

For n outcomes, entropy is maximized when all outcomes are equally likely: H_{max} = \log_2(n).

3. Additivity for Independent Variables

If X and Y are independent: H(X, Y) = H(X) + H(Y). Joint uncertainty is the sum of individual uncertainties.

4. Conditioning Reduces Entropy

H(X|Y) \le H(X). Knowing something about Y can only reduce (or maintain) uncertainty about X. This is related to Mutual Information.
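
These properties can be checked numerically. The sketch below (all helper names are mine) verifies additivity for two independent fair dice and the uniform-maximum property for a skewed four-outcome distribution:

```python
import math

def entropy(probs) -> float:
    """H(X) = -sum p*log2(p), in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Additivity: two independent fair dice.
die = [1 / 6] * 6
joint = [px * py for px in die for py in die]       # independence: P(x, y) = P(x)P(y)
print(entropy(joint), entropy(die) + entropy(die))  # both ~5.17 bits

# Maximum for the uniform distribution over 4 outcomes.
print(entropy([0.7, 0.1, 0.1, 0.1]))  # ~1.36 bits, less than...
print(entropy([0.25] * 4))            # 2.0 bits = log2(4)
```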

Worked Examples

Example 1: Fair Coin

P(H) = 0.5, P(T) = 0.5

H(X) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5)
H(X) = -(0.5 \times (-1) + 0.5 \times (-1)) = 1 \text{ bit}

Example 2: Fair Die (6 sides)

Each outcome has probability 1/6.

H(X) = -6 \times \frac{1}{6} \log_2 \frac{1}{6} = \log_2 6 \approx 2.58 \text{ bits}

Example 3: English Text

Letters in English are not uniformly distributed. 'e' appears ~12%, 'z' appears ~0.07%.

Shannon estimated English text has about 1-1.5 bits per character, far less than the 4.7 bits needed for uniform 26 letters. This redundancy enables compression!

Connection to Coding Theory

Entropy is not just abstract. It directly tells us the minimum number of bits needed to encode messages from a source.

Shannon's Source Coding Theorem

You cannot compress data below its entropy. On average, you need at least H(X) bits per symbol.

Example: A source with entropy 2 bits/symbol cannot be compressed below 2 bits/symbol on average, no matter how clever the algorithm.

Huffman Coding

Assign short codes to frequent symbols, long codes to rare symbols. This is how ZIP, JPEG, and MP3 work. The average code length approaches entropy.
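
As an illustration, here is a minimal Huffman coder built on Python's heapq. It is a textbook sketch, not the exact scheme used inside ZIP, JPEG, or MP3, and the example string and helper names are my own. Note how the average code length lands close to the entropy of the symbol distribution:

```python
import heapq
import math
from collections import Counter

def huffman_codes(freqs: dict) -> dict:
    """Build a prefix code: frequent symbols get short codewords."""
    # Heap entries: (frequency, tie_breaker, tree), where a tree is either a
    # symbol (leaf) or a [left, right] pair (internal node).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, [left, right]))
        counter += 1
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, list):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # single-symbol edge case
        return codes
    return walk(heap[0][2])

text = "abracadabra"
freqs = {s: c / len(text) for s, c in Counter(text).items()}
codes = huffman_codes(freqs)

source_entropy = -sum(p * math.log2(p) for p in freqs.values())
avg_len = sum(freqs[s] * len(codes[s]) for s in freqs)
print(codes)                                             # 'a' gets the shortest codeword
print(f"entropy  = {source_entropy:.3f} bits/symbol")    # ~2.04
print(f"avg code = {avg_len:.3f} bits/symbol")           # ~2.09, close to the entropy
```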

ML Applications

Cross-Entropy Loss

The most common loss function for classification. It measures how well a model's predicted distribution QQ matches the true distribution PP. See the Cross-Entropy page for details.
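
A small sketch of how cross-entropy is computed for a single classification example, in plain Python rather than any specific framework (frameworks typically use the natural logarithm, i.e. nats, instead of bits; the distributions below are made up for illustration):

```python
import math

def cross_entropy(p, q) -> float:
    """H(P, Q) = -sum p*log2(q): average bits to encode P using a code built for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p_true = [1.0, 0.0, 0.0]      # one-hot true label: class 0
q_good = [0.9, 0.05, 0.05]    # confident, correct prediction
q_bad  = [0.1, 0.6, 0.3]      # confident, wrong prediction

print(cross_entropy(p_true, q_good))  # ~0.15 bits (low loss)
print(cross_entropy(p_true, q_bad))   # ~3.32 bits (high loss)
```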

Decision Trees (Information Gain)

Trees choose splits that maximize Information Gain = reduction in entropy. The feature that most reduces uncertainty is chosen. See Information Gain.
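
A toy numeric illustration of information gain for one candidate split (helper names and data are mine; real decision-tree libraries perform this calculation internally):

```python
import math

def entropy(labels) -> float:
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Parent node: 5 positives, 5 negatives.
parent = ["+"] * 5 + ["-"] * 5
# A candidate feature splits it into two fairly pure children.
left, right = ["+"] * 4 + ["-"] * 1, ["+"] * 1 + ["-"] * 4

weighted_children = (len(left) / len(parent)) * entropy(left) \
                  + (len(right) / len(parent)) * entropy(right)
gain = entropy(parent) - weighted_children
print(f"H(parent) = {entropy(parent):.3f} bits")  # 1.000
print(f"gain      = {gain:.3f} bits")             # ~0.278
```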

Variational Autoencoders (KL Divergence)

VAEs use KL Divergence to measure the difference between the learned latent distribution and a prior (usually Gaussian).
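
For discrete distributions, KL divergence is a one-liner. Shown alongside it, as a sketch under the standard VAE assumptions (diagonal Gaussian posterior against a standard normal prior), is the closed-form KL term; the helper names and example numbers are mine:

```python
import math

def kl_divergence(p, q) -> float:
    """D_KL(P || Q) = sum p*log2(p/q), in bits, for discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # ~0.74 bits: Q is a poor model of P

def gaussian_kl_to_standard_normal(mu, log_var) -> float:
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)) summed over dimensions, in nats.
    This is the regularization term in a standard VAE loss."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1 for m, lv in zip(mu, log_var))

print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))    # 0.0: latent matches the prior
print(gaussian_kl_to_standard_normal([1.0, -0.5], [0.3, -0.2]))  # > 0: latent drifts from the prior
```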