Introduction
In 1948, Claude Shannon published "A Mathematical Theory of Communication," creating the field of Information Theory. He asked a deceptively simple question: "How do we measure information?"
Before Shannon, "information" was vague and philosophical. After Shannon, information became a precise, measurable quantity, as fundamental as mass or energy. This single insight underlies everything from data compression (ZIP files) to neural networks (Cross-Entropy loss).
Shannon's Key Insight
Information is the resolution of uncertainty. If you already knew something with certainty, telling you provides zero information. If you didn't know it at all, telling you provides maximum information.
Intuition: Measuring Surprise
Think of information as surprise. Unlikely events are surprising; likely events are not.
Low Information (No Surprise)
"The sun rose this morning."
P(sun rises) = 1.0. You knew this would happen. The message tells you nothing you didn't already know.
High Information (Surprising)
"It snowed in the Sahara Desert today."
P(Sahara snow) = 0.0001. This is shocking! The message carries enormous information.
"Information is inversely proportional to probability."
Bits: The Unit of Information
Shannon defined the unit of information as the bit (binary digit). One bit is the amount of information gained from learning the outcome of a fair coin flip.
How Many Bits?
- 1 bit: 2 equally likely outcomes (a coin flip)
- 2 bits: 4 outcomes (00, 01, 10, 11)
- 3 bits: 8 outcomes (2^3)
To identify one person among 8 billion, you need $\log_2(8 \times 10^9) \approx 33$ bits. That's 33 yes/no questions.
Information Content (Surprisal)
If an event has probability $P(x)$, its information content (also called "surprisal" or "self-information") is:

$$I(x) = -\log_2 P(x)$$

Measured in bits when using $\log_2$. Use the natural logarithm $\ln$ for "nats" (natural units).
Why Logarithm?
Information is additive. Two independent coin flips provide $1 + 1 = 2$ bits of information.
Probabilities multiply for independent events ($P(x, y) = P(x)\,P(y)$). Logarithms turn multiplication into addition:

$$I(x, y) = -\log_2\big(P(x)\,P(y)\big) = -\log_2 P(x) - \log_2 P(y) = I(x) + I(y)$$
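To make this concrete, here is a minimal Python sketch (the surprisal helper is my own naming, not from the text) showing both the inverse relationship and the additivity of independent events:

```python
import math

def surprisal(p: float) -> float:
    """Information content I(x) = -log2(P(x)), in bits."""
    return -math.log2(p)

# Rare events carry more information than common ones.
print(surprisal(1.0))     # 0.0 bits: a certain event tells you nothing
print(surprisal(0.5))     # 1.0 bit:  a fair coin flip
print(surprisal(0.0001))  # ~13.3 bits: Sahara snow is very surprising

# Additivity: two independent fair coin flips.
p_joint = 0.5 * 0.5                      # probabilities multiply
print(surprisal(p_joint))                # 2.0 bits
print(surprisal(0.5) + surprisal(0.5))   # 2.0 bits: surprisals add
```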
Interactive: Information Content
Explore the inverse relationship between probability and information: drag the slider to see how $I(x) = -\log_2 P(x)$ grows as $P(x)$ shrinks. Rare events carry more information!
Key Observations
- As probability approaches 0, information content goes to infinity (very surprising events)
- When P(x) = 1.0, I(x) = 0 (certain events carry no information)
- When P(x) = 0.5, I(x) = 1 bit (a fair coin flip)
- The relationship is logarithmic: information adds when probabilities multiply
Entropy: The Expected Surprise
$I(x)$ is the information content of a single specific event. Entropy is the average information content of the entire probability distribution. It tells us how "unpredictable" the source is on average.
Entropy is the expected value of information content:

$$H(X) = \mathbb{E}[I(X)] = -\sum_x P(x)\,\log_2 P(x)$$
Low Entropy (Predictable)
Biased coin: P(H) = 0.99, P(T) = 0.01
H = 0.08 bits. The outcome is almost certain.
High Entropy (Uncertain)
Fair coin: P(H) = 0.5, P(T) = 0.5
H = 1.0 bit. Maximum uncertainty for 2 outcomes.
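A short Python sketch (my own entropy helper, not a library call) reproducing these two numbers:

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum p * log2(p), in bits (p = 0 terms contribute nothing)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.99, 0.01]))  # ~0.08 bits: heavily biased coin, almost predictable
print(entropy([0.5, 0.5]))    # 1.0 bit: fair coin, maximum uncertainty for 2 outcomes
```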
Interactive: Binary Entropy
Explore how entropy changes as you adjust the probability. Notice that entropy is maximized when the outcome is most uncertain.
Maximum Uncertainty
When P(Heads) = 0.5, the outcome is completely unpredictable. You need exactly 1 bit of information to know the result.
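A small sketch of the binary entropy function $H(p) = -p\log_2 p - (1-p)\log_2(1-p)$ (plain Python, no plotting), showing the peak of exactly 1 bit at $p = 0.5$:

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy of a coin with P(Heads) = p, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy rises from 0, peaks at exactly 1 bit when p = 0.5, then falls back to 0.
for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"P(Heads) = {p:.1f}  ->  H = {binary_entropy(p):.3f} bits")
```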
Interactive: Entropy Comparison
Compare entropy across different probability distributions. See how uniformity maximizes entropy!
A uniform distribution (all outcomes equally likely) has the highest entropy; a deterministic one has zero entropy.
Key Properties of Entropy
1. Non-Negativity
$H(X) \ge 0$. Entropy is never negative. The minimum is 0 (complete certainty).
2. Maximum for Uniform Distribution
For $n$ outcomes, entropy is maximized when all outcomes are equally likely: $H(X) = \log_2 n$.
3. Additivity for Independent Variables
If X and Y are independent: $H(X, Y) = H(X) + H(Y)$. Joint uncertainty is the sum of individual uncertainties.
4. Conditioning Reduces Entropy
$H(X \mid Y) \le H(X)$. Knowing something about Y can only reduce (or maintain) uncertainty about X. This is related to Mutual Information.
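A quick numerical check of properties 2 and 3 (the helper names and example distributions are mine):

```python
import math
from itertools import product

def entropy(probs):
    """H = -sum p * log2(p), in bits (p = 0 terms contribute nothing)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Property 2: the uniform distribution over n outcomes maximizes entropy.
n = 4
print(entropy([1 / n] * n), math.log2(n))        # both 2.0 bits
print(entropy([0.7, 0.1, 0.1, 0.1]))             # ~1.36 bits, less than 2.0

# Property 3: for independent X and Y, H(X, Y) = H(X) + H(Y).
px = [0.5, 0.5]            # fair coin
py = [1 / 6] * 6           # fair die
joint = [a * b for a, b in product(px, py)]      # independence: P(x, y) = P(x)P(y)
print(entropy(joint), entropy(px) + entropy(py)) # both ~3.585 bits
```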
Worked Examples
Example 1: Fair Coin
$H = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1$ bit.
Example 2: Fair Die (6 sides)
Each outcome has probability $1/6$, so $H = -\sum_{i=1}^{6} \frac{1}{6}\log_2\frac{1}{6} = \log_2 6 \approx 2.58$ bits.
Example 3: English Text
Letters in English are not uniformly distributed. 'e' appears ~12%, 'z' appears ~0.07%.
Shannon estimated English text carries about 1-1.5 bits per character, far less than the $\log_2 26 \approx 4.7$ bits that 26 equally likely letters would require. This redundancy is what enables compression!
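As a rough illustration (the text sample below is an arbitrary repeated pangram, not a real corpus), you can estimate the single-letter entropy of some text. Note this unigram estimate ignores the context between letters that pushes Shannon's figure down toward 1-1.5 bits per character:

```python
import math
from collections import Counter

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

text = "the quick brown fox jumps over the lazy dog " * 50
letters = [c for c in text.lower() if c.isalpha()]
counts = Counter(letters)
probs = [n / len(letters) for n in counts.values()]

# Single-letter (unigram) entropy of this sample vs. the uniform upper bound.
print(f"{entropy(probs):.2f} bits per letter vs log2(26) = {math.log2(26):.2f}")
```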
Connection to Coding Theory
Entropy is not just abstract. It directly tells us the minimum number of bits needed to encode messages from a source.
Shannon's Source Coding Theorem
You cannot compress data below its entropy. On average, you need at least $H(X)$ bits per symbol.
Example: A source with entropy 2 bits/symbol cannot be compressed below 2 bits/symbol on average, no matter how clever the algorithm.
Huffman Coding
Assign short codes to frequent symbols and long codes to rare symbols, so the average code length approaches the entropy. Huffman coding is a building block of formats like ZIP, JPEG, and MP3.
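Here is a compact Huffman sketch (a minimal toy implementation of my own, not the exact coder used in ZIP, JPEG, or MP3) showing the average code length landing on the entropy for a source with power-of-two probabilities:

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return the code length Huffman's algorithm assigns to each symbol index."""
    # Heap items: (total probability, unique tiebreaker, list of (symbol, depth)).
    heap = [(p, i, [(i, 0)]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)       # merge the two least probable nodes
        p2, _, b = heapq.heappop(heap)
        merged = [(sym, depth + 1) for sym, depth in a + b]
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return dict(heap[0][2])                  # symbol -> code length in bits

probs = [0.5, 0.25, 0.125, 0.125]            # a skewed 4-symbol source
lengths = huffman_code_lengths(probs)
avg_len = sum(p * lengths[i] for i, p in enumerate(probs))
entropy = -sum(p * math.log2(p) for p in probs)
print(avg_len, entropy)                      # 1.75 and 1.75: dyadic probabilities match exactly
```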
ML Applications
Cross-Entropy Loss
The most common loss function for classification. It measures how well a model's predicted distribution $Q$ matches the true distribution $P$. See the Cross-Entropy page for details.
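A toy numerical example (made-up predictions, using natural-log nats as most ML frameworks do) of what this loss rewards and penalizes:

```python
import math

def cross_entropy(p_true, q_pred):
    """H(P, Q) = -sum P(x) * log(Q(x)), in nats."""
    return -sum(p * math.log(q) for p, q in zip(p_true, q_pred) if p > 0)

p_true = [0.0, 1.0, 0.0]        # one-hot true label: class 1
q_good = [0.1, 0.8, 0.1]        # confident, correct prediction
q_bad  = [0.6, 0.2, 0.2]        # confident, wrong prediction

print(cross_entropy(p_true, q_good))  # ~0.22: low loss
print(cross_entropy(p_true, q_bad))   # ~1.61: high loss
```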
Decision Trees (Information Gain)
Trees choose splits that maximize Information Gain = reduction in entropy. The feature that most reduces uncertainty is chosen. See Information Gain.
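A minimal information-gain sketch with made-up labels for a single candidate split:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

parent = ["yes"] * 5 + ["no"] * 5                          # 50/50 before splitting: 1 bit
left, right = ["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4   # one candidate split

weighted_child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
gain = entropy(parent) - weighted_child
print(gain)  # ~0.28 bits of uncertainty removed by this split
```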
Variational Autoencoders (KL Divergence)
VAEs use KL Divergence to measure the difference between the learned latent distribution and a prior (usually Gaussian).
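And a discrete KL divergence sketch with toy distributions (VAEs actually use a closed-form Gaussian KL term, but the underlying quantity is the same):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum P(x) * log2(P(x) / Q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [1 / 3, 1 / 3, 1 / 3]
print(kl_divergence(p, q))  # ~0.10 bits: how far P is from the uniform Q
print(kl_divergence(p, p))  # 0.0: a distribution diverges from itself by nothing
```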