Information Theory

Perplexity

The standard evaluation metric for language models. How confused is your model?

Introduction

When you read "The cat sat on the ___", your brain instantly predicts words like "mat", "floor", or "couch". You are confident about what comes next. But for "I really want to ___", many words could follow: eat, sleep, dance, leave, scream. You are less certain.

Perplexity measures exactly this: how "surprised" or "confused" a language model is when predicting the next word. It quantifies the model's uncertainty across an entire text.

Why Do We Need Perplexity?

Language models like GPT output probability distributions over thousands of possible next tokens. But how do we compare models? We need a single number that captures "how good are these predictions overall?"

We could use raw probability, but multiplying thousands of small probabilities gives tiny numbers like 0.0000000001. Perplexity converts this into an interpretable scale.

The connection to information theory: Entropy measures uncertainty in bits. Perplexity transforms entropy into an "effective number of choices" by exponentiating: $\text{PPL} = 2^H$. This gives us a more intuitive interpretation.

The Core Idea

Perplexity = "How many equally likely choices is the model effectively picking from?"

If a language model has perplexity 50, it is "as confused as if it had to choose uniformly among 50 options at each position." The model's complex probability distribution is equivalent to rolling a 50-sided die.

Lower perplexity = higher confidence = better predictions. A model that "knows" what comes next has low perplexity.
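To make this concrete, here is a minimal Python sketch (the example distributions are my own, not from any specific model) showing that a uniform distribution over 50 options has perplexity exactly 50, while a confident distribution has far fewer "effective choices":

```python
import math

def perplexity(probs):
    """Perplexity of a single distribution: 2 ** entropy (in bits)."""
    entropy_bits = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy_bits

# A uniform distribution over 50 options has perplexity exactly 50.
print(perplexity([1 / 50] * 50))        # 50.0 (up to floating-point error)

# A sharply peaked distribution has far fewer "effective choices".
print(perplexity([0.95, 0.03, 0.02]))   # ~1.3
```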

Simple Analogy: The Guessing Game

Imagine playing 20 questions, but instead of yes/no, you guess the next word in a sentence.

Expert Player (Low PPL)

"The capital of France is ___" - You immediately say "Paris" with 95% confidence. You only need to consider ~1-2 realistic options.

Confused Player (High PPL)

"She decided to ___" - Could be anything! eat, leave, stay, cry, laugh... You are effectively choosing from 100+ options.

Perplexity Scale Reference

PPL    Interpretation
1      Perfect model: always 100% sure (unrealistic for natural language)
~20    Good LLM: GPT-level on benchmarks (state-of-the-art range)
V      Random guess: uniform over all tokens, where V = vocab size (~50K)

Intuition: Effective Choices

Imagine a model predicting the next word in a sentence. At each position, it assigns probabilities to all possible next tokens.

Low Perplexity (Good)

"The capital of France is ___"

Model is 95% confident it's "Paris." $\text{PPL} \approx 1.2$

High Perplexity (Confused)

"I really like to ___"

Could be eat, sleep, dance, run, etc. $\text{PPL} \approx 50+$

Real models output complex probability distributions. Perplexity converts that distribution into an equivalent "uniform over N choices" number.

Perplexity as "Effective Choices"

Perplexity measures how confused the model is. It represents the number of "equally likely options" the model is choosing from.

[Interactive demo: a temperature slider (set to 1.00) adjusts a distribution over 10 possible tokens from sharp (confident) to flat (uncertain). At this setting the entropy is 2.40 bits, equivalent to about 5.27 "effective" tokens.]

$$\text{PPL} = 2^{\text{Entropy}} = 2^{2.40} \approx 5.27$$
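The sharpness slider can be approximated in a few lines. Sharpening or flattening a softmax with a temperature parameter changes the entropy and hence the perplexity; the logits below are made up for illustration, so the exact numbers will differ from the 2.40 bits / 5.27 shown above, but the trend is the same: low temperature gives a sharp distribution and low perplexity, while high temperature pushes the distribution toward uniform over the 10 tokens (perplexity approaching 10).

```python
import math

def softmax(logits, temperature):
    """Convert logits to probabilities, sharper for low temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [2.0, 1.2, 0.8, 0.5, 0.1, -0.2, -0.5, -0.9, -1.3, -1.8]  # 10 hypothetical tokens

for temperature in (0.3, 1.0, 3.0):
    h = entropy_bits(softmax(logits, temperature))
    print(f"T={temperature}: entropy={h:.2f} bits, perplexity={2 ** h:.2f}")
```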

The Formula

For a single distribution with entropy $H$:

$$\text{PPL} = 2^{H}$$

Perplexity is 2 raised to the power of entropy (in bits).

For a sequence of $N$ tokens, we use the average negative log-likelihood:

$$\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \ln p_{\text{model}}(x_i \mid x_{<i})\right)$$
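A direct translation of this formula into Python (a sketch, not any particular library's API): given the probabilities the model assigned to the tokens that actually occurred, average the negative logs and exponentiate.

```python
import math

def sequence_perplexity(token_probs):
    """Perplexity from the probabilities p(x_i | x_<i) the model gave each observed token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)  # nats per token
    return math.exp(avg_nll)

# Perplexity is the geometric mean of the inverse probabilities:
# (2 * 4 * 8) ** (1/3) = 4
print(sequence_perplexity([0.5, 0.25, 0.125]))   # 4.0
```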

Bits vs Nats

Using $\log_2$ gives entropy in bits and $\text{PPL} = 2^H$. Using $\ln$ gives entropy in nats (natural log) and $\text{PPL} = e^H$. Deep learning frameworks typically use nats, but the concepts are identical.
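A quick numeric check, on an arbitrary example distribution, that the two conventions agree once you exponentiate:

```python
import math

probs = [0.5, 0.3, 0.2]

h_bits = -sum(p * math.log2(p) for p in probs)   # entropy in bits
h_nats = -sum(p * math.log(p) for p in probs)    # entropy in nats

# Same perplexity either way: 2^(H in bits) == e^(H in nats)
print(2 ** h_bits, math.exp(h_nats))   # both ~2.80
```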

Why Exponentiate?

Entropy is additive: $H(A, B) = H(A) + H(B)$ for independent events.

Perplexity is multiplicative: $\text{PPL}(A, B) = \text{PPL}(A) \times \text{PPL}(B)$. This makes it easier to interpret as a "number of choices."
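A small sanity check with two independent toy distributions (my own example): their joint entropy is the sum and their joint perplexity is the product.

```python
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_a = [0.5, 0.5]        # 1 bit of entropy, perplexity 2
p_b = [0.25] * 4        # 2 bits of entropy, perplexity 4

# Joint distribution over all independent (a, b) pairs.
joint = [pa * pb for pa in p_a for pb in p_b]

print(entropy_bits(joint))          # 3.0  (1 + 2: entropy adds)
print(2 ** entropy_bits(joint))     # 8.0  (2 * 4: perplexity multiplies)
```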

Interactive: Word Prediction

See how a language model assigns probabilities to next words. Different contexts lead to different levels of certainty.

Next Word Prediction

Simulating a language model's probability distribution for the input "The cat sat on the":

Token     Probability
mat       35.0%  (top)
floor     20.0%
couch     15.0%
bed       10.0%
chair      8.0%
table      5.0%
ground     4.0%
roof       3.0%

Uncertainty (entropy): 2.58 bits
Perplexity: 6.0

Equivalent to choosing from 6.0 options.
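You can verify these numbers directly from the listed probabilities; this short sketch recomputes the entropy and perplexity of the simulated distribution.

```python
import math

# Probabilities from the "The cat sat on the ___" example above.
probs = {
    "mat": 0.35, "floor": 0.20, "couch": 0.15, "bed": 0.10,
    "chair": 0.08, "table": 0.05, "ground": 0.04, "roof": 0.03,
}

entropy = -sum(p * math.log2(p) for p in probs.values())
print(f"Entropy:    {entropy:.2f} bits")   # ~2.58 bits
print(f"Perplexity: {2 ** entropy:.1f}")   # ~6.0
```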

Step-by-Step Calculation

Calculate perplexity for a simple 4-word sequence:

Sequence
"The cat sat down"

Step 1: Get Model Probabilities

  • P("The" | <BOS>) = 0.15
  • P("cat" | "The") = 0.05
  • P("sat" | "The cat") = 0.20
  • P("down" | "The cat sat") = 0.25

Step 2: Compute Log Probabilities

  • $\ln(0.15) = -1.897$
  • $\ln(0.05) = -2.996$
  • $\ln(0.20) = -1.609$
  • $\ln(0.25) = -1.386$

Step 3: Average Negative Log-Likelihood

$$\text{NLL} = \frac{1.897 + 2.996 + 1.609 + 1.386}{4} = 1.972 \text{ nats}$$

Step 4: Exponentiate

$$\text{PPL} = e^{1.972} \approx \mathbf{7.18}$$

The model is "as confused as choosing from 7 equally likely words" on average.
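The whole calculation fits in a few lines of Python; the small difference from 7.18 comes only from rounding the log values above.

```python
import math

# Step 1: model probabilities for "The cat sat down"
probs = [0.15, 0.05, 0.20, 0.25]

# Step 2: log probabilities (nats)
log_probs = [math.log(p) for p in probs]

# Step 3: average negative log-likelihood
nll = -sum(log_probs) / len(log_probs)

# Step 4: exponentiate
ppl = math.exp(nll)

print(f"NLL = {nll:.3f} nats, PPL = {ppl:.1f}")   # NLL = 1.972 nats, PPL ~7.2
```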

LLM Benchmarks

Perplexity is the standard metric for comparing language models. Here are typical values on popular benchmarks:

Model          WikiText-103   Penn Treebank   Parameters
GPT-2 Small    37.5           65.9            117M
GPT-2 Large    22.1           40.3            762M
GPT-3 *        ~20            ~20             175B
LLaMA 2 *      ~7             -               70B

Important Note

* Values marked with an asterisk are community estimates; official benchmark numbers have not been published.

Perplexity values are only comparable when using the same tokenizer and same test set. A model with BPE tokenization cannot be directly compared to one with word-level tokenization.
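In practice, benchmark perplexity is computed by averaging the model's token-level cross-entropy over the test text and exponentiating. A minimal sketch using Hugging Face transformers (assumes torch and transformers are installed; a real benchmark run would slide a window over the full test set rather than scoring one short string):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_name = "gpt2"   # the 117M-parameter GPT-2 Small
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

text = "The cat sat on the mat."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the average token-level
    # cross-entropy (in nats) for next-token prediction.
    out = model(**enc, labels=enc["input_ids"])

print(f"Perplexity: {torch.exp(out.loss).item():.1f}")
```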

Bits Per Character (BPC)

For character-level models, we often use Bits Per Character instead of perplexity. This is tokenizer-independent.

$$\text{BPC} = \frac{\text{Cross-Entropy Loss (nats per character)}}{\ln(2)}$$

BPC to PPL

$\text{PPL} = 2^{\text{BPC}}$ (per-character perplexity)

Typical Values

Good character LM: 1.0-1.5 BPC
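A small conversion sketch, assuming the character-level model reports its average cross-entropy per character in nats (the value 0.90 below is hypothetical):

```python
import math

# Cross-entropy loss per character, in nats (hypothetical character-level LM)
loss_nats_per_char = 0.90

bpc = loss_nats_per_char / math.log(2)   # convert nats -> bits per character
per_char_ppl = 2 ** bpc                  # per-character perplexity

print(f"BPC: {bpc:.2f}")                 # ~1.30, in the typical 1.0-1.5 range
print(f"Per-character perplexity: {per_char_ppl:.2f}")
```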

Connection to Entropy

Minimizing Perplexity is mathematically equivalent to minimizing Cross-Entropy loss.

Entropy: $H$ (bits needed)

Cross-Entropy: $H(p, q)$ (training loss)

Perplexity: $2^H$ (effective choices)

Key Relationship

$\text{PPL} = \exp(\text{Cross-Entropy Loss})$. When your training loss (in nats) goes down by 0.1, your perplexity is multiplied by $e^{-0.1} \approx 0.9$.
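The multiplicative effect of an additive loss change is easy to verify (the loss values below are hypothetical):

```python
import math

loss_before = 3.00   # cross-entropy loss in nats per token (hypothetical)
loss_after = 2.90    # the same loss after dropping by 0.1

ppl_before = math.exp(loss_before)   # ~20.1
ppl_after = math.exp(loss_after)     # ~18.2

# The ratio depends only on the loss difference: e^(-0.1) ~ 0.905
print(ppl_after / ppl_before)
```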

Limitations & Caveats

1. Not Task-Specific

Low perplexity measures how well a model predicts the next token, but this does not directly translate to downstream task performance. A model can be excellent at predicting common word sequences yet fail at:

  • Reasoning tasks: Predicting "2" after "1 + 1 =" requires understanding, not just pattern matching
  • Factual accuracy: A model might confidently predict plausible but incorrect facts
  • Instruction following: Low PPL on text does not mean the model follows user instructions well
  • Safety: Fluent generation of harmful content would still show low perplexity

This is why modern LLM evaluation uses task-specific benchmarks (MMLU, HumanEval, etc.) alongside perplexity.

2. Tokenizer Dependent

Perplexity is computed per-token, so the choice of tokenizer fundamentally affects the score. Consider encoding "unhappiness":

  • Tokenizer A (3 tokens): ["un", "happi", "ness"], so PPL is measured over 3 predictions
  • Tokenizer B (1 token): ["unhappiness"], so PPL is measured over 1 prediction

For the same total surprisal on the word, Tokenizer A's per-token perplexity looks lower simply because that surprisal is spread over more (easier) predictions, while Tokenizer B makes one harder prediction. Neither score is "wrong," but the two numbers cannot be directly compared.

Solution: Use bits-per-character (BPC) or bits-per-byte for tokenizer-agnostic comparison.
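The sketch below isolates the normalization effect with hypothetical numbers: it assumes both tokenizers assign the same total negative log-likelihood to the word (in practice they would not match exactly), and shows that per-token perplexity depends on the token count while bits per character does not.

```python
import math

text = "unhappiness"     # 11 characters
total_nll_nats = 6.0     # hypothetical total negative log-likelihood for the word

# Per-token perplexity depends on how many tokens the tokenizer produced.
for name, num_tokens in [("Tokenizer A", 3), ("Tokenizer B", 1)]:
    print(f"{name}: per-token PPL = {math.exp(total_nll_nats / num_tokens):.1f}")

# Bits per character divides by characters instead of tokens,
# so the number is the same no matter how the word was tokenized.
bpc = total_nll_nats / (len(text) * math.log(2))
print(f"Bits per character: {bpc:.2f}")
```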

3. Domain Sensitivity

Language models learn the statistical patterns of their training data. When evaluated on a different domain, perplexity can change dramatically:

Model trained on   News PPL   Code PPL   Medical PPL
News articles      ~25        ~150       ~80
GitHub code        ~90        ~15        ~200

A "good" perplexity on one domain means nothing for another. Rare terminology, different syntax patterns, and specialized jargon all increase perplexity.

Always evaluate on data representative of your target use case.

4. Memorization vs Understanding

A model that has seen the test set during training can achieve artificially low perplexity by memorizing rather than learning generalizable patterns:

PPL = 1.0 on the test set: the model assigns 100% probability to each correct token. A perfect score, but is it understanding or memorization?

Signs of memorization over understanding:

  • Large gap between train and test perplexity (overfitting)
  • Model reproduces training examples verbatim when prompted
  • Poor performance on paraphrased or novel formulations
  • Fails on out-of-distribution inputs despite low benchmark PPL

Mitigations: Use held-out test sets, check for data contamination, evaluate on diverse benchmarks.
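One of the simplest checks follows directly from the first sign above: compare perplexity on the training split against a held-out split (the losses below are hypothetical).

```python
import math

# Hypothetical average losses (nats per token) on each split.
train_loss, test_loss = 2.1, 3.4

train_ppl = math.exp(train_loss)   # ~8.2
test_ppl = math.exp(test_loss)     # ~30.0

# A large train/test gap is a red flag for overfitting or memorization.
print(f"train PPL {train_ppl:.1f} vs test PPL {test_ppl:.1f} "
      f"(ratio {test_ppl / train_ppl:.1f}x)")
```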

ML Applications

LLM Evaluation

GPT, LLaMA, and other models are benchmarked on WikiText-103, PTB, and other datasets using perplexity. Lower is better, provided the test set and tokenizer are held fixed.

Machine Translation

Decoder perplexity measures fluency of generated translations. Often combined with BLEU for quality assessment.

t-SNE Hyperparameter

t-SNE uses "perplexity" to define effective neighborhood size. Similar concept: how many neighbors to consider.

Speech Recognition

Language model perplexity affects ASR accuracy. Lower LM perplexity typically lowers word error rate (WER).