Blog Post
From Likelihood to Perplexity: One Unified Reference
Probability, likelihood, log-likelihood, NLL, expectation, entropy, cross-entropy, KL divergence, and perplexity are constantly confused for one another. They are not the same thing — but they are all the same thing seen from different angles. This is the definitive deep-dive that names each one precisely and shows exactly how they connect for LLMs.
The rest of this series tells a story in order, and Part 2 and Part 8 introduce most of the quantities below in the flow of that narrative. This post is the deep-dive reference companion: it slows down on the cluster of ideas that get confused for one another more than any other in machine learning — probability versus likelihood, log-likelihood, negative log-likelihood, expectation, entropy, cross-entropy, KL divergence, and perplexity — and pins down what each one is, with worked numbers, and exactly how they collapse into a single equation for language models.
A note on units before anything else, because mixing logarithm bases is the single most common error in this material. Throughout, $\log$ means natural log and quantities are in nats, unless a formula explicitly converts to bits. The only consequence that matters: perplexity is $e^{\text{NLL}}$ when the NLL is in nats, and bits-per-character involves a division by $\ln 2$ to convert nats to bits. Get the base wrong and every number is off by a factor of $\ln 2 \approx 0.693$.
Probability and likelihood are the same formula read two different ways
Start with the distinction that everything else rests on. A probability and a likelihood are computed from the identical expression, $p(x \mid \theta)$, the chance of observing data $x$ under a model with parameters $\theta$. They hold different things fixed and vary different things, and that difference is not pedantic.
A probability answers "given this fixed model, how probable are the various outcomes?" You sweep over $x$ with $\theta$ held constant, and the values sum or integrate to one over all $x$, because something must happen. A likelihood answers "given this fixed observation, how well do various models explain it?" You sweep over $\theta$ with $x$ held constant at the data you actually saw, and it emphatically does not integrate to one over $\theta$. The likelihood is not a probability distribution over parameters; it is a function that scores parameters by how plausibly each one would have generated the fixed data.
| | Probability | Likelihood |
|---|---|---|
| What's fixed | the model $\theta$ | the observed data $x$ |
| What varies | the outcome $x$ | the parameters $\theta$ |
| Normalizes to 1 over | $x$ (always) | nothing; it need not sum to 1 over $\theta$ |
| Question it answers | "how likely is this outcome?" | "how well does this model explain what I saw?" |
| Used for | sampling, decoding, prediction | fitting, training, MLE |
Maximum likelihood estimation (MLE) is the act of choosing the parameters that maximize the likelihood of the data you observed.
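To make the two readings concrete, here is a minimal sketch using a coin-flip (binomial) model. The data (7 heads in 10 flips), the grid resolution, and the helper name `binom_pmf` are assumptions chosen for illustration, not anything from this series.

```python
import math

def binom_pmf(k: int, n: int, theta: float) -> float:
    """Probability of k heads in n flips of a coin with P(heads) = theta."""
    return math.comb(n, k) * theta**k * (1 - theta) ** (n - k)

n, k_obs = 10, 7  # hypothetical observation: 7 heads in 10 flips

# Probability view: theta is fixed, the outcome k varies. The values sum to 1.
prob_total = sum(binom_pmf(k, n, 0.5) for k in range(n + 1))

# Likelihood view: the data is fixed, theta varies over a grid on [0, 1].
grid = [i / 1000 for i in range(1001)]
lik = [binom_pmf(k_obs, n, th) for th in grid]
lik_area = sum(lik) / len(grid)  # crude Riemann estimate of the integral over theta

# MLE: the grid point where the likelihood is highest.
theta_hat = grid[max(range(len(grid)), key=lik.__getitem__)]

print(f"sum over outcomes  = {prob_total:.6f}")
print(f"area over theta    = {lik_area:.4f}")
print(f"MLE on the grid    = {theta_hat}")
```

The sum over outcomes is 1, the area under the likelihood curve is close to $1/(n+1)$ rather than 1, and the maximizer lands on the observed frequency $k/n$.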
For a language model the "observation" is a corpus, and the model factors the probability of a sequence by the chain rule into a product of next-token conditionals. This is the entire probabilistic content of "autoregressive."

$$p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})$$

Reading this left-to-right with $\theta$ fixed gives the probability the model assigns to a particular text. Reading it with the text fixed and $\theta$ varying gives the likelihood of the parameters. Training is MLE: search for the $\theta$ that makes the observed corpus as likely as possible.
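The chain-rule factorization can be sketched with a toy model. The table `cond` below is a hypothetical bigram model (each token conditions only on the previous one, a simplification of conditioning on the full prefix $x_{<t}$), and the vocabulary and probabilities are invented for illustration.

```python
# Hypothetical next-token table: cond[prev][nxt] = p(nxt | prev).
# "<s>" marks the start of a sequence; every row sums to 1.
cond = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 0.9, "ran": 0.1},
    "dog": {"sat": 0.2, "ran": 0.8},
}

def sequence_prob(tokens):
    """Multiply the next-token conditionals along the sequence."""
    prob, prev = 1.0, "<s>"
    for tok in tokens:
        prob *= cond[prev][tok]
        prev = tok
    return prob

p_text = sequence_prob(["the", "cat", "sat"])  # 0.6 * 0.5 * 0.9

# With the model fixed, the probabilities of all length-3 texts sum to 1.
total = sum(
    sequence_prob([a, b, c])
    for a in cond["<s>"]
    for b in cond[a]
    for c in cond[b]
)
print(p_text, total)
```

Changing the entries of `cond` while keeping the text fixed is the likelihood reading: the same function, now scoring parameters instead of texts.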
Log-likelihood: why we never optimize the likelihood directly
Nobody maximizes the product above as written, for three independent reasons, all fixed by taking a logarithm.
Products become sums. The log of a product is a sum of logs, which turns a fragile chained multiplication into a robust accumulation.
It avoids numerical underflow. This is not a theoretical nicety. Give every token a probability well below one, which is still generous for a vocabulary of tens of thousands, and multiply across a long sequence: the product falls far below the smallest positive double (about $4.9 \times 10^{-324}$) and is flatly zero in IEEE double precision. The sum of logs for the same sequence is a perfectly ordinary negative number of nats. Without the log, the likelihood of any realistic sequence would round to zero and no gradient would flow.
The argmax is preserved. The logarithm is strictly monotonic, so $\arg\max_\theta \log p_\theta(x) = \arg\max_\theta p_\theta(x)$ exactly. Taking the log changes the value of the objective but never where its maximum sits, which is the only thing MLE cares about.
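A minimal sketch of the underflow problem, under assumed numbers: the per-token probability `p_token` and the length `n_tokens` are hypothetical values picked for illustration, not figures from this post.

```python
import math

p_token = 0.01   # hypothetical probability assigned to each observed token
n_tokens = 2048  # hypothetical sequence length

# Naive likelihood: a chained product of per-token probabilities.
product = 1.0
for _ in range(n_tokens):
    product *= p_token

# Log-likelihood: the same quantity accumulated as a sum of logs.
log_likelihood = n_tokens * math.log(p_token)

print(product)         # underflows to 0.0 in double precision
print(log_likelihood)  # roughly -9431 nats, an ordinary float
```

The true product is $10^{-4096}$, hundreds of orders of magnitude below anything a double can hold, while the log-space version stays comfortably in range.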
Because sequences differ in length, the comparable quantity is the average log-likelihood per token, which is what you should picture whenever someone mentions "the loss" without qualification:

$$\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$
Negative log-likelihood is the training loss
MLE maximizes log-likelihood, but every optimizer in deep learning minimizes. Flip the sign and you get the negative log-likelihood (NLL), which is the loss reported on every pretraining dashboard ever made.

$$\text{NLL}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$
The "negative" carries no extra meaning beyond "we minimize instead of maximize": minimizing NLL and maximizing likelihood are the same operation said two ways. What is genuinely worth internalizing is that this NLL is exactly cross-entropy loss with a one-hot target. Cross-entropy between a target distribution $p$ and a predicted distribution $q$ is

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

At a single training position the target is one-hot: it puts all its mass on the token $x_t$ that actually came next and zero everywhere else. Every term in the sum vanishes except the one at $x = x_t$, where $p(x_t) = 1$, collapsing the whole expression to

$$H(p, q) = -\log q(x_t)$$
which is precisely the per-token NLL. This is why nn.CrossEntropyLoss in PyTorch is the language-modeling loss — it fuses a log_softmax (turn logits into log-probabilities) with nll_loss (pick out the negative log-prob of the correct class) in a single numerically stable kernel:
```python
import torch
import torch.nn.functional as F

# Stand-in tensors so the snippet runs on its own; in training,
# logits = model(input_ids) and targets are the next-token ids.
B, T, V = 2, 5, 11
logits = torch.randn(B, T, V)          # (batch, seq_len, vocab)
targets = torch.randint(0, V, (B, T))  # (batch, seq_len) of token ids

loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),  # (B*T, V)
    targets.view(-1),                  # (B*T,)
)

# identical, spelled out: log_softmax, then gather the true token's log-prob
logp = F.log_softmax(logits, dim=-1)
nll = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1).mean()
assert torch.allclose(loss, nll)
```

So NLL = cross-entropy loss = what the optimizer minimizes, and the three names refer to one number.
Expectation is the glue
Everything above is a finite sum over observed tokens, but the meaning lives in an expectation. For a function $f$ and a distribution $p$, the expectation is the probability-weighted average,

$$\mathbb{E}_{x \sim p}[f(x)] = \sum_{x} p(x)\, f(x)$$

and the training loss is exactly the expected surprise of the model under the data distribution:

$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\big[-\log p_\theta(x)\big]$$

The empirical average over the corpus, $-\frac{1}{T} \sum_{t} \log p_\theta(x_t \mid x_{<t})$, is a Monte-Carlo estimate of this expectation: the corpus tokens are samples from $p_{\text{data}}$, and averaging over them estimates the average over the true distribution. This estimator view is the bridge between "the finite loss I compute" and "the population quantity I care about," and it is the lens through which the whole field reasons: the ELBO that trains a VAE, the policy-gradient and REINFORCE estimators behind RL for LLMs, and the KL penalties below are all expectations of log-probabilities, estimated from samples. One identity from this family recurs constantly: the expected log-probability of a distribution under itself is the negative of its entropy,

$$\mathbb{E}_{x \sim p}[\log p(x)] = -H(p)$$
which is the door into information theory.
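The Monte-Carlo reading of the loss can be checked numerically. The sketch below assumes a three-symbol data distribution `p` and a model `q`, both hypothetical, and compares the exact expected surprise with the average over sampled "corpus tokens"; the seed and sample count are arbitrary choices.

```python
import math
import random

p = {"a": 0.5, "b": 0.25, "c": 0.25}  # hypothetical data distribution
q = {"a": 0.4, "b": 0.4, "c": 0.2}    # hypothetical model distribution

# Exact expected surprise of the model under the data: E_{x~p}[-log q(x)].
exact = -sum(p[x] * math.log(q[x]) for x in p)

# Monte-Carlo estimate: draw a "corpus" from p and average the per-token NLL.
rng = random.Random(0)
corpus = rng.choices(list(p), weights=list(p.values()), k=100_000)
estimate = -sum(math.log(q[x]) for x in corpus) / len(corpus)

print(f"exact    = {exact:.4f} nats")
print(f"estimate = {estimate:.4f} nats")
```

The two numbers agree to within sampling noise, which shrinks like $1/\sqrt{N}$ as the corpus grows.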
Entropy: the average surprise of a distribution under itself
Shannon entropy measures how uncertain a distribution is: the expected surprise averaged over $p$'s own outcomes.

$$H(p) = -\sum_{x} p(x) \log p(x) = \mathbb{E}_{x \sim p}\big[-\log p(x)\big]$$
A peaked, confident distribution has low entropy (few surprises); a flat, uncertain one has high entropy. Three anchors make this concrete, computed in bits ($\log_2$) so the numbers are familiar:
- A fair coin: $H = 1$ bit. Maximum uncertainty for two outcomes.
- A biased coin that usually lands heads: strictly less than 1 bit. Knowing it usually lands heads, you are far less surprised on average.
- A uniform distribution over $N$ outcomes: $H = \log_2 N$, the maximum possible. For a vocabulary of $V$ tokens that is $\log_2 V$ bits, the entropy of the output of a model that has learned nothing.
A distribution that places all its mass on one outcome has $H = 0$: nothing it produces is ever a surprise. For continuous distributions the analogue is differential entropy, $h(p) = -\int p(x) \log p(x)\, dx$, which can be negative (a Gaussian with standard deviation below $1/\sqrt{2\pi e} \approx 0.242$ has negative differential entropy), a reminder that differential entropy is not a straightforward limit of the discrete kind. In an LLM, the entropy of the model's output distribution at a position is a direct readout of how sure it is about the next token: low entropy at "the capital of France is ___", high entropy at "my favorite color is ___".
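The anchors above are easy to reproduce. This is a small sketch with an assumed helper `entropy_bits`; the 0.9/0.1 coin and the 8-outcome uniform distribution are hypothetical examples chosen here, not values taken from the text.

```python
import math

def entropy_bits(dist):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

fair = entropy_bits([0.5, 0.5])      # 1 bit
biased = entropy_bits([0.9, 0.1])    # about 0.469 bits
uniform = entropy_bits([1 / 8] * 8)  # log2(8) = 3 bits
certain = entropy_bits([1.0, 0.0])   # 0 bits: no outcome is ever a surprise

print(fair, biased, uniform, certain)
```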
Cross-entropy: the cost of using the wrong distribution
Entropy scores a distribution against itself. Cross-entropy scores the cost of encoding draws from $p$ using a code optimized for $q$: the average surprise you incur when the data truly comes from $p$ but you model it as $q$.

$$H(p, q) = -\sum_{x} p(x) \log q(x) = \mathbb{E}_{x \sim p}\big[-\log q(x)\big]$$

Its decomposition is the most important identity in this entire post:

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\Vert\, q)$$

Cross-entropy is the irreducible entropy of the data plus a non-negative penalty for the mismatch between $p$ and $q$. Because that penalty (the KL divergence below) is always $\ge 0$, cross-entropy is always at least the entropy, $H(p, q) \ge H(p)$, with equality if and only if $q = p$. For a biased coin $p$ and a model $q$ that misjudges the bias, the cross-entropy in bits exceeds $H(p)$ by exactly $D_{\mathrm{KL}}(p \,\Vert\, q)$.
The model pays those extra bits per symbol for being wrong about the bias. Training an LLM minimizes $H(p_{\text{data}}, p_\theta)$, and since $H(p_{\text{data}})$ is a constant the optimizer cannot touch, minimizing cross-entropy is identical to minimizing $D_{\mathrm{KL}}(p_{\text{data}} \,\Vert\, p_\theta)$, pulling the model's distribution toward the data's.
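The decomposition of cross-entropy into entropy plus KL can be verified on a coin. The true bias `p` and the model `q` below are hypothetical numbers chosen for this sketch, and the helper names are assumptions.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]  # hypothetical true coin
q = [0.7, 0.3]  # hypothetical model that underestimates the bias

h_p = entropy(p)            # about 0.469 bits, the floor
h_pq = cross_entropy(p, q)  # about 0.637 bits, what the model actually pays
gap = kl(p, q)              # about 0.168 bits, the penalty for the mismatch

print(h_p, h_pq, gap, h_pq - (h_p + gap))
```

The last printed value is zero up to floating-point error, and setting `q = p` drives `gap` to zero so that the cross-entropy lands exactly on the entropy.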
The one-hot vs. data-distribution confusion, resolved
Here is the subtlety that trips up careful readers, and it is worth a paragraph rather than a footnote. The section on NLL said the per-token target is one-hot, which makes $H(p) = 0$ and the per-token loss equal to pure KL with no floor. This section says the loss bottoms out at the nonzero data entropy $H(p_{\text{data}})$. Both are true, because they describe cross-entropy against two different $p$'s.
Per individual token, the target distribution really is the one-hot $\delta_{x_t}$. It has zero entropy, so the per-token loss is all KL and could in principle reach zero if the model put all its mass on $x_t$. But natural language is genuinely stochastic: given a context, many next tokens are valid, so the true conditional $p_{\text{data}}(\cdot \mid x_{<t})$ is not one-hot and has real entropy. Those one-hot targets are samples drawn from $p_{\text{data}}$, and the corpus-averaged loss is a Monte-Carlo estimate of the cross-entropy against the full distribution. So the per-token loss can hit zero only for a token that was truly deterministic; in aggregate the loss cannot fall below $H(p_{\text{data}})$, the irreducible entropy of language, the few bits of genuine unpredictability per token that no model can remove. A perfect model drives the KL term to zero and lands exactly at $H(p_{\text{data}})$, not at zero. The naive reading "targets are one-hot, so $H(p) = 0$, so the loss goes to zero" is wrong, and the resolution is that the one-hot $p$ you compute against per token and the entropic $p_{\text{data}}$ you estimate against in aggregate are different distributions wearing the same letter.
KL divergence: a directional gap between distributions
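The penalty term, the Kullback–Leibler divergence, is the number of extra nats you pay for using $q$ when the truth is $p$.

$$D_{\mathrm{KL}}(p \,\Vert\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

By Gibbs' inequality it is always $\ge 0$ and is $0$ only when $p = q$, which makes it the natural measure of a gap. But it is not a distance: it is not symmetric, and the asymmetry has real consequences. For a biased coin $p$ and a mismatched model $q$, $D_{\mathrm{KL}}(p \,\Vert\, q)$ and the reverse $D_{\mathrm{KL}}(q \,\Vert\, p)$ come out to different numbers of bits.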
The two directions behave differently enough to deserve names:
| | Forward KL $D_{\mathrm{KL}}(p \,\Vert\, q)$ | Reverse KL $D_{\mathrm{KL}}(q \,\Vert\, p)$ |
|---|---|---|
| Expectation taken under | the true $p$ | the model $q$ |
| Penalizes | $q$ being small where $p$ is large | $q$ being large where $p$ is small |
| Behavior | mass-covering / mean-seeking: stretches $q$ to cover every mode of $p$ | mode-seeking / zero-forcing: collapses $q$ onto one mode and ignores the rest |
| Where it shows up | maximum-likelihood training (this is what minimizing cross-entropy does) | variational inference, some RL objectives |
Forward KL is what LLM pretraining minimizes — and its mass-covering nature is why a base model will happily assign probability to many continuations rather than committing to one. KL divergence also appears explicitly as a loss term, not just a diagnostic, in two places worth knowing (both detailed in Part 4): the RLHF penalty that leashes a fine-tuned policy to its reference model so it cannot drift into reward-hacked gibberish, and the VAE regularizer that pulls the learned latent posterior toward a chosen prior.
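The directional behavior of KL can be seen with two numbers. In this sketch the "truth" `p` has two equally likely outcomes and the model `q` has collapsed onto one of them; both distributions are hypothetical and chosen only to make the asymmetry visible.

```python
import math

def kl_bits(a, b):
    """D_KL(a || b) in bits: the expectation is taken under a."""
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

p = [0.5, 0.5]    # hypothetical truth with two equally likely outcomes
q = [0.99, 0.01]  # hypothetical model collapsed onto the first outcome

forward = kl_bits(p, q)  # about 2.33 bits: q is tiny where p has real mass
reverse = kl_bits(q, p)  # about 0.92 bits: q rarely visits the outcome it ignores

print(forward, reverse)
```

Forward KL charges heavily for the neglected outcome because the expectation is taken under `p`, which visits it half the time. Reverse KL takes the expectation under `q`, which almost never goes there, so the collapse is penalized far less.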
Perplexity: the loss on an interpretable scale
NLL in nats is the quantity that matters, but a nat means little to intuition. Perplexity exponentiates it back onto the scale of "number of options."

$$\text{PPL} = \exp(\text{NLL}) = \exp\!\Big(-\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})\Big)$$

The interpretation is the effective branching factor: a perplexity of $k$ means the model is, on average, as uncertain at each step as if it were choosing uniformly among $k$ equally likely tokens. This follows directly from the uniform case: a uniform distribution over $k$ options has $H = \log k$, so $e^{H} = k$. The exponential makes the relationship between loss and perplexity strongly nonlinear: shaving $\Delta$ nats off the NLL divides PPL by $e^{\Delta}$ wherever you start, so the same loss improvement removes many perplexity points at high loss and only a few at low loss. Lower is always better: lower PPL means lower NLL, which means the model assigns higher probability to the held-out data.
The base discipline from the top of the post matters most here. Perplexity is $e^{\text{NLL}}$ only when NLL is in nats. The bit-denominated cousins convert explicitly:

$$\text{bits per token} = \frac{\text{NLL}_{\text{nats}}}{\ln 2}, \qquad \text{PPL} = 2^{\text{bits per token}} = e^{\text{NLL}_{\text{nats}}}$$

So an NLL of $x$ nats gives $2^{x / \ln 2}$, which equals $e^{x}$: the same number, as it must be. Bits-per-character is the form you report for character-level work; bits-per-byte (BPB) divides the total nats by $\ln 2$ times the total bytes, $\text{BPB} = \frac{\text{total NLL in nats}}{\ln 2 \cdot \text{total bytes}}$, and is the tokenizer-agnostic metric used for fair cross-model comparison precisely because bytes are a fixed unit while tokens are not.
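The unit conversions are short enough to check directly. The NLL value and the option count `k` below are hypothetical inputs for this sketch.

```python
import math

nll_nats = 2.0  # hypothetical average NLL per token, in nats

ppl = math.exp(nll_nats)               # perplexity from nats
bits_per_token = nll_nats / math.log(2)  # convert nats to bits
ppl_from_bits = 2 ** bits_per_token    # perplexity from bits: same number

# Effective branching factor: a uniform guess over k options has PPL exactly k.
k = 50
uniform_nll = -math.log(1 / k)
uniform_ppl = math.exp(uniform_nll)

print(ppl, ppl_from_bits, uniform_ppl)
```

Mixing the bases, for example computing `2 ** nll_nats` or `math.exp(bits_per_token)`, silently produces a different and wrong perplexity.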
Perplexity's traps
Perplexity is a sharp metric and a dull instrument at once, and four caveats keep it honest.
It is corpus-dependent. A perplexity number is a statement about a particular text distribution. GPT-2's perplexity on Penn Treebank (35.76 for the 1.5B model) and on WikiText-103 (17.48) are both correct and not comparable to each other: they measure surprise on different distributions.
Comparing across tokenizers is invalid, and the canonical "headline" comparison is exactly this trap. It is tempting to line up GPT-2 1.5B's figure on WikiText-103 against LLaMA-3 8B's much lower figure on WikiText-2 and conclude LLaMA is several times less perplexed. That conclusion is unsound on two counts: a different split (WikiText-103 vs WikiText-2), and more fundamentally a different granularity. GPT-2's number is reported as word-level perplexity (the paper de-tokenizes its BPE output back to a per-word basis precisely so the figure is comparable), while the LLaMA-3 figure is subword/token-level under the LLaMA tokenizer. Perplexity is defined per modeling unit, and shrinks when each unit predicts less text, so a model with a finer tokenizer posts a lower per-token perplexity for free. The valid way to compare two models with different tokenizers is to route through a fixed unit: word-level perplexity, or better, bits-per-byte, which is why frontier reports increasingly quote BPB instead of raw PPL.
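A small sketch of why per-token perplexity misleads across tokenizers. It assumes two hypothetical models that assign the same total NLL to the same text but split it into different numbers of tokens; every number here is invented for illustration.

```python
import math

def token_ppl(total_nll_nats: float, n_tokens: int) -> float:
    """Per-token perplexity: depends on how many tokens the text was cut into."""
    return math.exp(total_nll_nats / n_tokens)

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Bits-per-byte: normalizes by a unit that no tokenizer can change."""
    return total_nll_nats / (math.log(2) * total_bytes)

total_bytes = 1000  # hypothetical evaluation text, 1000 bytes long
total_nll = 700.0   # hypothetical: both models assign the same total NLL (nats)

coarse_tokens = 200  # hypothetical tokenizer with long tokens
fine_tokens = 350    # hypothetical tokenizer with short tokens

ppl_coarse = token_ppl(total_nll, coarse_tokens)  # exp(3.5), about 33.1
ppl_fine = token_ppl(total_nll, fine_tokens)      # exp(2.0), about 7.39
bpb = bits_per_byte(total_nll, total_bytes)       # about 1.01 for both models

print(ppl_coarse, ppl_fine, bpb)
```

The two models are equally good at predicting the text, yet the finer tokenizer reports a perplexity more than four times lower. BPB is identical for both, which is the comparison that actually reflects predictive quality.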
Conditional perplexity is much lower than unconditional. The perplexity of a continuation $y$ scored under $p(y \mid x)$, given a prompt $x$, is far below the perplexity of the same $y$ scored under $p(y)$ alone. Conditioning removes uncertainty, so a number is only meaningful alongside what it was (or wasn't) conditioned on.
It can be gamed without understanding. A model can lower perplexity by playing it safe — favoring high-frequency tokens, hedging, and repeating — without getting any better at the things we care about. This is the famous, repeatedly observed gap: perplexity and downstream task performance are correlated but not the same, and below a certain loss the correlation frays. Perplexity measures compression, and a fluent parrot compresses text well. It is a necessary signal and an insufficient one — which is why evaluation needs calibration (a model can have great NLL yet be confidently wrong, measured by Expected Calibration Error) and task benchmarks on top of it.
The unified picture
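Every quantity in this post is one identity seen from a different side. For language-model training they all sit on a single line:

$$\underbrace{-\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})}_{\text{NLL}} \;\approx\; \underbrace{H(p_{\text{data}}, p_\theta)}_{\text{cross-entropy}} \;=\; \underbrace{H(p_{\text{data}})}_{\text{irreducible}} + \underbrace{D_{\mathrm{KL}}(p_{\text{data}} \,\Vert\, p_\theta)}_{\text{reducible}}, \qquad \text{PPL} = e^{\text{NLL}}$$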
Read it left to right. The thing you compute — the average negative log-likelihood of the corpus — is cross-entropy between the data and the model. Cross-entropy splits into a part you cannot reduce (the entropy of language itself) and a part you can (the KL divergence from data to model). Training drives that KL toward zero, which is the same as maximizing likelihood, which is the same as minimizing description length and compressing the data as tightly as the model allows. And perplexity is just that loss exponentiated onto the scale of effective vocabulary so a human can read it.
| Quantity | Symbol | One-line meaning |
|---|---|---|
| Probability | $p_\theta(x)$, $\theta$ fixed | how likely an outcome is, model fixed |
| Likelihood | $p_\theta(x)$, $x$ fixed | how well a model explains fixed data |
| Log-likelihood | $\log p_\theta(x)$ | likelihood made additive and underflow-proof |
| NLL | $-\frac{1}{T} \sum_t \log p_\theta(x_t \mid x_{<t})$ | the training loss (minimize it) |
| Entropy | $H(p)$ | average surprise of $p$ under itself |
| Cross-entropy | $H(p, q)$ | average surprise of $p$ encoded as $q$: the loss |
| KL divergence | $D_{\mathrm{KL}}(p \,\Vert\, q)$ | extra nats for using $q$ instead of $p$: the error |
| Perplexity | $e^{\text{NLL}}$ | effective branching factor: the loss, human-readable |
None of these is the "real" one with the others as approximations. They are a single circle of ideas (a distribution, the surprise it assigns, the average of that surprise, the gap between two distributions, and that gap put on a readable scale), and a language model is the machine that spends all of training walking around that circle toward $p_{\text{data}}$. The narrative version of where this fits among optimization, architecture, and estimation is the series finale, Putting It Together.