Blog Post
Cheatsheet: Soft Metrics
Perplexity, negative log-likelihood, entropy, KL divergence, and cross-entropy annotated term-by-term — what each measures, how they relate, and what values to expect during training.
These metrics appear throughout pre-training, evaluation, and alignment. They are closely related — perplexity is a transformation of NLL, NLL is per-token cross-entropy, entropy is cross-entropy of a distribution with itself — but each has a distinct interpretation. This cheatsheet annotates all of them and shows the relationships explicitly.
Negative log-likelihood (NLL)
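The negative log-probability the model assigns to the observed sequence $x_1, \dots, x_T$:
$$\mathrm{NLL}(x_{1:T}) = -\log p_\theta(x_1, \dots, x_T) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$
Lower is better: a model that confidently predicts each token correctly has high $p_\theta(x_t \mid x_{<t})$ and thus low NLL.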
Why the chain rule factorisation? The joint probability of a sequence factors exactly as the product of conditional probabilities — no approximation. An autoregressive LM computes exactly this: it predicts each token given all previous ones.
Units: nats (when using the natural log) or bits (when using $\log_2$). The convention in PyTorch and most ML papers is nats, which is what F.cross_entropy returns.
NLL vs loss: what F.cross_entropy returns during training (with its default reduction="mean") is the average NLL per token, so training loss ≈ mean per-token NLL. See NLLPerplexityCurve.
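The chain-rule sum and the unit conversion can be checked with a few lines of plain Python. This is a minimal sketch: the per-token probabilities are made-up illustrative values, and the helper names `sequence_nll`, `mean_nll`, and `nats_to_bits` are hypothetical, not part of any library.

```python
import math

def sequence_nll(token_probs):
    """Total NLL in nats: -sum_t log p(x_t | x_<t)."""
    return -sum(math.log(p) for p in token_probs)

def mean_nll(token_probs):
    """Average NLL per token in nats (what a mean-reduced training loss reports)."""
    return sequence_nll(token_probs) / len(token_probs)

def nats_to_bits(nats):
    """Convert nats to bits by dividing by ln 2."""
    return nats / math.log(2)

# Hypothetical probabilities the model assigned to each observed token.
token_probs = [0.5, 0.25, 0.125]

total_nll = sequence_nll(token_probs)
# Chain rule: the joint probability is the product of the conditionals,
# so -log(joint) equals the sum of the per-token NLLs.
joint_prob = math.prod(token_probs)

print(f"total NLL : {total_nll:.4f} nats = {nats_to_bits(total_nll):.4f} bits")
print(f"mean NLL  : {mean_nll(token_probs):.4f} nats/token")
print(f"-log joint: {-math.log(joint_prob):.4f} nats")
```

The probabilities are powers of one half, so the total comes out to a whole number of bits (1 + 2 + 3), which makes the nats-to-bits conversion easy to verify by hand.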
Cross-entropy
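The expected negative log-probability under $q$ when samples are drawn from $p$:
$$H(p, q) = -\sum_x p(x) \log q(x)$$
In LLM training, $p$ is the empirical data distribution (one-hot over the true next token) and $q = p_\theta$ is the model's next-token distribution:
$$H(p, p_\theta) = -\sum_{v \in V} p(v \mid x_{<t}) \log p_\theta(v \mid x_{<t})$$
Since $p$ is one-hot, this collapses to $-\log p_\theta(x_t \mid x_{<t})$, the NLL of the true token, which is exactly what F.cross_entropy computes. In language modelling, "cross-entropy loss" and "NLL loss" refer to the same quantity.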
Relation to entropy:
$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$
Cross-entropy = entropy of $p$ + KL divergence from $p$ to $q$. Minimising cross-entropy over $q$ is equivalent to minimising the KL divergence when $p$ is fixed (which it is, since the data distribution doesn't change during training). See CrossEntropyDecomposition.
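The decomposition is easy to verify numerically. The sketch below uses two arbitrary three-outcome distributions chosen for illustration; the function names are hypothetical helpers, and terms with zero probability under the first argument are skipped (the usual 0 log 0 = 0 convention).

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]        # illustrative "data" distribution
q = [0.5, 0.3, 0.2]        # illustrative "model" distribution
one_hot = [0.0, 1.0, 0.0]  # true next token is index 1

print(f"H(p, q)         = {cross_entropy(p, q):.6f}")
print(f"H(p) + KL(p||q) = {entropy(p) + kl(p, q):.6f}")
# With a one-hot target the entropy term vanishes and the cross-entropy
# is just the NLL of the true token.
print(f"H(one_hot, q)   = {cross_entropy(one_hot, q):.6f}")
print(f"-log q[1]       = {-math.log(q[1]):.6f}")
```

For the one-hot case the entropy of the target is zero, so cross-entropy, KL divergence, and per-token NLL all coincide.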
Entropy
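$$H(p) = -\sum_x p(x) \log p(x)$$
The expected surprise under the distribution itself, i.e. how uncertain $p$ is. A uniform distribution over $|V|$ tokens has maximum entropy $\ln |V|$; a delta distribution (always the same token) has entropy 0.
Entropy of an LLM's next-token distribution:
$$H\big(p_\theta(\cdot \mid x_{<t})\big) = -\sum_{v \in V} p_\theta(v \mid x_{<t}) \log p_\theta(v \mid x_{<t})$$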
This is what the PPO/GRPO entropy bonus maximises — it prevents the policy from collapsing to a single deterministic token sequence.
What entropy values mean in practice:
| Entropy (nats) | Interpretation |
|---|---|
| ~0 | Effectively deterministic — model is certain |
| ~2–4 | Typical for well-trained LLMs on fluent continuations |
| ~$\ln \lvert V \rvert$ ≈ 10–11 | Uniform over vocabulary; model has learned nothing |
High entropy at inference = uncertain or incoherent output. Low entropy = confident, possibly repetitive. Entropy collapse during RL training (entropy → 0) is the canonical sign of mode collapse. See EntropyCollapseDemo.
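The two extremes of the table can be reproduced from logits. This is a sketch under stated assumptions: a vocabulary of 50,000 tokens is assumed for illustration, the logits are synthetic, and `softmax` and `entropy_nats` are hypothetical helper names written here in plain Python rather than taken from a library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_nats(probs):
    """Entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)

vocab_size = 50_000  # assumed vocabulary size, for illustration only

# All logits equal: the model has no preference, entropy is ln|V|.
uniform = softmax([0.0] * vocab_size)

# One logit far above the rest: effectively deterministic, entropy near 0.
peaked = softmax([30.0] + [0.0] * (vocab_size - 1))

print(f"uniform: {entropy_nats(uniform):.4f} nats (ln|V| = {math.log(vocab_size):.4f})")
print(f"peaked : {entropy_nats(peaked):.2e} nats")
```

Tracking this quantity per step during RL training is one simple way to watch for the entropy collapse described above.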
Perplexity
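$$\mathrm{PPL} = \exp\Big(-\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})\Big)$$
Perplexity is the exponential of the mean per-token NLL. A model with perplexity $k$ is, on average, as uncertain about each token as a uniform distribution over $k$ options.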
Why exponentiate? NLL accumulates over sequence length: a 1000-token sequence has roughly 1000× the NLL of a single token. Per-token NLL normalises for length, and the exponential converts from log-space back to a "number of equally likely choices", which is easier to interpret.
Reading perplexity values:
| PPL | Interpretation |
|---|---|
| ~1 | Perfect prediction — impossible in practice |
| ~10–20 | Strong language model on its training domain |
| ~50–100 | Reasonable model or out-of-domain text |
| ~$\lvert V \rvert$ ≈ 50,000 | Uniform over vocabulary; random baseline |
Perplexity is not comparable across tokenisers. A model with a 32K vocabulary and a model with a 100K vocabulary will have different perplexities on the same text, because the token boundaries are different — each "token" covers a different amount of linguistic content. Always compare perplexity numbers produced by the same tokeniser. See PerplexityUnitConverter.
Relationship to NLL:
$$\mathrm{PPL} = e^{\mathcal{L}}, \qquad \mathcal{L} = \text{mean per-token NLL in nats}$$
A drop in training loss of 0.1 nats/token multiplies perplexity by $e^{-0.1} \approx 0.905$, roughly a 10% reduction, regardless of the starting loss. In absolute terms, the same loss drop produces a much larger perplexity reduction early in training (high loss) than late in training (low loss).
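The arithmetic behind the 0.1-nat example can be worked through directly. In this sketch the starting losses of 6.0 and 3.0 nats/token are arbitrary illustrative values, and `perplexity` and `ppl_drop` are hypothetical helper names.

```python
import math

def perplexity(mean_nll_nats):
    """Perplexity from mean per-token NLL in nats."""
    return math.exp(mean_nll_nats)

def ppl_drop(loss, delta=0.1):
    """Absolute perplexity reduction when the loss falls by `delta` nats."""
    return perplexity(loss) - perplexity(loss - delta)

for loss in (6.0, 3.0):  # illustrative "early" and "late" training losses
    before = perplexity(loss)
    after = perplexity(loss - 0.1)
    print(
        f"loss {loss:.1f} -> {loss - 0.1:.1f}: "
        f"PPL {before:.2f} -> {after:.2f} "
        f"(ratio {after / before:.4f}, absolute drop {ppl_drop(loss):.2f})"
    )
```

The ratio is the same in both rows, while the absolute drop differs by more than an order of magnitude, which is why loss curves are usually easier to read than perplexity curves early in training.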
KL divergence
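$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
The expected log-ratio of $P$ to $Q$ under samples from $P$. Measures how much information is lost when using $Q$ to approximate $P$. Always $\geq 0$; equals 0 iff $P = Q$.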
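KL is asymmetric: $D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P)$ in general. The direction matters, taking $P$ as the target and $Q$ as the approximation being fitted:
- Forward KL $D_{\mathrm{KL}}(P \,\|\, Q)$: penalises $Q$ for assigning near-zero probability anywhere $P$ has mass, so $Q$ must cover all of $P$'s support. This is the direction minimised by maximum-likelihood (cross-entropy) training.
- Reverse KL $D_{\mathrm{KL}}(Q \,\|\, P)$: penalises $Q$ for putting mass anywhere $P$ does not, so it tends to seek a mode of $P$. Used in variational inference (ELBO) and in the KL penalty of PPO/GRPO-style RL, where the expectation is taken under the policy $Q = \pi_\theta$ and $P = \pi_{\mathrm{ref}}$ is the reference.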
See KLDirectionDemo for an interactive comparison.
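The asymmetry shows up clearly on a small discrete example. The sketch below assumes a bimodal target over four outcomes and two hand-picked candidate approximations, one spread out and one concentrated on a single mode; all numbers are illustrative, and `kl` is a hypothetical helper.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats; terms with p(x) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative bimodal target: almost all mass on the first and last outcome.
target = [0.49, 0.01, 0.01, 0.49]

# Two candidate approximations (hand-picked, not fitted):
q_cover = [0.25, 0.25, 0.25, 0.25]  # spreads mass over the whole support
q_mode = [0.97, 0.01, 0.01, 0.01]   # commits to one mode, neglects the other

for name, q in (("cover", q_cover), ("mode", q_mode)):
    print(
        f"{name:5s}  forward KL(target||q) = {kl(target, q):.4f}   "
        f"reverse KL(q||target) = {kl(q, target):.4f}"
    )
```

Forward KL ranks the covering candidate as the better fit, while reverse KL prefers the candidate that commits to one mode, matching the mass-covering versus mode-seeking description in the list above.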
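KL from reference in RL training:
$$D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}) = \mathbb{E}_{y \sim \pi_\theta}\left[\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right]$$
In practice this is estimated per token on sampled completions and averaged. As a rough rule of thumb, a healthy RL run keeps this below ~5 nats; above ~10 is a sign of reward hacking in progress. See KLPenaltyTradeoff.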
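A per-token estimate can be sketched from the log-probabilities that the policy and the reference assign to the same sampled tokens. Two common Monte Carlo forms are shown: the plain mean log-ratio, and a variant that is non-negative for every sample. This is an illustration under assumptions: the log-probabilities are made up, the tokens are assumed to have been sampled from the policy, and the function names are hypothetical rather than taken from any RL library.

```python
import math

def kl_mean_log_ratio(logp_policy, logp_ref):
    """Estimate KL(policy || ref) as the mean of log(pi_theta / pi_ref)
    over tokens sampled from the policy. Unbiased, but individual terms
    can be negative."""
    diffs = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    return sum(diffs) / len(diffs)

def kl_nonnegative(logp_policy, logp_ref):
    """Alternative estimate using r - 1 - log r with r = pi_ref / pi_theta.
    Each term is >= 0, which typically reduces variance."""
    terms = []
    for lp, lr in zip(logp_policy, logp_ref):
        log_r = lr - lp
        terms.append(math.exp(log_r) - 1.0 - log_r)
    return sum(terms) / len(terms)

# Hypothetical per-token log-probs for one sampled completion.
logp_policy = [-0.1, -0.5, -1.0]
logp_ref = [-0.3, -0.4, -1.5]

print(f"mean log-ratio estimate: {kl_mean_log_ratio(logp_policy, logp_ref):.4f} nats/token")
print(f"non-negative estimate  : {kl_nonnegative(logp_policy, logp_ref):.4f} nats/token")
```

Both functions return a per-token average; whether a monitoring threshold refers to that average or to the sum over a whole completion depends on the training setup and is worth checking before comparing numbers.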
Bits per character / bits per byte
$$\mathrm{BPB} = \frac{\mathrm{NLL}_{\text{total}}}{\ln 2 \cdot N_{\text{bytes}}}$$
BPB (bits per byte) takes the total NLL in nats, converts it to bits, and normalises by the number of UTF-8 bytes rather than the number of tokens, making it comparable across tokenisers. A model predicting English text well achieves roughly 1.0–1.3 BPB. Shannon estimated the entropy of written English at roughly 0.6–1.3 bits per character, so a BPB near 1.0 puts the model within the range of that classical estimate of how compressible the text is.
BPC (bits per character) is the same metric when normalising by character count rather than byte count — equivalent for ASCII, different for multi-byte Unicode.
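The sketch below shows why byte-level normalisation is tokeniser-independent while perplexity is not. It assumes, purely for illustration, that two tokenisers split the same text into 3 and 5 tokens and that both models assign the same total NLL of 8.0 nats; the text, the numbers, and the helper names are all hypothetical.

```python
import math

def bits_per_byte(total_nll_nats, text):
    """Total NLL converted to bits, divided by the UTF-8 byte count."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)

def bits_per_char(total_nll_nats, text):
    """Total NLL converted to bits, divided by the character count."""
    return total_nll_nats / (math.log(2) * len(text))

def perplexity(total_nll_nats, n_tokens):
    """Per-token perplexity, which depends on how the text was tokenised."""
    return math.exp(total_nll_nats / n_tokens)

text = "na\u00efve caf\u00e9"  # 10 characters, 12 UTF-8 bytes
total_nll = 8.0                # assumed total NLL in nats for the whole text

# Same text and total NLL, but different (hypothetical) token counts.
for n_tokens in (3, 5):
    print(
        f"{n_tokens} tokens: PPL = {perplexity(total_nll, n_tokens):.2f}, "
        f"BPB = {bits_per_byte(total_nll, text):.4f}"
    )

print(f"BPC = {bits_per_char(total_nll, text):.4f} (differs from BPB: multi-byte characters)")
```

Perplexity changes with the token count while BPB stays fixed, and BPC only matches BPB when every character is a single byte.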
How they relate
A single forward pass computes the cross-entropy loss, i.e. the mean per-token NLL. From that scalar you can immediately derive perplexity (exponentiate), total NLL (multiply by sequence length), and BPB (divide the total NLL by $\ln 2$ and by the byte count). They are all the same quantity, just scaled and transformed for readability.