S. Roy

Blog Post

Cheatsheet: Soft Metrics

Perplexity, negative log-likelihood, entropy, KL divergence, and cross-entropy annotated term-by-term — what each measures, how they relate, and what values to expect during training.

These metrics appear throughout pre-training, evaluation, and alignment. They are closely related — perplexity is a transformation of NLL, NLL is per-token cross-entropy, entropy is cross-entropy of a distribution with itself — but each has a distinct interpretation. This cheatsheet annotates all of them and shows the relationships explicitly.


Negative log-likelihood (NLL)

$$\text{NLL} = -\log P_\theta(x_1, \ldots, x_T) = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$$

The negative log-probability the model assigns to the observed sequence. Lower is better: a model that confidently predicts each token correctly has high $P_\theta(x_t \mid x_{<t})$ and thus low NLL.

Why the chain rule factorisation? The joint probability of a sequence factors exactly as the product of conditional probabilities — no approximation. An autoregressive LM computes exactly this: it predicts each token given all previous ones.

Units: nats (when using natural log) or bits (when using $\log_2$). The convention in PyTorch and most ML papers is nats, which is what F.cross_entropy returns.

NLL vs loss: with its default reduction="mean", F.cross_entropy returns the average NLL per token, so training loss ≈ mean per-token NLL. See NLLPerplexityCurve.
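To make the chain-rule factorisation and the summed-versus-mean distinction concrete, here is a minimal sketch in plain Python. The per-token probabilities are made-up values standing in for what a model might assign to the observed tokens.

```python
import math

def sequence_nll(token_probs):
    """Summed NLL in nats: -sum_t log P(x_t | x_<t)."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities the model assigned to each observed token.
token_probs = [0.5, 0.25, 0.8, 0.1]

nll_sum = sequence_nll(token_probs)        # summed over the sequence
nll_mean = nll_sum / len(token_probs)      # what a mean-reduced loss reports
nll_bits = nll_sum / math.log(2)           # nats converted to bits

# Chain rule: the product of the conditionals is the joint probability,
# so the summed NLL equals -log of that product.
joint_prob = math.prod(token_probs)
assert math.isclose(nll_sum, -math.log(joint_prob))

print(f"summed NLL: {nll_sum:.3f} nats = {nll_bits:.3f} bits")
print(f"mean NLL:   {nll_mean:.3f} nats/token")
```

The summed value grows with sequence length, while the mean is the number that lines up with a training-loss curve.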


Cross-entropy

$$H(p, q) = -\sum_x p(x) \log q(x)$$

The expected negative log-probability under $q$ when samples are drawn from $p$. In LLM training, $p$ is the empirical data distribution (one-hot over the true next token) and $q = \pi_\theta$:

$$H(p_{\text{data}}, \pi_\theta) = -\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \pi_\theta(x)\right]$$

Since $p_{\text{data}}$ is one-hot, this collapses to the NLL of the true token, which is exactly what F.cross_entropy computes. The names "cross-entropy loss" and "NLL loss" refer to the same quantity in language modelling.
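The collapse from cross-entropy to the NLL of the true token can be checked by hand. The sketch below uses a hypothetical 4-token vocabulary and a made-up model distribution; it does not call PyTorch, it only mirrors the formula.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), skipping terms where p(x) = 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical next-token distribution from the model.
q_model = [0.1, 0.2, 0.6, 0.1]

# One-hot target: the true next token is index 2.
true_index = 2
p_one_hot = [1.0 if i == true_index else 0.0 for i in range(len(q_model))]

ce = cross_entropy(p_one_hot, q_model)
nll_true_token = -math.log(q_model[true_index])

# With a one-hot p, only the true token's term survives the sum.
assert math.isclose(ce, nll_true_token)
print(f"cross-entropy = NLL of true token = {ce:.4f} nats")
```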

Relation to entropy:

$$H(p, q) = H(p) + D_{\text{KL}}(p \| q)$$

Cross-entropy = entropy of $p$ + KL divergence from $p$ to $q$. Minimising cross-entropy is equivalent to minimising KL divergence when $H(p)$ is fixed (which it is, since the data distribution doesn't change during training). See CrossEntropyDecomposition.
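The decomposition is easy to verify numerically. This is a small sketch with two arbitrary example distributions over three outcomes; the helper names are illustrative, not from any library.

```python
import math

def entropy(p):
    """H(p) in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) in nats."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Arbitrary example distributions: p plays the data, q the model.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl_pq = kl_divergence(p, q)

# Cross-entropy splits into a fixed part (H(p)) and a model-dependent gap (KL).
assert math.isclose(h_pq, h_p + kl_pq)
print(f"H(p, q) = {h_pq:.4f} = H(p) {h_p:.4f} + KL {kl_pq:.4f}")
```

Because the KL term is non-negative, the cross-entropy can never fall below $H(p)$, which is why the loss has a floor set by the data.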


Entropy

$$H(p) = -\sum_x p(x) \log p(x) = -\mathbb{E}_{x \sim p}\!\left[\log p(x)\right]$$

The expected surprise under the distribution itself, i.e. how uncertain $p$ is. A uniform distribution over $V$ tokens has maximum entropy $\log V$; a delta distribution (always the same token) has entropy 0.
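The two extremes can be checked directly. The sketch below assumes a vocabulary of 50,000 tokens purely for illustration.

```python
import math

def entropy(p):
    """H(p) = sum_x p(x) log(1 / p(x)) in nats, with 0 * log 0 treated as 0."""
    return math.fsum(px * math.log(1.0 / px) for px in p if px > 0)

V = 50_000  # hypothetical vocabulary size

uniform = [1.0 / V] * V            # every token equally likely
delta = [1.0] + [0.0] * (V - 1)    # all mass on a single token

h_uniform = entropy(uniform)
h_delta = entropy(delta)

print(f"uniform: {h_uniform:.3f} nats (log V = {math.log(V):.3f})")
print(f"delta:   {h_delta:.3f} nats")
```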

Entropy of an LLM's next-token distribution:

$$H(\pi_\theta(\cdot \mid x_{<t})) = -\sum_{v=1}^{V} \pi_\theta(v \mid x_{<t}) \log \pi_\theta(v \mid x_{<t})$$

This is what the PPO/GRPO entropy bonus maximises — it prevents the policy from collapsing to a single deterministic token sequence.
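As a rough illustration of next-token entropy, the sketch below turns two made-up logit vectors into distributions with a softmax and compares their entropies. The logits and the 4-token vocabulary are assumptions chosen for readability, not values from a real model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """H(p) in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)

# Hypothetical logits over a tiny 4-token vocabulary.
logits_flat = [0.0, 0.0, 0.0, 0.0]     # no preference between tokens
logits_peaked = [8.0, 0.0, 0.0, 0.0]   # one token strongly preferred

h_flat = entropy(softmax(logits_flat))
h_peaked = entropy(softmax(logits_peaked))

print(f"flat logits:   {h_flat:.4f} nats")
print(f"peaked logits: {h_peaked:.4f} nats")
```

An entropy bonus pushes the policy away from the peaked case; a run where the entropy drifts toward the peaked value at every position is the collapse pattern described in this section.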

What entropy values mean in practice:

| Entropy (nats) | Interpretation |
| --- | --- |
| ~0 | Effectively deterministic: model is certain |
| ~2–4 | Typical for well-trained LLMs on fluent continuations |
| ~$\log V$ ≈ 10–11 | Uniform over vocabulary: model has learned nothing |

High entropy at inference = uncertain or incoherent output. Low entropy = confident, possibly repetitive. Entropy collapse during RL training (entropy → 0) is the canonical sign of mode collapse. See EntropyCollapseDemo.


Perplexity

$$\text{PPL} = \exp\!\left(\frac{1}{T} \sum_{t=1}^{T} -\log P_\theta(x_t \mid x_{<t})\right) = \exp(\text{mean per-token NLL})$$

Perplexity is the exponential of mean per-token NLL. A model with perplexity $k$ is, on average, as uncertain about each token as a uniform distribution over $k$ options.

Why exponentiate? Summed NLL accumulates over sequence length: at the same per-token loss, a 1000-token sequence has 1000× the NLL of a single token. Per-token NLL normalises for length, and the exponential converts from log-space back to a "number of equally likely choices" that has an intuitive interpretation.
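The "uniform over $k$ options" reading and the length normalisation can both be seen in a few lines. In this sketch the "model" is a hypothetical one that assigns probability 1/8 to every observed token.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean per-token NLL, given natural-log probabilities."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

k = 8  # hypothetical: the model is always torn between 8 equally likely tokens

short_seq = [math.log(1.0 / k)] * 10
long_seq = [math.log(1.0 / k)] * 1000

# Summed NLL scales with length...
nll_short = -sum(short_seq)
nll_long = -sum(long_seq)

# ...but perplexity does not, and it recovers k.
ppl_short = perplexity(short_seq)
ppl_long = perplexity(long_seq)

print(f"summed NLL: {nll_short:.1f} vs {nll_long:.1f} nats")
print(f"perplexity: {ppl_short:.3f} vs {ppl_long:.3f}")
```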

Reading perplexity values:

| PPL | Interpretation |
| --- | --- |
| ~1 | Perfect prediction, impossible in practice |
| ~10–20 | Strong language model on its training domain |
| ~50–100 | Reasonable model or out-of-domain text |
| ~$V$ ≈ 50,000 | Uniform over vocabulary: random baseline |

Perplexity is not comparable across tokenisers. A model with a 32K vocabulary and a model with a 100K vocabulary will have different perplexities on the same text, because the token boundaries are different — each "token" covers a different amount of linguistic content. Always compare perplexity numbers produced by the same tokeniser. See PerplexityUnitConverter.
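A toy calculation shows why the token count alone can move perplexity. The sketch assumes, purely for illustration, that two tokenisers lead to the same summed NLL on the same text and differ only in how many tokens they produce; all numbers are made up.

```python
import math

# Assumption for illustration: identical summed NLL on the same text.
total_nll_nats = 300.0

tokens_small_vocab = 120  # hypothetical 32K-vocab tokeniser: more, shorter tokens
tokens_large_vocab = 90   # hypothetical 100K-vocab tokeniser: fewer, longer tokens

ppl_small_vocab = math.exp(total_nll_nats / tokens_small_vocab)
ppl_large_vocab = math.exp(total_nll_nats / tokens_large_vocab)

print(f"32K-style tokeniser:  PPL = {ppl_small_vocab:.2f}")
print(f"100K-style tokeniser: PPL = {ppl_large_vocab:.2f}")
```

The model assigned the same probability to the text in both cases, yet the perplexities differ because the average is taken over a different number of tokens.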

Relationship to NLL:

$$\text{PPL} = e^{\text{NLL}/T} \qquad \Leftrightarrow \qquad \text{NLL} = T \cdot \log(\text{PPL})$$

A drop in training loss of 0.1 nats/token corresponds to a multiplicative change in perplexity of $e^{-0.1} \approx 0.905$, roughly a 10% reduction. The relative change is the same at any loss level, but early in training (high loss) the same absolute loss drop produces a much larger absolute perplexity reduction than late in training (low loss).
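The arithmetic can be checked with two example loss levels. The values 6.0 and 2.0 nats/token are arbitrary stand-ins for "early" and "late" in training.

```python
import math

loss_drop = 0.1  # nats/token
ratio = math.exp(-loss_drop)  # multiplicative change in perplexity

def ppl_change(loss_before):
    """Return (perplexity before, perplexity after, absolute reduction)."""
    before = math.exp(loss_before)
    after = math.exp(loss_before - loss_drop)
    return before, after, before - after

early_before, early_after, early_abs = ppl_change(6.0)  # hypothetical early loss
late_before, late_after, late_abs = ppl_change(2.0)     # hypothetical late loss

print(f"ratio after/before: {ratio:.4f}")
print(f"early: {early_before:.1f} -> {early_after:.1f} (drop {early_abs:.2f})")
print(f"late:  {late_before:.2f} -> {late_after:.2f} (drop {late_abs:.2f})")
```

The ratio is identical in both cases; only the absolute size of the perplexity change differs.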


KL divergence

$$D_{\text{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right]$$

The expected log-ratio of $p$ to $q$ under samples from $p$. Measures how much information is lost when using $q$ to approximate $p$. Always $\geq 0$; equals 0 iff $p = q$.

KL is asymmetric: $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$ in general. The direction matters:

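  • Forward KL $D_{\text{KL}}(p \| q)$: penalises $q$ for assigning near-zero probability anywhere $p$ has mass, so $q$ must cover all of $p$'s support. This is the direction minimised by maximum-likelihood training, since minimising the cross-entropy $H(p_{\text{data}}, \pi_\theta)$ minimises $D_{\text{KL}}(p_{\text{data}} \| \pi_\theta)$.
  • Reverse KL $D_{\text{KL}}(q \| p)$: penalises $q$ for putting mass anywhere $p$ does not, so $q$ tends to seek a mode of $p$. Used in variational inference (ELBO). It is also the direction of the RL KL penalty $D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$ in PPO and GRPO, where the expectation is taken over samples from the policy being optimised.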

See KLDirectionDemo for an interactive comparison.
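For a non-interactive version of the same idea, the sketch below compares two candidate approximations of a bimodal target under both directions. All three distributions are hand-picked toy examples over three outcomes.

```python
import math

def kl_divergence(a, b):
    """D_KL(a || b) in nats."""
    return sum(ax * math.log(ax / bx) for ax, bx in zip(a, b) if ax > 0)

# Toy target p: two modes with a low-probability valley between them.
p = [0.495, 0.01, 0.495]

# Two hypothetical approximations q.
q_mode = [0.98, 0.01, 0.01]    # commits to one mode of p
q_cover = [0.34, 0.32, 0.34]   # spreads mass over everything

forward_mode = kl_divergence(p, q_mode)     # D_KL(p || q)
forward_cover = kl_divergence(p, q_cover)
reverse_mode = kl_divergence(q_mode, p)     # D_KL(q || p)
reverse_cover = kl_divergence(q_cover, p)

print(f"forward KL: mode-seeking q = {forward_mode:.3f}, covering q = {forward_cover:.3f}")
print(f"reverse KL: mode-seeking q = {reverse_mode:.3f}, covering q = {reverse_cover:.3f}")
```

In this example forward KL prefers the covering approximation, because the mode-seeking one nearly ignores half of the target's mass, while reverse KL prefers the mode-seeking one, because the covering one puts mass in the valley where the target has almost none.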

KL from reference in RL training:

$$D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) = \mathbb{E}_{o \sim \pi_\theta}\!\left[\log \pi_\theta(o) - \log \pi_{\text{ref}}(o)\right]$$

In practice estimated per-token and averaged. A healthy RL run keeps this below ~5 nats; above ~10 is a sign of reward hacking in progress. See KLPenaltyTradeoff.
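The expectation above is typically approximated from samples. Here is a minimal Monte Carlo sketch, assuming toy 3-outcome distributions in place of the policy and the reference; real training code works on per-token log-probabilities and may use a different estimator.

```python
import math
import random

# Toy stand-ins for the policy and the reference (hypothetical values).
pi_theta = [0.7, 0.2, 0.1]
pi_ref = [0.5, 0.3, 0.2]

# Exact value, available here only because the distributions are tiny.
exact_kl = sum(p * math.log(p / r) for p, r in zip(pi_theta, pi_ref))

# Sample outcomes from the policy and average the log-ratio.
rng = random.Random(0)
num_samples = 50_000
samples = rng.choices(range(len(pi_theta)), weights=pi_theta, k=num_samples)
log_ratios = [math.log(pi_theta[o]) - math.log(pi_ref[o]) for o in samples]
estimated_kl = sum(log_ratios) / num_samples

print(f"exact KL:     {exact_kl:.4f} nats")
print(f"estimated KL: {estimated_kl:.4f} nats")
```

Individual log-ratios can be negative even though the KL divergence itself is not, so a small batch can produce a noisy or even negative estimate.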


Bits per character / bits per byte

$$\text{BPB} = \frac{\text{NLL (nats)}}{\ln 2 \cdot \text{num\_bytes}} = \frac{\text{NLL (bits)}}{\text{num\_bytes}}$$

BPB (bits per byte) normalises by the number of UTF-8 bytes rather than the number of tokens, making it comparable across tokenisers. A model predicting English text well achieves roughly 1.0–1.3 BPB. Shannon estimated the entropy of printed English at roughly 0.6–1.3 bits per character, so BPB near 1.0 means the model is within the range of those estimates of the compressibility limit for that text.

BPC (bits per character) is the same metric when normalising by character count rather than byte count — equivalent for ASCII, different for multi-byte Unicode.
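The difference between the two normalisations shows up as soon as the text contains multi-byte characters. In this sketch the summed NLL of 10 nats is a made-up number used for both strings.

```python
import math

def bits_per_byte(nll_nats, text):
    """Summed NLL in nats, normalised by UTF-8 byte count and converted to bits."""
    num_bytes = len(text.encode("utf-8"))
    return nll_nats / (math.log(2) * num_bytes)

def bits_per_char(nll_nats, text):
    """Same conversion, normalised by character (code point) count."""
    return nll_nats / (math.log(2) * len(text))

nll_nats = 10.0  # hypothetical summed NLL for each string

ascii_text = "naive cafe"
unicode_text = "na\u00efve caf\u00e9"  # two characters take 2 bytes each in UTF-8

bpb_ascii = bits_per_byte(nll_nats, ascii_text)
bpc_ascii = bits_per_char(nll_nats, ascii_text)
bpb_unicode = bits_per_byte(nll_nats, unicode_text)
bpc_unicode = bits_per_char(nll_nats, unicode_text)

print(f"ASCII:   BPB = {bpb_ascii:.3f}, BPC = {bpc_ascii:.3f}")
print(f"Unicode: BPB = {bpb_unicode:.3f}, BPC = {bpc_unicode:.3f}")
```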


How they relate

$$\underbrace{H(p, q)}_{\text{cross-entropy}} = \underbrace{H(p)}_{\text{entropy of data}} + \underbrace{D_{\text{KL}}(p \| q)}_{\text{model gap}}$$

$$\text{PPL} = e^{H(p_{\text{data}}, \pi_\theta)} = e^{\text{NLL}/T}$$

$$\text{NLL} = T \cdot H(p_{\text{data}}, \pi_\theta) \quad \text{(summed, not averaged)}$$

$$\text{BPB} = \frac{T \cdot H(p_{\text{data}}, \pi_\theta)}{\ln 2 \cdot \text{num\_bytes}} \quad \text{(in bits, per byte)}$$

Here $H(p_{\text{data}}, \pi_\theta)$ is the mean per-token cross-entropy in nats and $T$ is the number of tokens.

A single forward pass computes the cross-entropy loss. From that scalar you can immediately derive perplexity (exponentiate), summed NLL (multiply by the number of tokens), and BPB (take the summed NLL and divide by $\ln 2$ and by the byte count). They are all the same quantity, scaled and transformed for readability in different contexts.
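As a compact recap, the sketch below starts from one mean loss value and derives the other metrics. The loss, token count, and byte count are made-up numbers; in practice they would come from the evaluation run and the raw text.

```python
import math

# Hypothetical evaluation results.
mean_loss_nats = 2.3   # mean per-token cross-entropy, as a mean-reduced loss reports it
num_tokens = 1000      # tokens in the evaluated text
num_bytes = 4200       # UTF-8 bytes in the same text

perplexity = math.exp(mean_loss_nats)
summed_nll_nats = mean_loss_nats * num_tokens
bits_per_token = mean_loss_nats / math.log(2)
bits_per_byte = summed_nll_nats / (math.log(2) * num_bytes)

# Round trip: summed NLL can be recovered from perplexity and token count.
assert math.isclose(summed_nll_nats, num_tokens * math.log(perplexity))

print(f"perplexity:     {perplexity:.2f}")
print(f"summed NLL:     {summed_nll_nats:.1f} nats")
print(f"bits per token: {bits_per_token:.3f}")
print(f"bits per byte:  {bits_per_byte:.3f}")
```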

Cite this work

Roy, Swastik. (2025). Cheatsheet: Soft Metrics. S. Roy. https://swastikroy.me/blog/cheatsheet-soft-metrics