Probability, likelihood, log-likelihood, NLL, expectation, entropy, cross-entropy, KL divergence, and perplexity are constantly confused for one another. They are not interchangeable, but they are closely related views of the same underlying quantities. This is the definitive deep-dive that names each one precisely and shows exactly how they connect for LLMs.
Without nonlinearity, stacking layers collapses to a single matrix multiplication. Activation functions break that linearity, and which one you choose shapes expressivity, gradient flow, and training efficiency.
How do you measure whether a model is actually good? The answer is a set of metrics — precision, recall, F1, perplexity, calibration, confidence intervals — each measuring something different and failing in a different way.
Information theory gives precise answers to questions like: how much does the context tell you about the next token? What information is preserved in a representation? Why do compression and prediction point to the same objective?
Every forward pass is a sequence of matrix multiplications. Understanding what those matrices do — rotate, scale, project — is the foundation for understanding why transformers work.
The loss function is the specification. Everything the model learns is in service of minimizing it. Here's the math behind every major loss used in LLM training and fine-tuning.
Training a neural network is an optimization problem: minimize a loss function over billions of parameters. The journey from vanilla gradient descent to Adam reveals why each step was necessary.
A language model is a probability distribution over sequences. Training it means pushing that distribution toward the data distribution. The math of how you measure and minimize that gap is what this post covers.
A single training step involves linear algebra, probability, information theory, optimization, and statistical estimation — all at once. Here's how the pieces fit.
Language models are probabilistic systems. Understanding the statistical machinery behind maximum likelihood estimation, Bayesian inference, and sampling algorithms clarifies why training and decoding work the way they do.
LoRA works because the weight updates learned during fine-tuning are empirically low-rank. SVD explains why that matters, and the math of low-rank approximation explains how LoRA exploits it to fine-tune billion-parameter models on a laptop.
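
To make the low-rank idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not how LoRA is trained: the matrix `delta_w`, its shape, and the rank `r` are all hypothetical, and LoRA learns its two factors by gradient descent rather than computing them from an SVD. The sketch only shows that a nearly rank-`r` update can be stored as two thin matrices with very little error.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 4  # hypothetical layer shape and rank

# Hypothetical fine-tuning update: exactly rank r, plus a little noise.
delta_w = rng.normal(size=(d_out, r)) @ rng.normal(size=(r, d_in))
delta_w += 1e-3 * rng.normal(size=(d_out, d_in))

# Singular values come back sorted from largest to smallest.
u, s, vt = np.linalg.svd(delta_w, full_matrices=False)

# Keep the top r singular triplets: the best rank-r approximation
# in Frobenius norm (Eckart-Young theorem).
b = u[:, :r] * s[:r]  # shape (d_out, r)
a = vt[:r, :]         # shape (r, d_in)
approx = b @ a

# Relative error of the rank-r reconstruction.
rel_err = np.linalg.norm(delta_w - approx) / np.linalg.norm(delta_w)

# The error equals the energy in the discarded singular values.
tail_err = np.sqrt(np.sum(s[r:] ** 2))

# Parameter counts: full update versus the two thin factors.
full_params = d_out * d_in
lora_params = r * (d_out + d_in)

print(f"relative error: {rel_err:.2e}")
print(f"parameters: {full_params} full vs {lora_params} factored")
```

In this toy setup the factored form stores 448 numbers instead of 3,072, and the saving grows with layer size because the factored count scales with `d_out + d_in` rather than `d_out * d_in`.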