Tag: math

Blog Post·2024-06-19·18 min read

From Likelihood to Perplexity: One Unified Reference

Probability, likelihood, log-likelihood, NLL, expectation, entropy, cross-entropy, KL divergence, and perplexity are constantly confused for one another. They are not interchangeable, but they are closely related views of the same underlying quantities. This deep-dive names each one precisely and shows exactly how they connect for LLMs.

math probability information-theory perplexity llm-training

Blog Post·2024-06-19·6 min read

Activation Functions: The Nonlinearity That Makes Neural Networks Work

Without nonlinearity, a stack of linear layers collapses to a single linear map. Activation functions break that linearity, and the choice of activation affects expressivity, gradient flow, and training efficiency.

math activation-functions neural-networks llm-training

Blog Post·2024-06-19·8 min read

Evaluation Metrics: Precision, Recall, Calibration, and Confidence

How do you measure whether a model is actually good? The answer is a set of metrics — precision, recall, F1, perplexity, calibration, confidence intervals — each measuring something different and failing in a different way.

math evaluation metrics statistics llm

Blog Post·2024-06-19·6 min read

Information Theory for LLMs: Mutual Information, Entropy, and What Models Learn

Information theory gives precise answers to questions like: how much does the context tell you about the next token? What information is preserved in a representation? Why do compression and prediction point to the same objective?

math information-theory entropy representation-learning

Blog Post·2024-06-19·6 min read

Linear Algebra for LLMs: Vectors, Matrices, and What They Do

Every forward pass is built largely from matrix multiplications. Understanding what those matrices do (rotate, scale, project) is the foundation for understanding why transformers work.

math linear-algebra transformers

Blog Post·2024-06-19·5 min read

Loss Functions: What You Optimize Is What You Get

The loss function is the specification. Everything the model learns is in service of minimizing it. Here's the math behind every major loss used in LLM training and fine-tuning.

math loss-functions llm-training alignment

Blog Post·2024-06-19·6 min read

Optimization for LLMs: Gradient Descent to Adam

Training a neural network is an optimization problem: minimize a loss function over billions of parameters. The journey from vanilla gradient descent to Adam reveals why each step was necessary.

math optimization gradient-descent adam llm-training

Blog Post·2024-06-19·5 min read

Probability for LLMs: Distributions, Entropy, and KL Divergence

A language model is a probability distribution over sequences. Training it means pushing that distribution toward the data distribution. The math of how you measure and minimize that gap is what this post covers.

math probability information-theory llm-training

Blog Post·2024-06-19·7 min read

Putting It Together: The Mathematics of a Training Run

A single training step involves linear algebra, probability, information theory, optimization, and statistical estimation — all at once. Here's how the pieces fit.

math llm-training transformers synthesis

Blog Post·2024-06-19·7 min read

Statistical Foundations: Distributions, Estimation, and Sampling

Language models are probabilistic systems. Understanding the statistical machinery behind maximum likelihood estimation, Bayesian inference, and sampling algorithms clarifies why training and decoding work the way they do.

math statistics sampling bayesian llm-training

Blog Post·2024-06-19·8 min read

SVD and Low-Rank Approximation: The Math Behind LoRA

LoRA works because the weight updates needed for fine-tuning are empirically low-rank. SVD explains what that means, and the math of low-rank approximation explains how LoRA exploits it to fine-tune billion-parameter models on modest hardware.

math svd lora fine-tuning linear-algebra