Putting It All Together — From Vectors to Transformers
A tour through the whole series: how vectors, matrices, eigendecomposition, SVD, and least squares combine to explain the mathematical machinery inside modern ML systems — from PCA to attention to gradient descent.
This is the final post in the series. No new math — just synthesis. We'll walk through a complete story from raw data to transformer, with each step grounded in the concepts we've built.
1. Data Lives in a Vector Space
Every piece of data is a vector. An image with $h \times w$ pixels is a vector in $\mathbb{R}^{hw}$. A sentence tokenized into $T$ tokens is a sequence of $T$ vectors in $\mathbb{R}^d$ (one per token, after embedding).
In Post 1 we defined vectors, their geometry, and what it means to add them and scale them. In Post 2 we saw that vector spaces are the natural habitat for data — closed under addition and scaling, spanned by a basis.
The key insight: the dimension of the space is the number of features. Working in $\mathbb{R}^n$ for large $n$ is conceptually the same as working in $\mathbb{R}^2$ or $\mathbb{R}^3$, just with more axes.
2. Linear Maps Transform the Space
A matrix $A \in \mathbb{R}^{m \times n}$ is a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$. It stretches, rotates, and projects.
In Post 3 we studied how matrices compose (multiply), what the rank says about a matrix's image, and how to think of column space and null space geometrically.
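As a small illustration of composition, rank, and null space, here is a minimal NumPy sketch. The matrices `A` and `B` and the vectors `x` and `n` are hypothetical toy values chosen for this example, not taken from Post 3.

```python
import numpy as np

# Hypothetical toy maps: A sends R^3 -> R^2, B sends R^2 -> R^3.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
B = np.array([[2.0, 0.0],
              [0.0, 0.0],
              [1.0, 0.0]])

x = np.array([1.0, 2.0])

# Composing maps is multiplying matrices: apply B first, then A.
composed = A @ (B @ x)
product = (A @ B) @ x

# Rank = dimension of the image (column space).
rank_A = np.linalg.matrix_rank(A)
rank_B = np.linalg.matrix_rank(B)
rank_AB = np.linalg.matrix_rank(A @ B)   # cannot exceed min(rank_A, rank_B)

# A vector in the null space of A is sent to zero.
n = np.array([1.0, 1.0, -1.0])
image_of_n = A @ n
```

Here `A` has rank 2 on a 3-dimensional input space, so its null space is 1-dimensional (rank plus nullity equals the number of columns), and `n` spans it.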
In Post 4 we added the inner product, a way to measure angles and lengths in vector spaces. The inner product lets us define similarity as the cosine of the angle between two vectors, and this kind of dot-product similarity is the foundation of attention.
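Cosine similarity can be sketched in a few lines. The helper name `cosine_similarity` and the vectors `a`, `b`, `c` are illustrative assumptions. Note that standard transformer attention uses scaled dot products rather than fully normalized cosines, so this shows the geometric idea rather than the exact attention score.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: <u, v> / (||u|| * ||v||)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])   # 45 degrees away from a
c = np.array([0.0, 2.0])   # orthogonal to a

sim_ab = cosine_similarity(a, b)   # cos(45 degrees)
sim_ac = cosine_similarity(a, c)   # orthogonal vectors have similarity 0
```

The length of `c` does not affect `sim_ac`: cosine similarity depends only on direction.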
3. Eigendecomposition: Finding the Natural Axes
Given a square matrix $A$, its eigenvectors are the directions that $A$ only scales, without rotating:

$$A v = \lambda v$$
For symmetric matrices (covariance matrices, graph Laplacians, Hessians), the spectral theorem (Post 10) guarantees:

$$A = Q \Lambda Q^\top$$

with orthogonal $Q$ and real diagonal $\Lambda$. The eigenvectors are the natural coordinate system of the transformation.
PCA is the canonical application. The covariance matrix $C = \frac{1}{n-1} X^\top X$ (for centered data $X$ with $n$ samples) is symmetric PSD. Its eigenvectors (principal components) are the directions of maximum variance. Project data onto the top-$k$ eigenvectors to reduce dimensionality while retaining as much variance as possible.
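A minimal sketch of PCA via eigendecomposition of the covariance matrix, assuming NumPy. The toy data `X`, its per-feature scales, and the choice `k = 2` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 samples, 3 features with very different spreads.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)               # center each feature
C = Xc.T @ Xc / (Xc.shape[0] - 1)     # sample covariance: symmetric PSD

# eigh is the solver for symmetric matrices; eigenvalues come back ascending.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]     # reorder by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
Z = Xc @ eigvecs[:, :k]               # project onto the top-k principal components
explained = eigvals[:k].sum() / eigvals.sum()   # fraction of variance retained
```

The variance of each projected coordinate in `Z` equals the corresponding eigenvalue, which is why sorting by eigenvalue ranks directions by variance.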
4. SVD: Eigendecomposition for Non-Square Matrices
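Most data matrices $X \in \mathbb{R}^{n \times d}$ are non-square ($n$ samples, $d$ features, rarely $n = d$). The SVD from Post 9 generalizes eigendecomposition:

$$X = U \Sigma V^\top$$

- $V$ columns: directions in feature space (right singular vectors = principal components)
- $\Sigma$ diagonal: singular values $\sigma_i = \sqrt{\lambda_i}$, where $\lambda_i$ are the eigenvalues of $X^\top X$
- $U$ columns: directions in sample space
SVD gives us:
- Low-rank approximation: keep the top-$k$ singular values → compress the matrix
- Pseudoinverse: $X^+ = V \Sigma^+ U^\top$ → solve least squares even when $X^\top X$ is singular
- LoRA: fine-tune only a low-rank delta $\Delta W = BA$, with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$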
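The first two items in the list above can be sketched with NumPy. The random matrix `X` and the rank `k = 2` are illustrative assumptions. The hand-built pseudoinverse assumes every singular value is nonzero; `np.linalg.pinv` handles the rank-deficient case by thresholding small singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))           # hypothetical 6 samples x 4 features

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Singular values are square roots of the eigenvalues of X^T X.
eig_XtX = np.linalg.eigvalsh(X.T @ X)[::-1]

# Rank-k approximation: keep only the k largest singular values.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err = np.linalg.norm(X - X_k, ord="fro")

# Pseudoinverse from the SVD (valid here because all s[i] > 0).
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
```

The Frobenius error of the rank-$k$ truncation equals the square root of the sum of the squared discarded singular values (Eckart-Young), which is why truncating the SVD is the best low-rank compression in that norm.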
5. Solving Systems: Least Squares and Normal Equations
When we train a linear model, we minimize $\|Xw - y\|_2^2$. The minimizer satisfies the normal equations (Post 8):

$$X^\top X w = X^\top y$$

The matrix $X^\top X$ is symmetric positive semidefinite. If it's invertible (positive definite), the unique solution is:

$$\hat{w} = (X^\top X)^{-1} X^\top y$$
We covered the factorizations that make this numerically tractable in Post 7: LU for general systems, QR for least squares (more stable), Cholesky (Post 11) for PD systems.
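The three routes to the same least-squares solution can be compared in a short NumPy sketch. The synthetic data (`X`, `w_true`, and the noise level) are assumptions for illustration. For brevity, the triangular systems are solved with the general `np.linalg.solve`; a dedicated triangular solver such as `scipy.linalg.solve_triangular` would exploit the structure.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # hypothetical design matrix
w_true = np.array([2.0, -1.0, 0.5])             # hypothetical ground truth
y = X @ w_true + 0.01 * rng.normal(size=50)     # small additive noise

# 1) Normal equations: solve (X^T X) w = X^T y directly.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# 2) QR: X = QR with orthonormal Q, so the problem reduces to R w = Q^T y.
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ y)

# 3) Cholesky: X^T X = L L^T, then two triangular solves.
L = np.linalg.cholesky(X.T @ X)
z = np.linalg.solve(L, X.T @ y)
w_chol = np.linalg.solve(L.T, z)

# Reference answer from NumPy's SVD-based least-squares routine.
w_ref = np.linalg.lstsq(X, y, rcond=None)[0]
```

On a well-conditioned problem like this one all three agree. QR avoids forming $X^\top X$, which squares the condition number, so it tends to hold up better when the columns of $X$ are nearly dependent.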
6. Neural Networks: Composing Linear Maps with Nonlinearities
A neural network layer is $h = \sigma(Wx + b)$, where $\sigma$ is a nonlinearity.
As we showed in Post 12:
- Without nonlinearities, deep networks collapse to one linear map
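- Backpropagation is the chain rule applied to matrix operations: gradients flow backward through $W^\top$
- A Hessian that is positive definite everywhere (Post 11) makes the loss strictly convex, so any minimum is unique; at a single critical point, a positive definite Hessian guarantees only a strict local minimum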
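The first two points in the list above can be checked numerically. This is a minimal sketch with hypothetical 2-dimensional weights `W1`, `W2` and input `x`, chosen so that one pre-activation is negative and ReLU has a visible effect.

```python
import numpy as np

# Hypothetical toy weights and input.
W1 = np.array([[1.0, -1.0],
               [0.0,  1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 2.0])

def relu(z):
    return np.maximum(z, 0.0)

# Two linear layers with no nonlinearity equal one linear map W2 @ W1.
deep_linear = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x

# With ReLU in between, the network is no longer a single matrix.
deep_relu = W2 @ relu(W1 @ x)

# Backprop through the linear stack: the gradient with respect to x is the
# upstream gradient pushed back through the transposed weights.
upstream = np.array([1.0])            # d(loss)/d(output), assumed for illustration
grad_x = W1.T @ (W2.T @ upstream)
```

Here `W1 @ x` is `[-1, 2]`, so ReLU zeroes the first coordinate and `deep_relu` differs from `collapsed`.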
Attention in transformers is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

$QK^\top$ is a matrix of inner products, measuring similarity between queries and keys. Softmax makes each row sum to one (row-stochastic). The output is a weighted sum of value vectors. Pure linear algebra, with softmax as the only nonlinearity.
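A minimal single-head sketch of scaled dot-product attention in NumPy, without masking, batching, or learned projections. The helper names `softmax` and `attention` and the toy sizes `T`, `d_k`, `d_v` are assumptions for this example.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one head, no masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (T, T) matrix of query-key inner products
    weights = softmax(scores)         # each row is a probability distribution
    return weights @ V, weights       # each output row is a weighted sum of value rows

rng = np.random.default_rng(0)
T, d_k, d_v = 4, 8, 5                 # hypothetical sequence length and dimensions
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))

out, weights = attention(Q, K, V)
```

Because each row of `weights` is nonnegative and sums to one, every output row is a convex combination of the rows of `V`.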
7. The Full ML Pipeline
| Step | Math | Post |
|---|---|---|
| Raw data | Vectors in $\mathbb{R}^n$ | 1, 2 |
| Preprocessing | Linear maps, projections | 3, 4 |
| Dimensionality reduction | PCA = SVD of covariance | 5, 9 |
| Linear model training | Least squares, normal equations | 8 |
| Deep learning forward pass | Matrix multiplications + nonlinearities | 12 |
| Attention mechanism | Inner products + softmax | 12 |
| Fine-tuning (LoRA) | Low-rank matrix decomposition | 9, 12 |
| Optimization (Newton) | Hessian = PD matrix → unique min | 11 |
Quick Review Quiz
Test yourself on the key ideas from the series:
Question 1 of 5: What does the spectral theorem say about a real symmetric matrix $A$?
Where to Go Next
You now have a working foundation in the linear algebra that underlies modern ML. Here are natural next steps:
Differential equations — Many dynamical systems (RNNs, differential transformers, physics-informed networks) are governed by ODEs/PDEs. The eigenvalues of the system matrix determine stability. Highly recommended: Strang's Differential Equations and Linear Algebra.
Fourier analysis — The discrete Fourier transform (DFT) is a linear transformation: $\hat{x} = Fx$, where $F$ is a unitary matrix when normalized by $1/\sqrt{n}$ (unitary is the analogue of orthogonal over $\mathbb{C}$). Circular convolution in the spatial domain = element-wise multiplication in the Fourier domain. CNNs are intimately related.
Probability and statistics — The multivariate Gaussian distribution is entirely described by its mean vector and covariance matrix (symmetric PSD). Maximum likelihood estimation under Gaussian noise leads to least squares. Bayesian inference involves PD matrices everywhere.
Numerical linear algebra — How do you actually compute eigenvalues for a very large matrix? Power iteration, Lanczos, randomized SVD. The algorithms matter enormously at scale.
Convex optimization — Gradient descent, Newton's method, proximal methods — all build on the linear algebra of Hessians, gradients (vectors), and constraint matrices. Boyd & Vandenberghe's Convex Optimization is the standard reference.
Closing Thought
Linear algebra is the operating system of machine learning. Every tensor operation, every gradient update, every attention pattern is a linear algebraic object. Understanding what the matrices are doing — geometrically, not just numerically — is what separates someone who can train models from someone who can design them.
Thanks for following along. The math is only the beginning.