Blog Post
SVD: The Geometry of Every Matrix
Every matrix, no matter how ugly, factors into a rotation, a scaling, and another rotation. That single fact — the singular value decomposition — is the engine behind PCA, image compression, latent semantic analysis, and the low-rank tricks that let us fine-tune giant models cheaply.
Views: –9 min readCite
The first post in this series was about one meaning of the word rank — the ordering of competitors by strength. This post and the next are about the other meaning, the rank of a matrix, and the bridge between the two is the most useful factorization in applied mathematics: the singular value decomposition. Its claim is audacious in its generality. Take any matrix at all — rectangular, singular, full of noise, it does not matter — and it can be written as a rotation, followed by a simple axis-aligned scaling, followed by another rotation. There are no exceptions and no fine print. Once you believe that, a surprising amount of machine learning stops looking like a bag of tricks and starts looking like consequences of one geometric picture.
The factorization
For any real matrix , the SVD is
where and are orthogonal — their columns are orthonormal, so they only rotate and reflect, never stretch — and is diagonal with non-negative entries called the singular values, written in descending order. The columns of are the left singular vectors and the columns of are the right singular vectors. The defining relationship is beautifully clean: , so each right singular vector is mapped by onto the corresponding left singular vector, stretched by exactly its singular value.
Reading it as geometry
The factorization is best understood by applying to a vector and watching the three stages in order, because is evaluated right to left.
First rotates the input. Because is orthogonal, just re-expresses in the coordinate system whose axes are the right singular vectors; nothing is stretched, lengths and angles are preserved. Then scales along each of those new axes independently — the first coordinate is multiplied by , the second by , and so on — which is the only stage that changes the shape of things, turning the unit sphere into an ellipsoid whose axis lengths are the singular values. Finally rotates the stretched result into the output space. So the action of any linear map decomposes into rotate, stretch, rotate, and the singular values are the complete inventory of how much the map stretches space and the singular vectors tell you the directions in which it does so. The largest singular value is literally the maximum factor by which can lengthen any unit vector, and the direction that gets stretched most is .
The rank of — the dimension of its output space, the number of independent directions it can produce — is simply the number of nonzero singular values. If through the end are all zero, the map collapses everything onto an -dimensional subspace, and the SVD has handed you that subspace explicitly as the first columns of .
Eckart–Young: truncation is optimal
Here is the theorem that makes the SVD indispensable rather than merely elegant. Suppose you are forced to approximate by a matrix of rank at most — you only get to keep stretch directions. Which ones should you keep? The intuitive answer, "the biggest," is provably the best one. Form the truncated SVD by keeping only the top components,
where and are the first columns of and and is the top-left block. The Eckart–Young theorem states that this is the best rank- approximation in the Frobenius norm — no other rank- matrix gets closer:
and it tells you exactly how much you give up, namely the energy in the discarded tail of the spectrum:
The practical reading is that the approximation error is governed entirely by the singular values you throw away. If the spectrum decays fast — if towers over — then a handful of directions hold almost all the matrix's energy and the rest can be dropped almost for free. The whole edifice of low-rank approximation, the subject of the next post, rests on this one identity.
Connection to the eigendecomposition
The SVD is not floating free of the rest of linear algebra; it is wired directly into eigenvalues. Form the symmetric matrix and substitute the factorization:
using . This is exactly the eigendecomposition of : the right singular vectors of are the eigenvectors of , and the singular values are the square roots of its eigenvalues. Symmetrically, the left singular vectors are the eigenvectors of . This is more than a curiosity — it is one way the SVD is actually computed, and it explains why singular values are always real and non-negative even when is rectangular and has no eigenvalues of its own: is symmetric and positive semidefinite, so its eigenvalues are real and non-negative, and their square roots are the singular values. For a symmetric positive semidefinite matrix the SVD and the eigendecomposition coincide outright, which is what makes the worked example below come out so cleanly.
Numerical rank and the decay that makes everything work
In theory a matrix is either rank- or it is not. In practice the distinction blurs, and that blur is exactly where the value lives. A matrix can be technically full rank — every singular value strictly positive — yet have a spectrum that falls off a cliff: large, then a long tail of values near machine epsilon. Its effective rank is the number of singular values large enough to matter, and it can be a tiny fraction of the nominal rank. Plot the singular values of a weight matrix from a trained transformer and you typically see precisely this shape — a steep initial drop and a long flat tail — which says, in the language of Eckart–Young, that the matrix is almost low-rank, that most of what it does lives in a small subspace. That empirical fact is not a theorem and it is not guaranteed, but it holds often enough that the entire low-rank-adaptation literature is built on it.
Where it shows up
The same factorization, repurposed, drives a remarkable range of methods. Principal Component Analysis is the SVD of the mean-centered data matrix: the top right singular vectors are the principal directions of variation, and projecting the data onto the top of them is the optimal -dimensional linear summary — optimal, again, by Eckart–Young. Image compression stores , , and instead of the full pixel grid; because photographs have fast-decaying spectra, keeping a few dozen components reproduces an image that looks essentially identical at a fraction of the storage. Latent Semantic Analysis runs the SVD on a word–document matrix and reads the top components as latent topics, so that words appearing in similar contexts land near each other even when they never co-occur — the original "embeddings," decades before the term. And the case that motivates this series: LLM weight matrices have fast singular-value decay, which is precisely the structure that LoRA exploits to adapt billion-parameter models by training a few million numbers.
Computing it at scale
The classical algorithms compute the full SVD in roughly time, which is fine for modest matrices and ruinous for the enormous ones in modern ML, especially when you only want the top few singular directions and not the whole decomposition. The standard modern tool is randomized SVD (Halko, Martinsson and Tropp, 2011). The idea is disarmingly simple: multiply by a small random matrix to get a tall, thin sketch whose columns are random linear combinations of 's columns. Because the action of is dominated by its top singular directions, that random sketch almost surely captures the dominant subspace; orthonormalize it, project onto it, and take an exact SVD of the resulting small matrix. The whole procedure costs about for a rank- approximation and is the workhorse behind fast PCA and large-scale low-rank factorization. You trade an exact answer for an approximate one whose error, thanks to Eckart–Young, you can bound by the tail you did not bother to compute.
A worked example
Take the symmetric matrix
Because it is symmetric and positive definite, its SVD coincides with its eigendecomposition. Notice that , the identity plus the all-ones matrix. The all-ones matrix has the single nonzero eigenvalue along the direction , so has eigenvalue there — you can check directly that . Every direction orthogonal to , such as , is left alone by the all-ones part and so has eigenvalue . The singular values are therefore
and the best rank-1 approximation keeps only the top component, . Since is one-third of the all-ones matrix,
Eckart–Young promises the error should be , and the residual does indeed have Frobenius norm — its energy is exactly the two singular values we discarded, no more and no less. That single number, the square root of the dropped tail, is the price of compression, and it is the quantity the next post spends in order to fit massive models on small machines.
For a longer treatment that goes straight from this factorization into the mechanics of LoRA, see the companion post SVD and Low-Rank Approximation: The Math Behind LoRA; the next post in this series takes the wider view, surveying the full landscape of rank-reduction strategies from truncated SVD to QLoRA, DoRA, and nuclear-norm minimization.