S. Roy

Blog Post

SVD: The Geometry of Every Matrix

Every matrix, no matter how ugly, factors into a rotation, a scaling, and another rotation. That single fact — the singular value decomposition — is the engine behind PCA, image compression, latent semantic analysis, and the low-rank tricks that let us fine-tune giant models cheaply.

Views: 9 min readCite

The first post in this series was about one meaning of the word rank — the ordering of competitors by strength. This post and the next are about the other meaning, the rank of a matrix, and the bridge between the two is the most useful factorization in applied mathematics: the singular value decomposition. Its claim is audacious in its generality. Take any matrix at all — rectangular, singular, full of noise, it does not matter — and it can be written as a rotation, followed by a simple axis-aligned scaling, followed by another rotation. There are no exceptions and no fine print. Once you believe that, a surprising amount of machine learning stops looking like a bag of tricks and starts looking like consequences of one geometric picture.

The factorization

For any real matrix WRm×nW \in \mathbb{R}^{m \times n}, the SVD is

W=UΣVW = U \Sigma V^{\top}

where URm×mU \in \mathbb{R}^{m \times m} and VRn×nV \in \mathbb{R}^{n \times n} are orthogonal — their columns are orthonormal, so they only rotate and reflect, never stretch — and ΣRm×n\Sigma \in \mathbb{R}^{m \times n} is diagonal with non-negative entries σ1σ20\sigma_1 \geq \sigma_2 \geq \cdots \geq 0 called the singular values, written in descending order. The columns of UU are the left singular vectors and the columns of VV are the right singular vectors. The defining relationship is beautifully clean: Wvi=σiuiW v_i = \sigma_i u_i, so each right singular vector is mapped by WW onto the corresponding left singular vector, stretched by exactly its singular value.

Reading it as geometry

The factorization is best understood by applying WW to a vector xx and watching the three stages in order, because Wx=UΣVxW x = U \Sigma V^{\top} x is evaluated right to left.

First VV^{\top} rotates the input. Because VV is orthogonal, VxV^{\top} x just re-expresses xx in the coordinate system whose axes are the right singular vectors; nothing is stretched, lengths and angles are preserved. Then Σ\Sigma scales along each of those new axes independently — the first coordinate is multiplied by σ1\sigma_1, the second by σ2\sigma_2, and so on — which is the only stage that changes the shape of things, turning the unit sphere into an ellipsoid whose axis lengths are the singular values. Finally UU rotates the stretched result into the output space. So the action of any linear map decomposes into rotate, stretch, rotate, and the singular values are the complete inventory of how much the map stretches space and the singular vectors tell you the directions in which it does so. The largest singular value σ1\sigma_1 is literally the maximum factor by which WW can lengthen any unit vector, and the direction that gets stretched most is v1v_1.

The rank of WW — the dimension of its output space, the number of independent directions it can produce — is simply the number of nonzero singular values. If σr+1\sigma_{r+1} through the end are all zero, the map collapses everything onto an rr-dimensional subspace, and the SVD has handed you that subspace explicitly as the first rr columns of UU.

Eckart–Young: truncation is optimal

Here is the theorem that makes the SVD indispensable rather than merely elegant. Suppose you are forced to approximate WW by a matrix of rank at most kk — you only get to keep kk stretch directions. Which ones should you keep? The intuitive answer, "the biggest," is provably the best one. Form the truncated SVD by keeping only the top kk components,

Wk=UkΣkVkW_k = U_k \Sigma_k V_k^{\top}

where UkU_k and VkV_k are the first kk columns of UU and VV and Σk\Sigma_k is the top-left k×kk \times k block. The Eckart–Young theorem states that this is the best rank-kk approximation in the Frobenius norm — no other rank-kk matrix gets closer:

Wk=argminrank(A)kWAFW_k = \arg\min_{\operatorname{rank}(A) \leq k} \lVert W - A \rVert_F

and it tells you exactly how much you give up, namely the energy in the discarded tail of the spectrum:

WWkF=i>kσi2\lVert W - W_k \rVert_F = \sqrt{\sum_{i > k} \sigma_i^2}

The practical reading is that the approximation error is governed entirely by the singular values you throw away. If the spectrum decays fast — if σ1\sigma_1 towers over σ50\sigma_{50} — then a handful of directions hold almost all the matrix's energy and the rest can be dropped almost for free. The whole edifice of low-rank approximation, the subject of the next post, rests on this one identity.

Connection to the eigendecomposition

The SVD is not floating free of the rest of linear algebra; it is wired directly into eigenvalues. Form the symmetric matrix WWW^{\top} W and substitute the factorization:

WW=VΣUUΣV=VΣΣV=VΣ2VW^{\top} W = V \Sigma^{\top} U^{\top} U \Sigma V^{\top} = V \Sigma^{\top} \Sigma V^{\top} = V \Sigma^2 V^{\top}

using UU=IU^{\top} U = I. This is exactly the eigendecomposition of WWW^{\top}W: the right singular vectors of WW are the eigenvectors of WWW^{\top}W, and the singular values are the square roots of its eigenvalues. Symmetrically, the left singular vectors are the eigenvectors of WWW W^{\top}. This is more than a curiosity — it is one way the SVD is actually computed, and it explains why singular values are always real and non-negative even when WW is rectangular and has no eigenvalues of its own: WWW^{\top}W is symmetric and positive semidefinite, so its eigenvalues are real and non-negative, and their square roots are the singular values. For a symmetric positive semidefinite matrix the SVD and the eigendecomposition coincide outright, which is what makes the worked example below come out so cleanly.

Numerical rank and the decay that makes everything work

In theory a matrix is either rank-rr or it is not. In practice the distinction blurs, and that blur is exactly where the value lives. A matrix can be technically full rank — every singular value strictly positive — yet have a spectrum that falls off a cliff: σ1,σ2,σ3\sigma_1, \sigma_2, \sigma_3 large, then a long tail of values near machine epsilon. Its effective rank is the number of singular values large enough to matter, and it can be a tiny fraction of the nominal rank. Plot the singular values of a weight matrix from a trained transformer and you typically see precisely this shape — a steep initial drop and a long flat tail — which says, in the language of Eckart–Young, that the matrix is almost low-rank, that most of what it does lives in a small subspace. That empirical fact is not a theorem and it is not guaranteed, but it holds often enough that the entire low-rank-adaptation literature is built on it.

Where it shows up

The same factorization, repurposed, drives a remarkable range of methods. Principal Component Analysis is the SVD of the mean-centered data matrix: the top right singular vectors are the principal directions of variation, and projecting the data onto the top kk of them is the optimal kk-dimensional linear summary — optimal, again, by Eckart–Young. Image compression stores UkU_k, Σk\Sigma_k, and VkV_k instead of the full pixel grid; because photographs have fast-decaying spectra, keeping a few dozen components reproduces an image that looks essentially identical at a fraction of the storage. Latent Semantic Analysis runs the SVD on a word–document matrix and reads the top components as latent topics, so that words appearing in similar contexts land near each other even when they never co-occur — the original "embeddings," decades before the term. And the case that motivates this series: LLM weight matrices have fast singular-value decay, which is precisely the structure that LoRA exploits to adapt billion-parameter models by training a few million numbers.

Computing it at scale

The classical algorithms compute the full SVD in roughly O(mnmin(m,n))O(mn\min(m, n)) time, which is fine for modest matrices and ruinous for the enormous ones in modern ML, especially when you only want the top few singular directions and not the whole decomposition. The standard modern tool is randomized SVD (Halko, Martinsson and Tropp, 2011). The idea is disarmingly simple: multiply WW by a small random matrix to get a tall, thin sketch whose columns are random linear combinations of WW's columns. Because the action of WW is dominated by its top singular directions, that random sketch almost surely captures the dominant subspace; orthonormalize it, project WW onto it, and take an exact SVD of the resulting small matrix. The whole procedure costs about O(mnk)O(mnk) for a rank-kk approximation and is the workhorse behind fast PCA and large-scale low-rank factorization. You trade an exact answer for an approximate one whose error, thanks to Eckart–Young, you can bound by the tail you did not bother to compute.

A worked example

Take the symmetric matrix

W=[211121112]W = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}

Because it is symmetric and positive definite, its SVD coincides with its eigendecomposition. Notice that W=I+11W = I + \mathbf{1}\mathbf{1}^{\top}, the identity plus the all-ones matrix. The all-ones matrix has the single nonzero eigenvalue 33 along the direction v1=13(1,1,1)v_1 = \tfrac{1}{\sqrt{3}}(1, 1, 1)^{\top}, so WW has eigenvalue 1+3=41 + 3 = 4 there — you can check directly that Wv1=(4,4,4)/3=4v1W v_1 = (4, 4, 4)^{\top}/\sqrt{3} = 4 v_1. Every direction orthogonal to v1v_1, such as (1,1,0)(1, -1, 0)^{\top}, is left alone by the all-ones part and so has eigenvalue 1+0=11 + 0 = 1. The singular values are therefore

σ1=4,σ2=1,σ3=1\sigma_1 = 4, \qquad \sigma_2 = 1, \qquad \sigma_3 = 1

and the best rank-1 approximation keeps only the top component, W1=σ1v1v1W_1 = \sigma_1 v_1 v_1^{\top}. Since v1v1=1311v_1 v_1^{\top} = \tfrac{1}{3}\mathbf{1}\mathbf{1}^{\top} is one-third of the all-ones matrix,

W1=41311=[4/34/34/34/34/34/34/34/34/3]W_1 = 4 \cdot \tfrac{1}{3}\,\mathbf{1}\mathbf{1}^{\top} = \begin{bmatrix} 4/3 & 4/3 & 4/3 \\ 4/3 & 4/3 & 4/3 \\ 4/3 & 4/3 & 4/3 \end{bmatrix}

Eckart–Young promises the error should be σ22+σ32=1+1=2\sqrt{\sigma_2^2 + \sigma_3^2} = \sqrt{1 + 1} = \sqrt{2}, and the residual WW1W - W_1 does indeed have Frobenius norm 2\sqrt{2} — its energy is exactly the two singular values we discarded, no more and no less. That single number, the square root of the dropped tail, is the price of compression, and it is the quantity the next post spends in order to fit massive models on small machines.

For a longer treatment that goes straight from this factorization into the mechanics of LoRA, see the companion post SVD and Low-Rank Approximation: The Math Behind LoRA; the next post in this series takes the wider view, surveying the full landscape of rank-reduction strategies from truncated SVD to QLoRA, DoRA, and nuclear-norm minimization.

Cite this work

Generated from article front matter.

Roy, Swastik. (2026). SVD: The Geometry of Every Matrix. S. Roy. https://swastikroy.me/blog/svd-singular-value-decomposition

Export PDF opens your browser’s print dialog — choose “Save as PDF” for a Zenodo-ready file.