S. Roy

Blog Post

Linear Algebra for LLMs: Vectors, Matrices, and What They Do

Every forward pass is a sequence of matrix multiplications. Understanding what those matrices do — rotate, scale, project — is the foundation for understanding why transformers work.


A forward pass through a transformer is, stripped of its names, a long chain of matrix multiplications interrupted by a few nonlinearities. If you want to know why attention attends, why LoRA can fine-tune a seven-billion-parameter model by training a few million numbers, or why weight matrices can be compressed without much loss, the answers all live in linear algebra — not as abstract theorems but as concrete facts about what a matrix does to a vector. So that is where this series starts.

The first object is the vector. A token embedding is a point in $\mathbb{R}^d$: the word "cat" becomes a specific list of $d$ numbers, a single location in a 768-dimensional space for a small model or a 4096-dimensional one for a large model. There is nothing metaphorical about the geometry: the model genuinely places words as points, and the distances between those points encode whatever semantic structure training has managed to discover. Words the model treats as similar end up close together; words it treats as unrelated end up far apart. The embedding matrix is just a lookup table whose $i$-th row is the point assigned to token $i$.
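As a minimal sketch of the lookup-table picture, the snippet below builds a toy embedding matrix with NumPy. The three-word vocabulary, the width `d = 4`, and the random values are all illustrative assumptions standing in for a trained model's embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and embedding width (real models use d = 768, 4096, ...).
vocab = ["cat", "dog", "car"]
d = 4

# Embedding matrix: one row per token. Random numbers stand in for trained values.
E = rng.normal(size=(len(vocab), d))

# Looking up a token is just selecting its row.
token_id = vocab.index("cat")
cat_vec = E[token_id]

# The same lookup written as a one-hot vector times the matrix.
one_hot = np.zeros(len(vocab))
one_hot[token_id] = 1.0
cat_vec_via_matmul = one_hot @ E

# Euclidean distance between two embedded tokens.
dist_cat_dog = np.linalg.norm(E[vocab.index("cat")] - E[vocab.index("dog")])
```

With random values the distance carries no meaning. In a trained model, the same computation is what "close together" and "far apart" refer to.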

A matrix turns one such space into another. A matrix $W \in \mathbb{R}^{m \times n}$ is a linear map that takes a vector in $\mathbb{R}^n$ and returns a vector in $\mathbb{R}^m$, and the cleanest way to read it is column by column: the $j$-th column of $W$ is the place where the $j$-th standard basis vector lands after the transformation.

$$W x = \sum_{j=1}^{n} x_j \, W_{:,j}$$

So applying $W$ to $x$ is nothing more than mixing the columns of $W$ with the coordinates of $x$ as the recipe. Every attention projection that produces queries, keys, and values, and every weight matrix in the feed-forward block, is one of these maps reshaping the residual stream into a new space.
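The column-mixing reading can be checked numerically. The small matrix and vector below are made-up values chosen only so the arithmetic is easy to follow by hand.

```python
import numpy as np

# A hypothetical 2x3 matrix: a linear map from R^3 to R^2.
W = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])
x = np.array([2.0, -1.0, 1.0])

# Direct matrix-vector product.
direct = W @ x

# The same result as a weighted sum of the columns of W,
# with the coordinates of x as the weights.
by_columns = sum(x[j] * W[:, j] for j in range(W.shape[1]))

# The second standard basis vector lands on the second column of W.
e_second = np.array([0.0, 1.0, 0.0])
image_of_e_second = W @ e_second
```

Both `direct` and `by_columns` come out to `[0., 2.]`: two copies of the first column, minus one copy of the second, plus one copy of the third.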

The most important property of such a map is its rank, because rank is a hard ceiling on what the map can express. The rank of $W$ is the dimension of the space its outputs can actually fill, and a matrix of rank $r$ can only ever produce vectors lying in some $r$-dimensional subspace of its output space, no matter how many distinct inputs you feed it. This is the entire premise behind LoRA: instead of updating a full weight matrix during fine-tuning, you add a low-rank correction.

$$\Delta W = B A, \qquad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k)$$

Because a product through an $r$-dimensional bottleneck can have rank at most $r$, this update is forced to be low-rank by construction. The bet, which holds up empirically, is that the useful part of a fine-tuning update already lives in a small subspace, so constraining it there costs almost nothing while cutting the number of trainable parameters from $dk$ to $r(d + k)$.
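A short sketch makes the rank ceiling and the parameter saving concrete. The sizes `d`, `k`, and `r` are hypothetical, and both factors are filled with random values purely to show the rank bound. LoRA as usually described initializes one factor to zero so the update starts at zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes with r much smaller than min(d, k).
d, k, r = 64, 48, 4

# Low-rank factors. Random values are for illustration only.
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))

# The update has the full shape (d, k) but passes through an r-dimensional bottleneck.
delta_W = B @ A

# Numerical rank of the update; it cannot exceed r.
rank = np.linalg.matrix_rank(delta_W)

# Trainable parameter counts: full update versus the two factors.
full_params = d * k           # 64 * 48 = 3072
lora_params = d * r + r * k   # 256 + 192 = 448
```

Even at this toy scale the factored form trains about a seventh of the numbers, and the gap widens as `d` and `k` grow with `r` held fixed.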

Multiplication by a matrix is one way vectors interact; the dot product is the other, and it measures alignment. For two vectors the dot product factors into their lengths and the cosine of the angle between them, which means it is large when they point the same way and small or negative when they do not.

$$a \cdot b = \lVert a \rVert \, \lVert b \rVert \cos\theta$$

This is exactly the machinery of attention: the score $Q_i \cdot K_j / \sqrt{d_k}$ is a scaled dot product between a query and a key, so a high score means the query and key vectors are aligned, which after the softmax becomes a high attention weight. Attention is, at bottom, the geometry of which keys point in the same direction as which queries.
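The score computation can be sketched in a few lines. The query and key matrices here are random placeholders with assumed sizes, and the sketch deliberately leaves out causal masking, multiple heads, and the value projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 tokens, key/query dimension 8.
n_tokens, d_k = 3, 8
Q = rng.normal(size=(n_tokens, d_k))   # one query vector per row
K = rng.normal(size=(n_tokens, d_k))   # one key vector per row

# scores[i, j] = Q_i . K_j / sqrt(d_k)
scores = Q @ K.T / np.sqrt(d_k)

# Row-wise softmax, subtracting each row's max for numerical stability.
shifted = scores - scores.max(axis=-1, keepdims=True)
exp_scores = np.exp(shifted)
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Cosine of the angle between one query and one key, from the dot-product identity.
cos_01 = Q[0] @ K[1] / (np.linalg.norm(Q[0]) * np.linalg.norm(K[1]))
```

Each row of `weights` sums to 1, and within a row the largest weight goes to the key with the largest score for that query.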

Where do all these transformed vectors go? Into the residual stream, and the way they accumulate there is itself a linear-algebraic fact worth stating plainly. At layer $l$ the stream is a vector $x_l \in \mathbb{R}^d$, and each sublayer does not replace it but adds to it.

$$x_{l+1} = x_l + \text{Sublayer}(x_l)$$

Because every layer reads from and writes to the same $d$-dimensional space, the residual stream behaves like a shared communication channel that runs the full depth of the network. Early layers can deposit information that much later layers retrieve, which is part of why residual connections matter so much for training stability and why this picture is central to understanding how transformers actually train.
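The additive update can be unrolled to show that the final stream is the initial vector plus the sum of everything the sublayers wrote. The `sublayer` function below is a hypothetical stand-in (a small linear map followed by `tanh`), not a real attention or feed-forward block, and normalization layers are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical width and depth.
d, n_layers = 8, 4

# One small random matrix per layer, standing in for trained sublayer weights.
layer_weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_layers)]

def sublayer(x, W):
    """Illustrative stand-in for a sublayer: linear map, then tanh."""
    return np.tanh(W @ x)

x0 = rng.normal(size=d)   # initial residual stream (e.g. a token embedding)
x = x0.copy()
writes = []               # what each layer adds to the stream

for W in layer_weights:
    update = sublayer(x, W)
    writes.append(update)
    x = x + update        # add to the stream, never overwrite it

# The final stream is the starting vector plus the sum of all writes.
reconstructed = x0 + sum(writes)
```

Because every write lands in the same `d`-dimensional space, the contribution of any single layer stays present in `x` unless a later layer writes something that cancels it.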

To analyze what a matrix is doing, it helps to find the directions it treats most simply. For a symmetric matrix $A$ there is a basis in which the action of $A$ is pure scaling, with no rotation and no mixing, and that basis is given by the eigendecomposition.

$$A = Q \Lambda Q^{\top}$$

Here $Q$ is orthogonal, its columns are the eigenvectors, and $\Lambda$ is diagonal with the eigenvalues: along each eigenvector the matrix simply stretches by the corresponding eigenvalue and does nothing else. Covariance matrices have exactly this form, and so does the Fisher information matrix that describes the local curvature of the loss. That is why eigendecomposition is the right lens for understanding how adaptive optimizers like Adam implicitly approximate the natural gradient, a thread we pick up in Part 3.
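A quick numerical check of the decomposition, using a sample covariance matrix as the symmetric example. The data are random and the sizes are arbitrary assumptions. `np.linalg.eigh` is NumPy's routine for symmetric (Hermitian) matrices and returns eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples of a 3-dimensional variable.
X = rng.normal(size=(200, 3))

# Sample covariance matrix: symmetric, shape (3, 3).
A = np.cov(X, rowvar=False)

# Eigendecomposition for symmetric matrices.
# eigvals is ascending; the columns of Q are the matching eigenvectors.
eigvals, Q = np.linalg.eigh(A)

# Rebuild A from its factors: A = Q diag(eigvals) Q^T.
A_rebuilt = Q @ np.diag(eigvals) @ Q.T

# Along an eigenvector, A only scales by the corresponding eigenvalue.
v = Q[:, -1]        # eigenvector with the largest eigenvalue
Av = A @ v
scaled_v = eigvals[-1] * v
```

Here `Av` and `scaled_v` agree up to floating-point error, which is the "pure scaling" statement in numerical form.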

For a general, non-square matrix there is no eigendecomposition, but there is something just as powerful: the singular value decomposition factors any matrix into a rotation, a scaling, and another rotation (strictly, the two orthogonal factors may also include a reflection).

$$W = U \Sigma V^{\top}$$

The diagonal entries of $\Sigma$ are the singular values, and they say how much $W$ stretches the input along each of its principal directions; the directions with the largest singular values carry most of the matrix's "energy." Keeping only the top $r$ singular directions gives the best possible rank-$r$ approximation of $W$, which is both the formal justification for why low-rank fine-tuning like LoRA can work and the reason trained weight matrices so often turn out to be approximately low-rank in practice: a few directions do most of the work.
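Truncating the SVD can be sketched as follows. The matrix is a synthetic assumption, built as a rank-3 product plus small noise so that it is approximately low-rank by design. It is not a real trained weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "weight matrix": rank-3 structure plus a little noise.
m, n, true_rank = 40, 30, 3
W = rng.normal(size=(m, true_rank)) @ rng.normal(size=(true_rank, n))
W = W + 0.01 * rng.normal(size=(m, n))

# Thin SVD: W = U @ diag(s) @ Vt, with s sorted from largest to smallest.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Keep only the top r singular directions.
r = 3
W_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Frobenius-norm error of the truncation...
err = np.linalg.norm(W - W_r)
# ...equals the root of the sum of squares of the discarded singular values.
expected_err = np.sqrt(np.sum(s[r:] ** 2))

# Fraction of squared singular-value "energy" retained by the top r directions.
energy_kept = np.sum(s[:r] ** 2) / np.sum(s ** 2)
```

For this synthetic matrix nearly all of the energy sits in the first three singular values, so `W_r` is almost indistinguishable from `W` while being described by far fewer numbers.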

Once you can see a matrix as something that rotates, scales, and projects vectors, the architecture stops being a wall of symbols and becomes a sequence of geometric operations you can reason about. The next question is what those vectors represent when the model is uncertain, and probability is the language of uncertainty, which is where Part 2 goes.

Cite this work


Roy, Swastik. (2024). Linear Algebra for LLMs: Vectors, Matrices, and What They Do. S. Roy. https://swastikroy.me/blog/math-llm-linear-algebra
