Linear Algebra in Neural Networks
Every layer of a neural network is a matrix multiplication followed by a nonlinearity. Understanding what these matrices do geometrically — how they stretch, rotate, and project — explains why deep learning works.
Everything we've built so far — vectors, matrix transformations, eigendecompositions, SVD — lives inside every modern neural network. This post makes those connections explicit.
Linear Layers are Matrix Multiplications
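A fully connected layer transforms an input $\mathbf{x} \in \mathbb{R}^n$ to an output $\mathbf{y} \in \mathbb{R}^m$:
$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$
where $W \in \mathbb{R}^{m \times n}$ is the weight matrix and $\mathbf{b} \in \mathbb{R}^m$ is a bias vector.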
Geometrically, $W$ does everything a matrix can do: rotate, scale, project, expand. The bias $\mathbf{b}$ shifts the result.
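Why nonlinearities? Without them, composing linear layers collapses to a single linear layer:
$$W_2(W_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (W_2 W_1)\mathbf{x} + (W_2\mathbf{b}_1 + \mathbf{b}_2)$$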
No matter how many layers, a linear-only network is just one matrix multiplication. Nonlinearities like ReLU, GELU, and Swish break this — they introduce bends in the decision boundary that let the network represent non-linear functions.
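The collapse is easy to check numerically. The following is a minimal NumPy sketch, not code from this series: the layer sizes, the random seed, and names such as `W_eff` and `b_eff` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers: R^3 -> R^5 -> R^2 (sizes chosen arbitrarily).
W1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((2, 5)), rng.standard_normal(2)
x = rng.standard_normal(3)

# Forward pass through both layers with no nonlinearity in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as a single linear layer.
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2
one_layer = W_eff @ x + b_eff

print(np.allclose(two_layers, one_layer))

# Inserting a ReLU between the layers removes this equivalence in general:
# the output now depends on which hidden units are active for this x.
with_relu = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2
print(with_relu, one_layer)
```

The first print confirms that the two-layer stack and the single merged layer agree to floating-point precision. The second shows the ReLU version next to the merged layer for comparison.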
Figure (interactive in the original): linear layer, y = Wx + b.
The SVD of Weight Matrices
Every weight matrix has an SVD (see Post 9):
$$W = U \Sigma V^\top$$
This reveals a clean geometric story:
- $V^\top$ rotates the input into a canonical frame
- $\Sigma$ scales each dimension independently (with $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$)
- $U$ rotates into the output space
The singular values tell you which directions in the input are amplified and which are suppressed. Large $\sigma_i$ → that direction matters a lot. Small $\sigma_i$ → nearly ignored.
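To see the amplification story in numbers, here is a small sketch using `numpy.linalg.svd`. The matrix is random and its 3×4 shape is an arbitrary assumption; it is not a trained weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))  # stand-in for a weight matrix

# Reduced SVD: U is 3x3, s holds 3 singular values, Vt is 3x4.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# The three factors multiply back to W.
W_rec = U @ np.diag(s) @ Vt

# Feeding in a right singular vector (a row of Vt) scales it by the
# matching singular value: the norm of W @ v_i equals s[i].
gain_top = np.linalg.norm(W @ Vt[0])    # most amplified input direction
gain_low = np.linalg.norm(W @ Vt[-1])   # least amplified of these three

print(s)
print(gain_top, gain_low)
```

NumPy returns the singular values sorted in descending order, so `Vt[0]` is always the input direction the layer amplifies most.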
The Attention Mechanism
The attention mechanism in Transformers is pure linear algebra.
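Given an input sequence of $n$ tokens, each embedded as a $d$-dimensional vector and stacked as the rows of $X \in \mathbb{R}^{n \times d}$, the attention layer first applies three linear projections:
$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$
where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ are the query, key, and value projection matrices.
The attention scores are:
$$A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)$$
$QK^\top$ is a matrix of dot products: entry $(i, j)$ measures how similar token $i$'s query is to token $j$'s key. The factor $1/\sqrt{d_k}$ prevents the dot products from growing too large (which would make softmax very peaked).
The output is the weighted sum of values:
$$\mathrm{Output} = AV$$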
Each output token is a linear combination of all value vectors, weighted by attention.
Inner product interpretation: $QK^\top$ is computing $n^2$ inner products simultaneously. High dot product = high similarity = high attention weight. The attention matrix $A$ is row-stochastic (rows sum to 1 after softmax), so each output token is a convex combination of value vectors.
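The whole mechanism fits in a few lines of NumPy. This is a single-head sketch under simplifying assumptions (no masking, no batching, random projection matrices); the function names `softmax` and `attention` and all the sizes are illustrative, not part of any particular library.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum first for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over the rows of X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # (n, n) attention weights
    return A @ V, A

rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4                      # tokens, embedding dim, key dim
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d_k)) for _ in range(3))

out, A = attention(X, W_q, W_k, W_v)
print(A.sum(axis=1))   # each row of A sums to 1
print(out.shape)
```

Because every row of `A` is non-negative and sums to 1, each row of `out` lies in the convex hull of the value vectors.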
Figure (interactive in the original): attention mechanism.
Why Depth Works
A single linear unit can only split the input space into two half-spaces separated by a hyperplane. Composing linear layers with piecewise-linear nonlinearities such as ReLU can create up to $2^N$ linear regions in the input space (where $N$ is the number of neurons).
More precisely: a network with $L$ layers of width $n$ can represent functions with exponentially more "bends" than any shallow network with the same number of parameters. This is why depth, rather than width, is often the key to expressiveness.
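A one-dimensional toy case makes the exponential growth concrete. The sketch below uses a standard "tent map" construction, which is an assumption brought in for illustration and not something defined in this series. Each layer is two ReLU units that fold the interval [0, 1] onto itself, and stacking $k$ such layers gives $2^k$ linear pieces.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent(x):
    # One hidden layer with two ReLU units: rises from 0 to 1 on [0, 0.5],
    # then falls back to 0 on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_pieces(depth, grid_size=2**12):
    """Count linear pieces of the depth-fold tent composition on [0, 1]."""
    x = np.linspace(0.0, 1.0, grid_size + 1)
    y = x
    for _ in range(depth):
        y = tent(y)
    # The slope is constant within a linear piece, so count slope changes.
    slopes = np.diff(y) / np.diff(x)
    changes = ~np.isclose(slopes[1:], slopes[:-1])
    return 1 + int(changes.sum())

pieces = [count_linear_pieces(k) for k in range(1, 6)]
print(pieces)  # [2, 4, 8, 16, 32]
```

Depth 5 uses 10 ReLU units and produces 32 pieces. A single hidden layer with the same 10 units can produce at most 11 pieces in one dimension, since each unit contributes at most one breakpoint.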
LoRA: Low-Rank Adaptation
Large language models have billions of parameters, so fine-tuning all of them after pretraining is expensive.
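LoRA (Low-Rank Adaptation) exploits the observation that fine-tuning updates tend to be low-rank. Instead of updating $W$ directly, we freeze the pretrained weights $W_0$ and parameterize the update as:
$$W = W_0 + \Delta W = W_0 + AB$$
where $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times d}$ with $r \ll d$.
The number of trainable parameters drops from $d^2$ to $2dr$, a factor of $d/(2r)$. For any $d$ and $r$ with $d/r = 256$:
- Full update: $d^2$ parameters
- LoRA: $2dr$ parameters, 128× fewer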
This is a direct application of the low-rank matrix approximation idea from Post 9. LoRA works because fine-tuning updates explore a low-dimensional subspace of the full parameter space.
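The parameter arithmetic and the rank constraint can both be checked directly. In this sketch, `d = 4096` and `r = 16` are illustrative values assumed here because they reproduce the 128× reduction quoted above; the smaller `d_small` and `r_small` are likewise arbitrary and only keep the rank check cheap.

```python
import numpy as np

# Parameter counts for a square d x d weight matrix (assumed sizes).
d, r = 4096, 16
full_params = d * d          # every entry of the update is trainable
lora_params = 2 * d * r      # A is d x r, B is r x d
ratio = full_params // lora_params
print(full_params, lora_params, ratio)  # 16777216 131072 128

# The product AB can never have rank above r, whatever A and B contain.
rng = np.random.default_rng(0)
d_small, r_small = 64, 4
A = rng.standard_normal((d_small, r_small))
B = rng.standard_normal((r_small, d_small))
delta_W = A @ B
rank = np.linalg.matrix_rank(delta_W)
print(delta_W.shape, rank)
```

The update `delta_W` has the full 64×64 shape, yet it is built from only `2 * 64 * 4` numbers and has rank at most 4.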
Figure: LoRA, W ≈ W₀ + AB (low-rank update).
Gradient Flow Through Layers
Backpropagation through a linear layer is itself a matrix multiplication.
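If the loss is $\mathcal{L}$ and $\mathbf{y} = W\mathbf{x} + \mathbf{b}$, then by the chain rule:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = W^\top \frac{\partial \mathcal{L}}{\partial \mathbf{y}}, \qquad \frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}}\,\mathbf{x}^\top, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}}$$
The gradient with respect to the input passes through $W^\top$, the transpose of the forward weight matrix. Backpropagation is exactly the chain rule of matrix calculus.
This is why the condition number of $W$ matters: a badly conditioned $W$ amplifies gradients in some directions and crushes them in others. Compounded across many layers, this contributes to the exploding/vanishing gradient problem.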
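A finite-difference check is a quick way to confirm that the input gradient really is the upstream gradient multiplied by the transposed weight matrix. The sketch below assumes a squared-error loss against a random `target`; that loss, the layer sizes, and the variable names are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)
target = rng.standard_normal(m)

def loss(x_in):
    # Assumed loss for this sketch: half the squared error of the layer output.
    y = W @ x_in + b
    return 0.5 * np.sum((y - target) ** 2)

# Analytic backward pass through y = W x + b.
y = W @ x + b
grad_y = y - target              # dL/dy for the squared-error loss
grad_x = W.T @ grad_y            # dL/dx: multiply by the transpose of W
grad_W = np.outer(grad_y, x)     # dL/dW: outer product of dL/dy and x
grad_b = grad_y                  # dL/db: the upstream gradient unchanged

# Central finite differences along each coordinate of x.
eps = 1e-6
num_grad_x = np.array([
    (loss(x + eps * e) - loss(x - eps * e)) / (2 * eps) for e in np.eye(n)
])
print(np.allclose(grad_x, num_grad_x, atol=1e-5))

# Ratio of largest to smallest singular value of W.
print(np.linalg.cond(W))
```

The condition number printed at the end is the ratio between the most and least amplified gradient directions for this one layer.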
Summary
| Neural network concept | Linear algebra concept |
|---|---|
| Linear layer | Matrix-vector multiplication |
| Layer composition (no nonlinearity) | Matrix product |
| Expressiveness of deep nets | Piecewise linear regions via nonlinear folding |
| Attention scores | Gram matrix of inner products |
| Attention output | Weighted average (row of stochastic matrix × $V$) |
| Weight matrix structure | SVD: rotation, scale, rotation |
| LoRA fine-tuning | Low-rank matrix decomposition |
| Backpropagation | Chain rule via matrix transpose |
Neural networks are linear algebra engines with nonlinear hinges inserted. Understanding the linear algebra gives you the geometry; understanding the nonlinearities gives you the expressiveness.
In Post 13 we synthesize everything from the whole series into one unified picture.