S. Roy


Linear Algebra in Neural Networks

Every layer of a neural network is a matrix multiplication followed by a nonlinearity. Understanding what these matrices do geometrically (how they stretch, rotate, and project) goes a long way toward explaining why deep learning works.


Everything we've built so far — vectors, matrix transformations, eigendecompositions, SVD — lives inside every modern neural network. This post makes those connections explicit.


Linear Layers are Matrix Multiplications

A fully connected layer transforms an input $\mathbf{x} \in \mathbb{R}^n$ to an output $\mathbf{y} \in \mathbb{R}^m$:

$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$

where $W \in \mathbb{R}^{m \times n}$ is the weight matrix and $\mathbf{b} \in \mathbb{R}^m$ is a bias vector.

Geometrically, $W$ does everything a matrix can do: rotate, scale, project, expand. The bias $\mathbf{b}$ shifts the result.
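As a minimal sketch of the formula above, here is a fully connected layer in NumPy. The weights, bias, and input are hypothetical values picked for illustration (they are not from this post), with $n = 3$ and $m = 2$.

```python
import numpy as np

def linear_layer(W, b, x):
    """Apply a fully connected layer: y = W x + b."""
    return W @ x + b

# Hypothetical weights mapping R^3 -> R^2 (m = 2, n = 3).
W = np.array([[1.0, 0.0, 2.0],
              [0.0, -1.0, 0.5]])
b = np.array([0.5, -0.5])
x = np.array([1.0, 2.0, 3.0])

y = linear_layer(W, b, x)
# Row 1: 1*1 + 0*2 + 2*3 + 0.5 = 7.5
# Row 2: 0*1 - 1*2 + 0.5*3 - 0.5 = -1.0
```

The shape of `W` alone fixes the input and output dimensions: a $2 \times 3$ matrix takes 3-vectors to 2-vectors.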

Why nonlinearities? Without them, composing linear layers collapses to a single linear layer:

$$(W_3 W_2 W_1)\mathbf{x} = W_{\text{eff}}\mathbf{x}$$

No matter how many layers, a linear-only network is just one matrix multiplication. Nonlinearities like ReLU, GELU, and Swish break this — they introduce bends in the decision boundary that let the network represent non-linear functions.
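The collapse can be checked numerically. The sketch below uses small hand-picked matrices (hypothetical, chosen so the arithmetic is easy to follow) and shows that three stacked linear layers equal a single matrix, while inserting ReLU breaks a property every linear map has, namely $f(-\mathbf{x}) = -f(\mathbf{x})$.

```python
import numpy as np

# Hypothetical weights, chosen so the arithmetic is easy to follow by hand.
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])
W2 = np.array([[1.0, -1.0],
               [1.0, 1.0]])
W3 = np.array([[1.0, 1.0]])
x = np.array([1.0, 2.0])

# Without nonlinearities, three layers equal one matrix W_eff = W3 W2 W1.
W_eff = W3 @ W2 @ W1
deep_linear = W3 @ (W2 @ (W1 @ x))

def relu(z):
    return np.maximum(z, 0.0)

def relu_net(v):
    """The same three layers with ReLU after the first two."""
    return W3 @ relu(W2 @ relu(W1 @ v))

out_pos = relu_net(x)    # [3.0]
out_neg = relu_net(-x)   # [0.0], not -3.0, so the map is no longer linear
```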

[Interactive figure: Linear Layer, $\mathbf{y} = W\mathbf{x} + \mathbf{b}$. Example state: $\mathbf{x} = (0.80, 0.60)$, $W\mathbf{x} + \mathbf{b} = (1.50, 0.32)$.]

The SVD of Weight Matrices

Every weight matrix $W$ has an SVD (see Post 9):

$$W = U \Sigma V^\top$$

This reveals a clean geometric story:

  1. $V^\top$ rotates the input into a canonical frame
  2. $\Sigma$ scales each dimension independently (with $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$)
  3. $U$ rotates into the output space

The singular values tell you which directions in the input are amplified and which are suppressed. Large $\sigma_i$ → that direction matters a lot. Small $\sigma_i$ → nearly ignored.
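The three steps above can be sketched with `np.linalg.svd`. The weight matrix here is a hypothetical $2 \times 3$ example, not one taken from a trained network.

```python
import numpy as np

# Hypothetical 2x3 weight matrix (maps R^3 -> R^2).
W = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

# Reduced SVD: U is 2x2, s holds the singular values, Vt is 2x3.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# NumPy returns singular values in descending order: here sqrt(5) and 1.
W_rebuilt = U @ np.diag(s) @ Vt

# The first right-singular vector is the most amplified input direction:
# a unit vector along it comes out with length sigma_1.
v1 = Vt[0]
gain = np.linalg.norm(W @ v1)
```

Because this $W$ has only two nonzero singular values but three input dimensions, one input direction is mapped to zero, which is the extreme case of a suppressed direction.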


The Attention Mechanism

The attention mechanism in Transformers is almost entirely linear algebra: matrix products and inner products, with a single softmax in between.

Given an input sequence of $n$ tokens, each embedded as a $d$-dimensional vector and stacked as the rows of $X \in \mathbb{R}^{n \times d}$, the attention layer first applies three linear projections:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$

where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ are the query, key, and value projection matrices.

The attention weights are:

$$A = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)$$

$QK^\top \in \mathbb{R}^{n \times n}$ is a matrix of dot products: entry $(i, j)$ measures how similar token $i$'s query is to token $j$'s key. The $\sqrt{d_k}$ factor prevents the dot products from growing too large (which would make softmax very peaked).
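To see why the scaling helps, here is a rough numerical sketch. It assumes query and key entries are independent with mean 0 and variance 1 (an idealization, not something stated in this post); under that assumption the variance of a dot product grows like $d_k$, and dividing by $\sqrt{d_k}$ brings it back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
num_pairs = 10_000

# Assumption: query and key entries are independent, mean 0, variance 1.
q = rng.standard_normal((num_pairs, d_k))
k = rng.standard_normal((num_pairs, d_k))

raw = np.sum(q * k, axis=1)    # unscaled dot products, one per (q, k) pair
scaled = raw / np.sqrt(d_k)    # scaled as in the attention formula

raw_var = raw.var()            # close to d_k
scaled_var = scaled.var()      # close to 1
```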

The output is the weighted sum of values:

$$\text{Attention}(Q, K, V) = A \cdot V$$

Each output token is a linear combination of all value vectors, weighted by attention.

Inner product interpretation: $QK^\top$ is computing $n^2$ inner products simultaneously. High dot product = high similarity = high attention weight. The attention matrix $A$ is row-stochastic (rows sum to 1 after softmax), so each output token is a convex combination of value vectors.

[Interactive figure: Attention Mechanism (click a token).]

Attention weights for "cat": "cat" 0.381, "sat" 0.283, "mat" 0.336

Q (queries): "cat": [0.80, 0.20]; "sat": [0.10, 0.90]; "mat": [0.50, 0.50]

K (keys): "cat": [0.90, 0.10]; "sat": [0.20, 0.80]; "mat": [0.60, 0.40]

Output for "cat": [0.549, 0.451] (weighted sum of V rows)
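The numbers in the example above can be reproduced with a short sketch of scaled dot-product attention. The Q and K rows are the ones listed for "cat", "sat", and "mat". The V matrix is not listed in the post, so the values below are an assumption; they are used only because they happen to reproduce the displayed output.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))
    return A, A @ V

# Rows are the tokens "cat", "sat", "mat"; Q and K are from the example above.
Q = np.array([[0.8, 0.2], [0.1, 0.9], [0.5, 0.5]])
K = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
# Assumed values: V is not given in the post.
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])

A, out = attention(Q, K, V)
cat_weights = np.round(A[0], 3)   # [0.381, 0.283, 0.336]
cat_output = np.round(out[0], 3)  # [0.549, 0.451] with the assumed V
```

Each row of `A` sums to 1, so every output row is a convex combination of the rows of `V`.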

Why Depth Works

A single linear layer followed by a threshold can only split the input space into two half-spaces along a hyperplane. Composing $L$ linear layers with nonlinearities can create up to $O(2^n)$ linear regions in the input space (where $n$ here is the total number of neurons, not the input dimension).

More precisely: there are functions that a network with $L$ layers of width $w$ can represent with exponentially more "bends" (in $L$) than any shallow network with the same number of parameters. This is why depth, rather than width, is often the key to expressiveness.
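One standard way to see the exponential growth is a "tent" map built from two ReLU neurons and composed with itself. This construction is an illustration brought in here, not something taken from this post. Each extra layer folds the interval $[0, 1]$ once more, so the number of linear pieces doubles with depth.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent(x):
    """Two-ReLU layer: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_pieces(depth, grid_size=2**12 + 1):
    """Count linear pieces of tent composed `depth` times on [0, 1].

    Pieces are counted as (slope sign changes + 1) on a dyadic grid.
    This works here because adjacent pieces have opposite slopes and
    every breakpoint lands exactly on a grid point (for depth <= 12).
    """
    x = np.linspace(0.0, 1.0, grid_size)
    y = x
    for _ in range(depth):
        y = tent(y)
    slopes = np.sign(np.diff(y))
    return int(np.sum(slopes[1:] != slopes[:-1])) + 1

pieces = [count_linear_pieces(d) for d in range(1, 6)]  # [2, 4, 8, 16, 32]
```

At depth 5 the network uses 10 ReLU neurons and produces 32 linear pieces. A single hidden layer of $n$ ReLU neurons on a scalar input has at most $n$ breakpoints, hence at most $n + 1$ pieces, so it would need 31 neurons to match.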


LoRA: Low-Rank Adaptation

After training, large language models have billions of parameters. Fine-tuning all of them is expensive.

LoRA (Low-Rank Adaptation) exploits the observation that fine-tuning updates tend to be low-rank. Instead of updating $W \in \mathbb{R}^{m \times n}$ directly, we parameterize the update as:

$$W = W_0 + AB$$

where $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$ with $r \ll \min(m, n)$.

The number of parameters drops from $mn$ to $r(m + n)$. For $m = n = 4096$ and $r = 16$:

  • Full update: $4096^2 = 16{,}777{,}216 \approx 16.8\text{M}$ parameters
  • LoRA: $16 \times (4096 + 4096) = 131{,}072 \approx 131\text{K}$ parameters, 128× fewer

This is a direct application of the low-rank matrix approximation idea from Post 9. LoRA works well in practice because fine-tuning updates tend to lie in a low-dimensional subspace of the full parameter space.

[Interactive figure: LoRA, $W \approx W_0 + AB$ (low-rank update). For a $12 \times 16$ matrix: 192 full parameters, 56 LoRA parameters at $r = 2$, 71% parameter savings.]
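The parameter counts above are easy to check in code. The sketch below recomputes both the $4096 \times 4096$, $r = 16$ case and the $12 \times 16$, $r = 2$ case, then builds a random low-rank update to confirm that $AB$ never exceeds rank $r$. The random matrices are stand-ins, not real model weights, and `lora_param_counts` is a hypothetical helper name.

```python
import numpy as np

def lora_param_counts(m, n, r):
    """Return (full, lora) trainable parameter counts for an m x n update."""
    return m * n, r * (m + n)

full, lora = lora_param_counts(4096, 4096, 16)
# full = 16_777_216, lora = 131_072, ratio = 128

small_full, small_lora = lora_param_counts(12, 16, 2)
savings = 1 - small_lora / small_full   # 1 - 56/192, about 0.71

# The update AB can never exceed rank r, whatever A and B contain.
rng = np.random.default_rng(0)
A = rng.standard_normal((12, 2))
B = rng.standard_normal((2, 16))
W0 = rng.standard_normal((12, 16))      # stands in for frozen pretrained weights
W = W0 + A @ B                          # adapted weights, same shape as W0
update_rank = np.linalg.matrix_rank(A @ B)
```

Only `A` and `B` would be trained in this setup; `W0` stays fixed, which is where the savings come from.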

Gradient Flow Through Layers

Backpropagation through a linear layer $\mathbf{y} = W\mathbf{x}$ is itself a matrix multiplication.

If the loss is $\ell$ and $\frac{\partial \ell}{\partial \mathbf{y}} = \boldsymbol{\delta}$, then by the chain rule:

$$\frac{\partial \ell}{\partial \mathbf{x}} = W^\top \boldsymbol{\delta}, \qquad \frac{\partial \ell}{\partial W} = \boldsymbol{\delta} \mathbf{x}^\top$$

The gradient with respect to the input passes through $W^\top$, the transpose of the forward weight matrix. Backpropagation is exactly the chain rule of matrix calculus.
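Both gradient formulas can be verified against finite differences. The sketch below assumes a half squared-error loss with a made-up target, purely for illustration; any differentiable loss would work as long as $\boldsymbol{\delta}$ is its gradient with respect to $\mathbf{y}$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
target = rng.standard_normal(3)   # hypothetical regression target

def loss(W, x):
    """Assumed loss for illustration: half squared error of y = W x."""
    y = W @ x
    return 0.5 * np.sum((y - target) ** 2)

# Analytic gradients from the formulas above.
delta = W @ x - target            # dl/dy for this particular loss
grad_x = W.T @ delta              # dl/dx = W^T delta
grad_W = np.outer(delta, x)       # dl/dW = delta x^T

# Central finite differences as an independent check.
eps = 1e-6
num_grad_x = np.zeros_like(x)
for i in range(x.size):
    step = np.zeros_like(x)
    step[i] = eps
    num_grad_x[i] = (loss(W, x + step) - loss(W, x - step)) / (2 * eps)

num_grad_W = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        step = np.zeros_like(W)
        step[i, j] = eps
        num_grad_W[i, j] = (loss(W + step, x) - loss(W - step, x)) / (2 * eps)
```

Note the shapes: `grad_x` has the shape of `x` and `grad_W` has the shape of `W`, which is a quick sanity check on where the transpose belongs.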

This is why the condition number of $W$ matters: a badly conditioned $W$ amplifies gradients in some directions and crushes them in others, which contributes to the exploding/vanishing gradient problem when the effect is repeated across many layers.
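A toy sketch of this effect, using a hypothetical diagonal weight matrix with singular values 2 and 0.5 and ignoring nonlinearities: after backpropagating through 10 identical layers, one gradient component has exploded and the other has all but vanished.

```python
import numpy as np

# Hypothetical weight matrix with singular values 2 and 0.5.
W = np.diag([2.0, 0.5])
cond = np.linalg.cond(W)   # sigma_max / sigma_min = 4

# Backpropagate a gradient through 10 identical linear layers.
delta = np.array([1.0, 1.0])
for _ in range(10):
    delta = W.T @ delta
# delta is now [2**10, 0.5**10] = [1024, about 0.00098]
```

A condition number of 4 looks harmless for one layer, but over 10 layers the ratio between the two components is $4^{10}$, roughly a million.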


Summary

| Neural network concept | Linear algebra concept |
| --- | --- |
| Linear layer $W\mathbf{x} + \mathbf{b}$ | Matrix-vector multiplication |
| Layer composition (no nonlinearity) | Matrix product |
| Expressiveness of deep nets | Piecewise linear regions via nonlinear folding |
| Attention scores $QK^\top$ | Gram matrix of inner products |
| Attention output | Weighted average (row of stochastic matrix × $V$) |
| Weight matrix structure | SVD: rotation, scale, rotation |
| LoRA fine-tuning | Low-rank matrix decomposition |
| Backpropagation | Chain rule via matrix transpose |

Neural networks are linear algebra engines with nonlinear hinges inserted. Understanding the linear algebra gives you the geometry; understanding the nonlinearities gives you the expressiveness.

In Post 13 we synthesize everything from the whole series into one unified picture.

Cite this work


Roy, Swastik. (2026). Linear Algebra in Neural Networks. S. Roy. https://swastikroy.me/blog/linear-algebra-neural-networks
