Linear Algebra in Neural Networks

Swastik Roy

Every layer of a neural network is a matrix multiplication followed by a nonlinearity. Understanding what these matrices do geometrically — how they stretch, rotate, and project — explains why deep learning works.

July 2, 2026 · 5 min read

linear-algebra neural-networks transformers attention deep-learning

Everything we've built so far — vectors, matrix transformations, eigendecompositions, SVD — lives inside every modern neural network. This post makes those connections explicit.

Linear Layers are Matrix Multiplications

A fully connected layer transforms an input $\mathbf{x} \in \mathbb{R}^n$ to an output $\mathbf{y} \in \mathbb{R}^m$:

$\mathbf{y} = W\mathbf{x} + \mathbf{b}$

where $W \in \mathbb{R}^{m \times n}$ is the weight matrix and $\mathbf{b} \in \mathbb{R}^m$ is a bias vector.

Geometrically, $W$ does everything a matrix can do: rotate, scale, project, expand. The bias $\mathbf{b}$ shifts the result.
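
As a minimal sketch of the formula above, here is a fully connected layer in plain Python. The weights are the ones shown in this post's interactive demo; the function name `linear_layer` and the input vector are illustrative assumptions.

```python
def linear_layer(W, x, b):
    """Return y = W x + b for an m-by-n matrix W stored as a list of rows."""
    if any(len(row) != len(x) for row in W) or len(b) != len(W):
        raise ValueError("shape mismatch: W is m x n, x has length n, b has length m")
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


# Weights and bias from the demo in this post; x is an arbitrary input.
W = [[1.5, 0.5],
     [-0.5, 1.2]]
b = [0.0, 0.0]
x = [1.0, 1.0]

y = linear_layer(W, x, b)
print(y)  # approximately [2.0, 0.7]
```

Each output entry is one row of $W$ dotted with $\mathbf{x}$, plus the matching bias entry.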

Why nonlinearities? Without them, composing linear layers collapses to a single linear layer:

$(W_3 W_2 W_1)\mathbf{x} = W_{\text{eff}}\mathbf{x}$

No matter how many layers, a linear-only network is just one matrix multiplication (with biases, one affine map). Nonlinearities like ReLU, GELU, and Swish break this: they introduce bends in the decision boundary that let the network represent nonlinear functions.
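
The collapse is easy to check numerically. The sketch below uses three arbitrary 2×2 weight matrices (illustrative values, biases omitted) and compares the layer-by-layer result with a single precomputed $W_{\text{eff}}$, then repeats the forward pass with ReLU between layers.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def matvec(A, x):
    """Multiply matrix A by vector x."""
    return [sum(a * x_j for a, x_j in zip(row, x)) for row in A]


def relu(v):
    """Elementwise ReLU."""
    return [max(0.0, t) for t in v]


# Three arbitrary layers, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[0.0, 1.0], [-1.0, 2.0]]
W3 = [[2.0, 0.0], [1.0, -1.0]]
x = [1.0, 1.0]

# Linear-only network: layer by layer versus one precomputed matrix.
layered = matvec(W3, matvec(W2, matvec(W1, x)))
W_eff = matmul(W3, matmul(W2, W1))
collapsed = matvec(W_eff, x)

# Same layers with ReLU in between: no longer equal to W_eff x.
with_relu = matvec(W3, relu(matvec(W2, relu(matvec(W1, x)))))

print(layered)    # [3.0, -2.5]
print(collapsed)  # [3.0, -2.5]
print(with_relu)  # [3.0, -1.5]
```

The ReLU zeroes the negative entry produced by the first layer, so the rest of the network sees a different vector than the purely linear version does.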

[Interactive demo: Linear Layer, y = Wx + b, with W = [[1.5, 0.5], [-0.5, 1.2]] and b = (0.0, 0.0)]

The SVD of Weight Matrices

Every weight matrix $W$ has an SVD (see Post 9):

$W = U \Sigma V^\top$

This reveals a clean geometric story:

$V^\top$ rotates (or reflects) the input into a canonical frame
$\Sigma$ scales each dimension independently (with $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$)
$U$ rotates (or reflects) the result into the output space

The singular values tell you which directions in the input are amplified and which are suppressed. Large $\sigma_i$ → that direction matters a lot. Small $\sigma_i$ → nearly ignored.
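
To make the singular values concrete, the sketch below computes them for the 2×2 demo matrix from earlier in this post. It relies on the fact that the singular values of $W$ are the square roots of the eigenvalues of $W^\top W$; the helper name `singular_values_2x2` is an assumption, and for real weight matrices you would normally call a library routine such as `numpy.linalg.svd` instead.

```python
import math


def singular_values_2x2(W):
    """Singular values of a 2x2 matrix, largest first.

    They are the square roots of the eigenvalues of W^T W, which for a
    2x2 matrix follow from its trace and determinant.
    """
    (a, b), (c, d) = W
    # Entries of the symmetric matrix G = W^T W.
    g11 = a * a + c * c
    g22 = b * b + d * d
    g12 = a * b + c * d
    trace = g11 + g22
    det = g11 * g22 - g12 * g12
    # Eigenvalues of G solve lambda^2 - trace * lambda + det = 0.
    disc = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    lam_max = (trace + disc) / 2.0
    lam_min = max((trace - disc) / 2.0, 0.0)
    return math.sqrt(lam_max), math.sqrt(lam_min)


W = [[1.5, 0.5],
     [-0.5, 1.2]]
s1, s2 = singular_values_2x2(W)
print(round(s1, 3), round(s2, 3))  # about 1.59 and 1.29
```

Both values are above 1, so this particular $W$ stretches every direction, and stretches its top singular direction about 1.23 times more than the other.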

The Attention Mechanism

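The attention mechanism in Transformers is almost entirely linear algebra; the only nonlinear step is a row-wise softmax.

Given an input sequence of $n$ tokens, each embedded as a $d$-dimensional vector and stacked as the rows of $X \in \mathbb{R}^{n \times d}$, the attention layer first applies three linear projections: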

$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$

where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ are the query, key, and value projection matrices.

The attention weights are obtained by applying softmax to each row of the scaled scores:

$A = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)$

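$QK^\top \in \mathbb{R}^{n \times n}$ is a matrix of dot products: entry $(i, j)$ measures how similar token $i$'s query is to token $j$'s key. The $\sqrt{d_k}$ factor keeps the dot products from growing with the dimension: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so dividing by $\sqrt{d_k}$ brings it back to unit variance. Without the scaling, large scores would make the softmax very peaked.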

The output is the weighted sum of values:

$\text{Attention}(Q, K, V) = A \cdot V$

Each output token is a linear combination of all value vectors, weighted by attention.

Inner product interpretation: $QK^\top$ is computing $n^2$ inner products simultaneously. High dot product = high similarity = high attention weight. The attention matrix $A$ is row-stochastic (rows sum to 1 after softmax) — each output token is a convex combination of value vectors.
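
Here is a small sketch of scaled dot-product attention in plain Python, using the $Q$ and $K$ rows from this post's three-token demo. The demo does not list $V$, so the value matrix below is an assumption chosen for illustration; with that choice the weights and the output for "cat" agree with the numbers the demo displays.

```python
import math


def softmax(row):
    """Numerically stable softmax of one row of scores."""
    top = max(row)
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for matrices stored as lists of rows."""
    d_k = len(K[0])
    # scores[i][j] = <q_i, k_j> / sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d_k)
               for k_j in K] for q_i in Q]
    A = [softmax(row) for row in scores]  # each row sums to 1
    # Output row i is the A[i]-weighted sum of the rows of V.
    out = [[sum(a * v for a, v in zip(A_i, col)) for col in zip(*V)]
           for A_i in A]
    return A, out


tokens = ["cat", "sat", "mat"]
Q = [[0.80, 0.20], [0.10, 0.90], [0.50, 0.50]]  # from the demo
K = [[0.90, 0.10], [0.20, 0.80], [0.60, 0.40]]  # from the demo
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]        # assumed, not given in the post

A, out = attention(Q, K, V)
print([round(a, 3) for a in A[0]])    # weights for "cat": [0.381, 0.283, 0.336]
print([round(o, 3) for o in out[0]])  # output for "cat":  [0.549, 0.451]
```

Because each row of `A` is nonnegative and sums to 1, every output row lies inside the convex hull of the value vectors.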

[Interactive demo: Attention Mechanism, with the token "cat" selected]

Token	Q (query)	K (key)	Attention weight for "cat"
"cat"	[0.80, 0.20]	[0.90, 0.10]	0.381
"sat"	[0.10, 0.90]	[0.20, 0.80]	0.283
"mat"	[0.50, 0.50]	[0.60, 0.40]	0.336

Output for "cat": [0.549, 0.451] (weighted sum of V rows)

Why Depth Works

A single linear unit followed by a threshold can only split the input space into two half-spaces, separated by a hyperplane. Composing $L$ linear layers with ReLU-style nonlinearities yields a piecewise linear function with up to $2^N$ linear regions in the input space, where $N$ is the total number of neurons (each neuron is either active or inactive).

More precisely: a network with $L$ layers of width $w$ can represent some functions with exponentially more "bends" than a shallow network with the same number of parameters can. This is why depth, rather than width alone, is often the key to expressiveness.
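
A classic one-dimensional illustration is the "tent" fold, sketched below under our own naming (`tent`, `deep_tent`, and `count_linear_pieces` are illustrative helpers). One layer with two ReLU units folds $[0, 1]$ onto itself, and stacking that same layer $L$ times doubles the number of linear pieces at every step. A one-hidden-layer ReLU network on a scalar input with $k$ units has at most $k + 1$ pieces, so it would need $2^L - 1$ units to match.

```python
def relu(t):
    return max(0.0, t)


def tent(x):
    """One hidden layer with two ReLU units: a fold of [0, 1] onto itself."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_tent(x, depth):
    """Compose the same two-unit layer `depth` times."""
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_pieces(depth):
    """Count the linear pieces of deep_tent on [0, 1] via slope changes."""
    n = 2 ** (depth + 2)  # dyadic grid fine enough to land on every kink
    ys = [deep_tent(i / n, depth) for i in range(n + 1)]
    slopes = [(ys[i + 1] - ys[i]) * n for i in range(n)]
    return 1 + sum(1 for s, t in zip(slopes, slopes[1:]) if s != t)


for depth in (1, 2, 3, 4):
    print(depth, count_linear_pieces(depth))  # 2, 4, 8, 16 pieces
```

The network at depth $L$ uses only $2L$ hidden units, yet its output has $2^L$ linear pieces.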

LoRA: Low-Rank Adaptation

Large language models have billions of parameters, so fine-tuning all of them is expensive.

LoRA (Low-Rank Adaptation) exploits the observation that fine-tuning updates tend to be low-rank. Instead of updating $W \in \mathbb{R}^{m \times n}$ directly, we parameterize the update as:

$W = W_0 + AB$

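where $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$ with $r \ll \min(m, n)$. The pretrained weights $W_0$ stay frozen; only $A$ and $B$ are trained.

The number of trainable parameters drops from $mn$ to $r(m + n)$. For $m = n = 4096$ and $r = 16$:

Full update: $4096^2 \approx 16.8\text{M}$ parameters
LoRA: $16 \times 8192 \approx 131\text{K}$ parameters, 128× fewer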

This is a direct application of the low-rank matrix approximation idea from Post 9. LoRA works because fine-tuning updates explore a low-dimensional subspace of the full parameter space.
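
The parameter arithmetic, and the way the low-rank update is applied, can be sketched in a few lines. The helper names below are illustrative and not taken from any LoRA library; the tiny rank-1 matrices at the end are made-up values.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def full_params(m, n):
    """Parameters in a dense m-by-n update."""
    return m * n


def lora_params(m, n, r):
    """Parameters in a rank-r update A (m x r) times B (r x n)."""
    return r * (m + n)


def lora_forward(W0, A, B, x):
    """Compute (W0 + A B) x without ever forming the m-by-n product A B."""
    base = matvec(W0, x)              # frozen pretrained path
    update = matvec(A, matvec(B, x))  # B x has only r entries
    return [p + q for p, q in zip(base, update)]


m = n = 4096
r = 16
print(full_params(m, n))                          # 16777216
print(lora_params(m, n, r))                       # 131072
print(full_params(m, n) // lora_params(m, n, r))  # 128

# Tiny rank-1 example with illustrative values.
W0 = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]
B = [[0.5, -0.5]]
print(lora_forward(W0, A, B, [2.0, 4.0]))  # [1.0, 2.0]
```

Applying $B$ first and then $A$ keeps the extra cost of the update proportional to $r(m + n)$ rather than $mn$.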

[Interactive demo: LoRA, W ≈ W₀ + AB (low-rank update). For a 12×16 weight matrix at rank r = 2: 192 full parameters, 56 LoRA parameters, 71% parameter savings.]

Gradient Flow Through Layers

Backpropagation through a linear layer $\mathbf{y} = W\mathbf{x}$ is itself a matrix multiplication.

If the loss is $\ell$ and $\frac{\partial \ell}{\partial \mathbf{y}} = \boldsymbol{\delta}$, then by the chain rule:

$\frac{\partial \ell}{\partial \mathbf{x}} = W^\top \boldsymbol{\delta}, \qquad \frac{\partial \ell}{\partial W} = \boldsymbol{\delta} \mathbf{x}^\top$

The gradient with respect to the input passes through $W^\top$ — the transpose of the forward weight matrix. Backpropagation is exactly the chain rule of matrix calculus.

This is why the condition number of $W$ matters: a badly conditioned $W$ amplifies gradients in some directions and crushes them in others. Compounded across many layers, singular values well above 1 make gradients explode and singular values well below 1 make them vanish.
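
The two gradient formulas can be sanity-checked against finite differences. The sketch below assumes an illustrative loss $\ell = \tfrac{1}{2}\lVert W\mathbf{x} \rVert^2$, for which $\boldsymbol{\delta} = \mathbf{y}$; the loss, the input, and the helper names are our assumptions, and the weights are the demo matrix from earlier in the post.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]


def loss(W, x):
    """Illustrative loss: l = 0.5 * ||W x||^2, so dl/dy = y."""
    return 0.5 * sum(y_i * y_i for y_i in matvec(W, x))


def backward(W, x):
    """Analytic gradients: dl/dx = W^T delta and dl/dW = delta x^T."""
    delta = matvec(W, x)  # dl/dy = y for this loss
    grad_x = [sum(W[i][j] * delta[i] for i in range(len(W)))
              for j in range(len(x))]
    grad_W = [[d_i * x_j for x_j in x] for d_i in delta]
    return grad_x, grad_W


def numeric_grad_x(W, x, eps=1e-6):
    """Central finite differences for dl/dx."""
    grads = []
    for j in range(len(x)):
        hi = x[:j] + [x[j] + eps] + x[j + 1:]
        lo = x[:j] + [x[j] - eps] + x[j + 1:]
        grads.append((loss(W, hi) - loss(W, lo)) / (2 * eps))
    return grads


W = [[1.5, 0.5], [-0.5, 1.2]]
x = [1.0, -2.0]
grad_x, grad_W = backward(W, x)
print(grad_x)                # approximately [2.2, -3.23]
print(numeric_grad_x(W, x))  # should agree closely with grad_x
```

Note that `grad_W` has the same shape as `W`: it is the outer product of the upstream gradient with the layer input.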

Summary

Neural network concept	Linear algebra concept
Linear layer $W\mathbf{x} + \mathbf{b}$	Matrix-vector multiplication
Layer composition (no nonlinearity)	Matrix product
Expressiveness of deep nets	Piecewise linear regions via nonlinear folding
Attention scores $QK^\top$	Gram matrix of inner products
Attention output	Weighted average (row of stochastic matrix × $V$ )
Weight matrix structure	SVD: rotation, scale, rotation
LoRA fine-tuning	Low-rank matrix decomposition
Backpropagation	Chain rule via matrix transpose

Neural networks are linear algebra engines with nonlinear hinges inserted. Understanding the linear algebra gives you the geometry; understanding the nonlinearities gives you the expressiveness.

In Post 13 we synthesize everything from the whole series into one unified picture.
