S. Roy

Blog Post

Putting It All Together — From Vectors to Transformers

A tour through the whole series: how vectors, matrices, eigendecomposition, SVD, and least squares combine to explain the mathematical machinery inside modern ML systems — from PCA to attention to gradient descent.

This is the final post in the series. No new math — just synthesis. We'll walk through a complete story from raw data to transformer, with each step grounded in the concepts we've built.


1. Data Lives in a Vector Space

Every piece of data is a vector. An image with $28 \times 28$ pixels is a vector in $\mathbb{R}^{784}$. A sentence tokenized into $n$ tokens is a sequence of vectors in $\mathbb{R}^d$ (one per token, after embedding).

In Post 1 we defined vectors, their geometry, and what it means to add them and scale them. In Post 2 we saw that vector spaces are the natural habitat for data — closed under addition and scaling, spanned by a basis.

The key insight: the dimension of the space is the number of features. Working in $\mathbb{R}^{784}$ is conceptually the same as working in $\mathbb{R}^3$, just with more axes.
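
As a minimal sketch of this idea in NumPy: the random pixel values, the 5-token sentence, and the embedding width of 16 below are illustrative assumptions, not data from the series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 28x28 grayscale image; random values stand in for real pixels.
image = rng.random((28, 28))

# Flattening reads the pixel grid row by row into one vector in R^784.
x = image.reshape(-1)

# Hypothetical sentence: n = 5 tokens, each embedded in R^d with d = 16.
n_tokens, d = 5, 16
sentence = rng.standard_normal((n_tokens, d))

print(x.shape)         # (784,)
print(sentence.shape)  # (5, 16)
```

Nothing about the later operations changes with the number of axes: the same dot products and matrix products apply to `x` as to a vector in three dimensions.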


2. Linear Maps Transform the Space

A matrix $W \in \mathbb{R}^{m \times n}$ is a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$. It stretches, rotates, and projects.

In Post 3 we studied how matrices compose (multiply), what the rank says about a matrix's image, and how to think of column space and null space geometrically.
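
A small NumPy check of linearity, composition, and rank may help make these concrete. The matrices here are made up for illustration, and the names `W`, `V`, and `R` are not taken from the earlier posts.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # a linear map from R^4 to R^3
V = rng.standard_normal((2, 3))   # a linear map from R^3 to R^2
x = rng.standard_normal(4)
y = rng.standard_normal(4)

# Linearity: W(a x + b y) equals a W x + b W y.
a, b = 2.0, -0.5
linear_ok = np.allclose(W @ (a * x + b * y), a * (W @ x) + b * (W @ y))

# Composition: applying W and then V is the single matrix V @ W.
compose_ok = np.allclose(V @ (W @ x), (V @ W) @ x)

# Rank: an outer product of two nonzero vectors has a one-dimensional
# column space, so its rank is 1 even though the matrix is 3x4.
R = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, -1.0, 2.0])
rank_R = np.linalg.matrix_rank(R)

print(linear_ok, compose_ok, rank_R)
```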

In Post 4 we added the inner product, a way to measure angles and lengths in vector spaces. The identity $\mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta$ lets us define similarity as the cosine of the angle between two vectors. Attention builds on the same idea, using scaled dot products rather than normalized cosines.
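
The cosine formula can be sketched in a few lines. The helper name `cosine_similarity` and the three example vectors are assumptions chosen so the angles are easy to verify by hand.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| |v|)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])   # 45 degrees away from u
w = np.array([0.0, 3.0])   # orthogonal to u; its length does not matter

cos_uv = cosine_similarity(u, v)
cos_uw = cosine_similarity(u, w)
angle_uv = np.degrees(np.arccos(cos_uv))

print(cos_uv, cos_uw, angle_uv)
```

Because the norms are divided out, scaling `w` leaves its similarity to `u` unchanged, which is the difference between cosine similarity and a raw dot product.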


3. Eigendecomposition: Finding the Natural Axes

Given a matrix $A$, its eigenvectors $\mathbf{q}_i$ are the directions that $A$ only scales, without rotating:

$$A\mathbf{q}_i = \lambda_i \mathbf{q}_i$$

For symmetric matrices (covariance matrices, graph Laplacians, Hessians), the spectral theorem (Post 10) guarantees:

$$A = Q \Lambda Q^\top$$

with orthogonal $Q$ and real diagonal $\Lambda$. The eigenvectors are the natural coordinate system of the transformation.

PCA is the canonical application. For mean-centered data $X$ with $n$ samples as rows, the covariance matrix $C = \frac{1}{n} X^\top X$ is symmetric PSD. Its eigenvectors (the principal components) are the directions of maximum variance. Projecting the data onto the top-$k$ eigenvectors reduces dimensionality while retaining as much variance as any $k$-dimensional linear projection can.
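
Here is a hedged sketch of PCA by eigendecomposition. The synthetic dataset (500 samples, 3 features, most variance along one hidden direction) is an assumption made so the result can be checked against a known answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: strong variation along one direction plus small noise.
n = 500
direction = np.array([[1.0, 1.0, 0.0]]) / np.sqrt(2.0)
latent = 3.0 * rng.standard_normal((n, 1))
X = latent @ direction + 0.1 * rng.standard_normal((n, 3))

# Center the data, then form the covariance matrix C = (1/n) Xc^T Xc.
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / n

# eigh is meant for symmetric matrices and returns eigenvalues in
# ascending order, so reorder to put the largest variance first.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the top-k principal components.
k = 1
Z = Xc @ eigvecs[:, :k]
explained = eigvals[:k].sum() / eigvals.sum()

print(Z.shape, explained)
```

With this setup the first principal component should line up with `direction` (up to sign), and a single component should account for nearly all of the variance.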


4. SVD: Eigendecomposition for Non-Square Matrices

Most data matrices are non-square ($m$ samples, $n$ features, rarely $m = n$). The SVD from Post 9 generalizes eigendecomposition:

$$X = U \Sigma V^\top$$

  • Columns of $V$: directions in feature space (right singular vectors, which are the principal components when $X$ is mean-centered)
  • Diagonal of $\Sigma$: singular values $= \sqrt{\text{eigenvalues of } X^\top X}$
  • Columns of $U$: directions in sample space
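
These relationships can be checked numerically. The sketch below uses an arbitrary random $6 \times 4$ matrix as a stand-in for a data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))   # 6 samples, 4 features (illustrative)

# Reduced SVD: U is 6x4, s holds the 4 singular values, Vt is 4x4.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The three factors reproduce X.
recon_ok = np.allclose(U @ np.diag(s) @ Vt, X)

# Singular values are the square roots of the eigenvalues of X^T X.
# eigvalsh returns ascending order, so reverse to match s (descending).
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]
sv_ok = np.allclose(s, np.sqrt(np.clip(eigvals, 0.0, None)))

# Columns of U and rows of Vt are orthonormal.
ortho_ok = np.allclose(U.T @ U, np.eye(4)) and np.allclose(Vt @ Vt.T, np.eye(4))

print(recon_ok, sv_ok, ortho_ok)
```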

SVD gives us:

  • Low-rank approximation: keep the top-$k$ singular values → compress the matrix
  • Pseudoinverse: $X^+ = V \Sigma^+ U^\top$ → solve least squares even when $X$ is rank-deficient
  • LoRA: fine-tune only a low-rank delta, $W = W_0 + AB$
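
The three uses above can be sketched as follows. The matrix sizes, the rank `k = 2`, and the LoRA dimensions `d = 1024`, `r = 8` are illustrative assumptions. The fact that the truncation error equals the first discarded singular value is the Eckart-Young theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Low-rank approximation: keep only the top-k singular triplets.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# In the spectral norm, the error is the first discarded singular value.
err = np.linalg.norm(X - X_k, ord=2)
truncation_ok = np.isclose(err, s[k])

# Pseudoinverse from the SVD. X has full column rank here, so every
# singular value is nonzero and can be inverted directly.
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
pinv_ok = np.allclose(X_pinv, np.linalg.pinv(X))

# LoRA-style parameter count for a hypothetical d x d weight matrix
# updated through thin factors A (d x r) and B (r x d).
d, r = 1024, 8
full_params = d * d
lora_params = d * r + r * d

print(truncation_ok, pinv_ok, full_params // lora_params)
```

For a rank-deficient matrix, the zero singular values would be left at zero in $\Sigma^+$ rather than inverted, which is what `np.linalg.pinv` does with its cutoff.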

5. Solving Systems: Least Squares and Normal Equations

When we train a linear model, we minimize $\|X\mathbf{w} - \mathbf{y}\|^2$. The minimizer satisfies the normal equations (Post 8):

$$X^\top X \mathbf{w} = X^\top \mathbf{y}$$

The matrix $X^\top X$ is symmetric positive semidefinite. If it is invertible (positive definite, which happens exactly when $X$ has full column rank), the unique solution is:

$$\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y} = X^+ \mathbf{y}$$

We covered the factorizations that make this numerically tractable in Post 7: LU for general systems, QR for least squares (more stable than forming $X^\top X$, which squares the condition number), and Cholesky (Post 11) for PD systems.
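
As a rough illustration, the same least-squares problem can be solved four ways and the answers compared. The synthetic data and the weights in `w_true` are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.standard_normal((n, d))
w_true = np.array([2.0, -1.0, 0.5])          # hypothetical ground truth
y = X @ w_true + 0.01 * rng.standard_normal(n)

# 1. Normal equations: solve (X^T X) w = X^T y directly.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# 2. QR: X = QR with orthonormal Q, so R w = Q^T y.
Q, R = np.linalg.qr(X)                        # reduced QR by default
w_qr = np.linalg.solve(R, Q.T @ y)

# 3. Cholesky: X^T X = L L^T, then two triangular systems.
#    (np.linalg.solve is a general solver; a dedicated triangular
#    solver would exploit the structure.)
L = np.linalg.cholesky(X.T @ X)
z = np.linalg.solve(L, X.T @ y)
w_chol = np.linalg.solve(L.T, z)

# 4. Pseudoinverse: w = X^+ y.
w_pinv = np.linalg.pinv(X) @ y

print(w_normal, w_qr, w_chol, w_pinv)
```

On a well-conditioned problem like this one, all four agree to many digits. The differences between them show up when $X$ is close to rank-deficient.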


6. Neural Networks: Composing Linear Maps with Nonlinearities

A neural network layer is $\mathbf{y} = \sigma(W\mathbf{x} + \mathbf{b})$, where $\sigma$ is a nonlinearity.

As we showed in Post 12:

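  • Without nonlinearities, deep networks collapse to one linear map
  • Backpropagation is the chain rule applied to matrix operations: gradients flow backward through $W^\top$
  • A positive definite Hessian (Post 11) at a critical point guarantees a strict local minimum; if the Hessian is positive definite everywhere, the loss is strictly convex and that minimum is the unique global one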
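
The first two points can be checked numerically. In this sketch the layer sizes and the quadratic loss are assumptions chosen to keep the gradient formula short.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((3, 5)), rng.standard_normal(3)
x = rng.standard_normal(4)

# Two layers with no activation equal one affine map W x + b.
deep = W2 @ (W1 @ x + b1) + b2
W, b = W2 @ W1, W2 @ b1 + b2
collapse_ok = np.allclose(deep, W @ x + b)

# Gradient through a linear layer: for loss(x) = 0.5 * ||W1 x + b1||^2
# the chain rule gives W1^T (W1 x + b1).
def loss(x_in):
    h = W1 @ x_in + b1
    return 0.5 * h @ h

analytic = W1.T @ (W1 @ x + b1)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.array([
    (loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
grad_ok = np.allclose(analytic, numeric, atol=1e-5)

print(collapse_ok, grad_ok)
```

Inserting a nonlinearity such as ReLU between the two layers breaks the collapse, because `W2 @ relu(W1 @ x + b1)` can no longer be rewritten as a single matrix applied to `x`.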

Attention in transformers is:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

$QK^\top$ is a matrix of inner products, measuring similarity between queries and keys. Softmax, applied row by row, makes it row-stochastic (nonnegative entries, each row summing to 1). The output is a weighted sum of value vectors. Pure linear algebra, with softmax as the only nonlinearity.
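
A minimal single-head sketch of the formula in NumPy follows. It leaves out masking, batching, and learned projections, and the sizes `n = 4`, `d_k = 8`, `d_v = 6` are arbitrary assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # pairwise query-key inner products
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 6
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

out, weights = attention(Q, K, V)
print(out.shape, weights.sum(axis=1))
```

Each output row is a convex combination of the rows of `V`, with coefficients given by the corresponding row of `weights`.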


7. The Full ML Pipeline

Figure: The ML pipeline (Raw Data → Embed → PCA / SVD → Linear Layer → Attention → Output)
| Step | Math | Post |
|---|---|---|
| Raw data | Vectors in $\mathbb{R}^n$ | 1, 2 |
| Preprocessing | Linear maps, projections | 3, 4 |
| Dimensionality reduction | PCA = SVD of centered data (eigendecomposition of covariance) | 5, 9 |
| Linear model training | Least squares, normal equations | 8 |
| Deep learning forward pass | Matrix multiplications + nonlinearities | 12 |
| Attention mechanism | Inner products + softmax | 12 |
| Fine-tuning (LoRA) | Low-rank matrix decomposition | 9, 12 |
| Optimization (Newton) | PD Hessian → strict local min (unique if PD everywhere) | 11 |

Concept Map

  • Post 1: Vectors
  • Post 2: Vector Spaces
  • Post 3: Matrices
  • Post 4: Inner Products
  • Post 5: Eigenvalues
  • Post 6: Determinants
  • Post 7: LU / QR
  • Post 8: Least Squares
  • Post 9: SVD
  • Post 10: Spectral Thm
  • Post 11: PD Matrices
  • Post 12: Neural Nets
  • Post 13: Synthesis

Quick Review Quiz

Test yourself on the key ideas from the series:

What does the spectral theorem say about a real symmetric matrix A?

Where to Go Next

You now have a working foundation in the linear algebra that underlies modern ML. Here are natural next steps:

Differential equations: Many dynamical systems (RNNs, differential transformers, physics-informed networks) are governed by ODEs/PDEs. For a linear system, the eigenvalues of the system matrix determine stability. Highly recommended: Strang's Differential Equations and Linear Algebra.

Fourier analysis: The discrete Fourier transform (DFT) is a linear transformation: $\hat{\mathbf{x}} = F\mathbf{x}$, where $F$ is a unitary matrix once scaled by $1/\sqrt{N}$ (the analogue of an orthogonal matrix over $\mathbb{C}$). Circular convolution in the spatial domain equals element-wise multiplication in the Fourier domain, which is why CNNs are intimately related.
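
Both claims can be verified on a small example. The length `N = 8` and the random signals are assumptions. `np.fft.fft` uses the unnormalized convention, so it differs from the unitary `F` by a factor of $\sqrt{N}$.

```python
import numpy as np

N = 8
idx = np.arange(N)

# DFT matrix with the 1/sqrt(N) scaling that makes it unitary.
F = np.exp(-2j * np.pi * np.outer(idx, idx) / N) / np.sqrt(N)
unitary_ok = np.allclose(F.conj().T @ F, np.eye(N))

rng = np.random.default_rng(0)
x = rng.standard_normal(N)
h = rng.standard_normal(N)

# F @ x agrees with NumPy's FFT up to the normalization factor.
matches_fft = np.allclose(F @ x, np.fft.fft(x) / np.sqrt(N))

# Circular convolution computed directly from its definition...
circ = np.array([
    sum(x[m] * h[(k - m) % N] for m in range(N))
    for k in range(N)
])

# ...and via element-wise multiplication in the Fourier domain.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real
conv_ok = np.allclose(circ, via_fft)

print(unitary_ok, matches_fft, conv_ok)
```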

Probability and statistics: The multivariate Gaussian distribution is entirely described by its mean vector and covariance matrix (symmetric PSD). Under Gaussian noise, maximum likelihood estimation of a linear model reduces to least squares. Bayesian inference with Gaussians involves PD matrices everywhere.

Numerical linear algebra: How do you actually compute eigenvalues for a $10^6 \times 10^6$ matrix? Power iteration, Lanczos, randomized SVD. The algorithms matter enormously at scale.
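
Power iteration is the simplest of these, and a minimal sketch fits in a few lines. The $3 \times 3$ symmetric matrix is a toy stand-in for a large one, and the function name and iteration count are assumptions. The method only needs matrix-vector products, which is why it scales to matrices that cannot be factored directly.

```python
import numpy as np

def power_iteration(A, num_iters=500, seed=0):
    """Estimate the dominant eigenvalue and eigenvector of A.

    Repeatedly applies A and renormalizes. Convergence assumes one
    eigenvalue is strictly largest in magnitude and that the random
    start is not orthogonal to its eigenvector.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = A @ v
        v = w / np.linalg.norm(w)
    # Rayleigh quotient gives the eigenvalue estimate for unit-norm v.
    return v @ A @ v, v

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

lam, v = power_iteration(A)
lam_ref = np.linalg.eigvalsh(A)[-1]   # largest eigenvalue, for comparison

print(lam, lam_ref)
```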

Convex optimization: Gradient descent, Newton's method, and proximal methods all build on the linear algebra of Hessians, gradients (vectors), and constraint matrices. Boyd & Vandenberghe's Convex Optimization is the standard reference.
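
To see how the Hessian enters, here is a hedged comparison of gradient descent and Newton's method on a small quadratic $f(\mathbf{w}) = \tfrac{1}{2}\mathbf{w}^\top A \mathbf{w} - \mathbf{b}^\top \mathbf{w}$ with positive definite $A$. The particular `A`, `b`, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

# Quadratic objective with PD Hessian A, so the minimizer is unique
# and solves A w = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
w_star = np.linalg.solve(A, b)

def grad(w):
    return A @ w - b

w0 = np.zeros(2)

# Newton's method: one step with the exact Hessian lands on the
# minimizer of a quadratic from any starting point.
w_newton = w0 - np.linalg.solve(A, grad(w0))

# Gradient descent with step 1/L, where L is the largest eigenvalue
# of the Hessian. The gap between the smallest and largest
# eigenvalues controls how fast this converges.
step = 1.0 / np.linalg.eigvalsh(A)[-1]
w_gd = w0.copy()
for _ in range(200):
    w_gd = w_gd - step * grad(w_gd)

print(w_star, w_newton, w_gd)
```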


Closing Thought

Linear algebra is the operating system of machine learning. Every tensor operation, every gradient update, every attention pattern is a linear algebraic object. Understanding what the matrices are doing — geometrically, not just numerically — is what separates someone who can train models from someone who can design them.

Thanks for following along. The math is only the beginning.

Cite this work

Roy, Swastik. (2026). Putting It All Together — From Vectors to Transformers. S. Roy. https://swastikroy.me/blog/linear-algebra-putting-together
