S. Roy

Blog Post

Matrices as Linear Maps

A matrix is not just a grid of numbers — it's a function that transforms space. This post builds the geometric intuition for matrix-vector multiplication as rotation, scaling, and shearing.


Every time a transformer model computes attention, it performs a sequence of matrix multiplications. The query, key, and value matrices $W_Q$, $W_K$, $W_V$ transform token embeddings into new spaces where similarity can be measured. The attention output is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

That $QK^\top$ is a matrix product. So is the projection $W_Q x$ that maps a token embedding $x$ into the query space. If you want to understand why transformers work the way they do, you need to understand what a matrix multiplication does to a vector: not just how to compute it, but what it means geometrically. A matrix is not a grid of numbers. It is a function that transforms space.
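To make the formula concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The toy values of `Q`, `K`, and `V` are made up for illustration, and the helper names (`matmul`, `transpose`, `softmax`, `attention`) are assumptions of this sketch rather than any library's API.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    # Q K^T: one dot product per (query, key) pair.
    scores = matmul(Q, transpose(K))
    # Scale by sqrt(d_k), then normalise each row into weights that sum to 1.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output row is a weighted average of the rows of V.
    return weights, matmul(weights, V)

# Toy inputs (illustrative only): 2 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

weights, output = attention(Q, K, V)
print(weights)
print(output)
```

Every step except the softmax is a matrix product, which is why the rest of this post focuses on what such a product does geometrically.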

What Is a Linear Map?

A linear map (also called a linear transformation) is a function $f: \mathbb{R}^n \to \mathbb{R}^m$ that satisfies two properties:

Additivity: $f(\mathbf{u} + \mathbf{v}) = f(\mathbf{u}) + f(\mathbf{v})$

Homogeneity: $f(\alpha \mathbf{u}) = \alpha f(\mathbf{u})$

These two rules can be combined into a single condition: for any scalars $\alpha, \beta$ and vectors $\mathbf{u}, \mathbf{v}$,

$$f(\alpha \mathbf{u} + \beta \mathbf{v}) = \alpha f(\mathbf{u}) + \beta f(\mathbf{v})$$

This is called superposition. It says that linear maps respect the structure of vector addition and scalar multiplication. They do not bend, curve, or shift the origin; they can only stretch, rotate, shear, reflect, or flatten space onto a lower-dimensional subspace.
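The superposition condition can be checked numerically. The sketch below uses a sample map `f` whose coefficients are chosen arbitrarily for illustration; `combo` is a hypothetical helper that forms $\alpha \mathbf{u} + \beta \mathbf{v}$.

```python
def f(v):
    """A sample linear map on R^2 (coefficients are arbitrary)."""
    x, y = v
    return (2 * x + y, 5 * x - 3 * y)

def combo(alpha, u, beta, v):
    """Return alpha * u + beta * v, componentwise."""
    return tuple(alpha * ui + beta * vi for ui, vi in zip(u, v))

u, v = (1.0, 2.0), (-3.0, 0.5)
alpha, beta = 2.0, -1.5

lhs = f(combo(alpha, u, beta, v))        # f(alpha u + beta v)
rhs = combo(alpha, f(u), beta, f(v))     # alpha f(u) + beta f(v)

print(lhs, rhs)
```

A single pair of vectors does not prove linearity, but any mismatch between `lhs` and `rhs` would be enough to rule it out.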

Some examples of linear maps:

  • Rotating the plane by 30°
  • Projecting vectors onto the $x$-axis
  • Stretching every vector by a factor of 3
  • Reflecting across the $y$-axis

Some things that are not linear maps:

  • Translating every vector by a fixed offset (it moves the origin)
  • Squaring: $f(x) = x^2$ (fails additivity: $f(1 + 1) = 4$, but $f(1) + f(1) = 2$)
  • Any function with $f(\mathbf{0}) \neq \mathbf{0}$ (homogeneity with $\alpha = 0$ requires $f(\mathbf{0}) = \mathbf{0}$)
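The failures in the list above can be seen directly. In this sketch the offset `(1, 2)` used by `translate` is a hypothetical choice, and `add` is a small helper for vector addition.

```python
def add(u, v):
    return tuple(ui + vi for ui, vi in zip(u, v))

def translate(v):
    # Hypothetical fixed offset (1, 2), chosen only for illustration.
    return (v[0] + 1.0, v[1] + 2.0)

def square(x):
    return x * x

u, v = (1.0, 0.0), (0.0, 1.0)

# Translation moves the origin, so it cannot be linear.
moves_origin = translate((0.0, 0.0)) != (0.0, 0.0)

# Additivity fails for translation: the offset is added once on the left, twice on the right.
translate_additive = translate(add(u, v)) == add(translate(u), translate(v))

# Additivity fails for squaring: (1 + 2)^2 = 9, but 1^2 + 2^2 = 5.
square_additive = square(1.0 + 2.0) == square(1.0) + square(2.0)

print(moves_origin, translate_additive, square_additive)
```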

The remarkable fact is that every linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ can be written as a matrix multiplication. Matrices are the language of linear maps.
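One way to see this is to recover the matrix from the function itself: apply the map to each standard basis vector and record the results as columns. The following is a sketch under the assumption that `f` really is linear; `matrix_of` and `matvec` are hypothetical helper names.

```python
def f(v):
    """A sample linear map on R^2 (coefficients are arbitrary)."""
    x, y = v
    return (2 * x + y, 5 * x - 3 * y)

def matrix_of(f, n):
    """Build the matrix of a linear map f: R^n -> R^m column by column."""
    columns = []
    for j in range(n):
        e_j = tuple(1.0 if i == j else 0.0 for i in range(n))
        columns.append(f(e_j))
    # columns[j] holds f(e_j); transpose so the result is a list of rows.
    return [list(row) for row in zip(*columns)]

def matvec(A, x):
    return tuple(sum(a * xi for a, xi in zip(row, x)) for row in A)

A = matrix_of(f, 2)
x = (4.0, -1.0)

print(A)
print(matvec(A, x), f(x))
```

Multiplying by the recovered matrix `A` reproduces `f` on every input, not only on the basis vectors used to build it.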

Matrices as Linear Maps: The Column Picture

Let $A$ be a $2 \times 2$ matrix:

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$$

The product $A\mathbf{x}$ for $\mathbf{x} = [x_1, x_2]^\top$ is:

$$A\mathbf{x} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = x_1 \begin{bmatrix} a \\ c \end{bmatrix} + x_2 \begin{bmatrix} b \\ d \end{bmatrix}$$

This is a linear combination of the columns of $A$. The first column $[a, c]^\top$ is exactly where the standard basis vector $\mathbf{e}_1 = [1, 0]^\top$ lands:

$$A\mathbf{e}_1 = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\begin{bmatrix}1\\0\end{bmatrix} = \begin{bmatrix}a\\c\end{bmatrix}$$

The second column $[b, d]^\top$ is where $\mathbf{e}_2 = [0, 1]^\top$ lands:

$$A\mathbf{e}_2 = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\begin{bmatrix}0\\1\end{bmatrix} = \begin{bmatrix}b\\d\end{bmatrix}$$

This is the key insight: the columns of a matrix tell you where the basis vectors go. Once you know where $\mathbf{e}_1$ and $\mathbf{e}_2$ land, you know where everything lands, because every vector is a linear combination of the basis vectors, and linear maps preserve linear combinations.

If $\mathbf{x} = x_1 \mathbf{e}_1 + x_2 \mathbf{e}_2$, then:

$$A\mathbf{x} = A(x_1 \mathbf{e}_1 + x_2 \mathbf{e}_2) = x_1 A\mathbf{e}_1 + x_2 A\mathbf{e}_2$$

The entire transformation is determined by where the basis lands.
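The column picture and the familiar row-by-row rule give the same answer. The sketch below computes $A\mathbf{x}$ both ways; the matrix entries and the function names are illustrative assumptions.

```python
A = [[2.0, 1.0],
     [5.0, -3.0]]
x = (4.0, -1.0)

def matvec_rows(A, x):
    """Row picture: each output entry is (row of A) dot x."""
    return tuple(sum(a * xi for a, xi in zip(row, x)) for row in A)

def matvec_columns(A, x):
    """Column picture: A x = x_1 * (column 1) + x_2 * (column 2) + ..."""
    columns = list(zip(*A))
    result = [0.0] * len(A)
    for x_j, column in zip(x, columns):
        for i, entry in enumerate(column):
            result[i] += x_j * entry
    return tuple(result)

print(matvec_rows(A, x))
print(matvec_columns(A, x))
```

Here the columns are $[2, 5]^\top$ and $[1, -3]^\top$, so the result is $4 \cdot [2, 5]^\top - 1 \cdot [1, -3]^\top = [7, 23]^\top$.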

[Interactive figure: sliders set the entries a, b, c, d of Matrix A. Default: the identity matrix, so e₁ = (1, 0) → (1.00, 0.00) (column 1) and e₂ = (0, 1) → (0.00, 1.00) (column 2).]

Dashed gray = original unit square. Purple = transformed square. Red/blue arrows show where basis vectors land — these are exactly the columns of the matrix.

Drag the sliders to change the matrix entries. Notice how the columns of the matrix correspond exactly to where the red and blue basis vectors land. The gray unit square deforms into a parallelogram — that parallelogram is the image of the square under the linear map.

Geometric Transformations in 2D

Different matrix structures produce recognizable geometric effects.

Rotation

To rotate vectors counterclockwise by angle $\theta$:

$$R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

For $\theta = 45^\circ$:

$$R_{45^\circ} = \begin{bmatrix} \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \\[4pt] \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \end{bmatrix} \approx \begin{bmatrix} 0.707 & -0.707 \\ 0.707 & 0.707 \end{bmatrix}$$

The columns tell the story: $\mathbf{e}_1 = [1,0]^\top$ rotates to $[0.707, 0.707]^\top$ (pointing northeast), and $\mathbf{e}_2 = [0,1]^\top$ rotates to $[-0.707, 0.707]^\top$ (pointing northwest).
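As a quick check of the 45° example, here is a small sketch that builds $R_\theta$ and applies it to the basis vectors. The function names `rotation` and `matvec` are assumptions of this sketch.

```python
import math

def rotation(theta):
    """Counterclockwise rotation matrix for an angle theta in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s],
            [s, c]]

def matvec(A, x):
    return tuple(sum(a * xi for a, xi in zip(row, x)) for row in A)

R45 = rotation(math.radians(45))

e1_image = matvec(R45, (1.0, 0.0))   # first column of R45
e2_image = matvec(R45, (0.0, 1.0))   # second column of R45

print([round(c, 3) for c in e1_image])
print([round(c, 3) for c in e2_image])

# A rotation should not change lengths.
print(math.hypot(*e1_image), math.hypot(*e2_image))
```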

Scaling

Scaling by factor $s_x$ in the $x$-direction and $s_y$ in the $y$-direction:

$$S = \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix}$$

Uniform scaling by factor 2: $S = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$. Every vector doubles in length; the square becomes a larger square.

Shearing

A horizontal shear shifts the $x$-coordinate by $k$ times the $y$-coordinate:

$$H = \begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}$$

For $k = 1$: $\mathbf{e}_1$ stays fixed at $[1,0]^\top$, but $\mathbf{e}_2 = [0,1]^\top$ moves to $[1,1]^\top$. The unit square tilts into a parallelogram. This is exactly what happens when you drag the top of a rectangle sideways while holding the bottom fixed.
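The tilt is easy to trace by sending the four corners of the unit square through $H$ with $k = 1$. This is a minimal sketch; `matvec` and the corner ordering are illustrative choices.

```python
def matvec(A, x):
    return tuple(sum(a * xi for a, xi in zip(row, x)) for row in A)

k = 1.0
H = [[1.0, k],
     [0.0, 1.0]]

# Corners of the unit square, counterclockwise from the origin.
unit_square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
sheared = [matvec(H, corner) for corner in unit_square]

for before, after in zip(unit_square, sheared):
    print(before, "->", after)
```

The bottom edge (where $y = 0$) does not move, while the top edge (where $y = 1$) slides one unit to the right.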

Reflection

Reflection across the $y$-axis flips the sign of the $x$-coordinate:

$$F_y = \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix}$$

$\mathbf{e}_1$ maps to $[-1,0]^\top$ (flipped left), $\mathbf{e}_2$ stays at $[0,1]^\top$.

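[Interactive figure: transformation presets. The selected preset, Rotate 45°, has matrix $\begin{bmatrix} \cos 45^\circ & -\sin 45^\circ \\ \sin 45^\circ & \cos 45^\circ \end{bmatrix}$.]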

Click each preset to animate the unit square transforming. Dashed gray = original.

Matrix-Matrix Multiplication as Function Composition

If $A$ and $B$ are both linear maps, what is their composition $A \circ B$, the function that first applies $B$ and then applies $A$?

$$(A \circ B)(\mathbf{x}) = A(B\mathbf{x})$$

The matrix that represents this composition is the matrix product $AB$:

$$(AB)\mathbf{x} = A(B\mathbf{x})$$

This is not just a notational convenience; it is the definition. Matrix multiplication is defined precisely so that it corresponds to function composition. To compute $AB$, the $j$-th column of $AB$ is $A$ applied to the $j$-th column of $B$. Writing $B_j$ for the $j$-th column of $B$:

$$(AB)_j = A \cdot B_j$$

For two $2 \times 2$ matrices:

$$AB = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} a_{11}b_{11}+a_{12}b_{21} & a_{11}b_{12}+a_{12}b_{22} \\ a_{21}b_{11}+a_{22}b_{21} & a_{21}b_{12}+a_{22}b_{22} \end{bmatrix}$$

Each entry $(AB)_{ij}$ is the dot product of the $i$-th row of $A$ with the $j$-th column of $B$. But the deeper meaning is composition: first apply $B$, then apply $A$.
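Both descriptions of the product, column by column and entry by entry, can be written out and compared. The sketch below uses arbitrary example matrices and hypothetical helper names, and also checks that $(AB)\mathbf{x} = A(B\mathbf{x})$ on one vector.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul_by_columns(A, B):
    """Column j of AB is A applied to column j of B."""
    product_columns = [matvec(A, list(column)) for column in zip(*B)]
    return [list(row) for row in zip(*product_columns)]

def matmul_by_entries(A, B):
    """Entry (i, j) of AB is (row i of A) dot (column j of B)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [5.0, -1.0]]
x = [2.0, 3.0]

AB = matmul_by_columns(A, B)

print(AB)
print(matmul_by_entries(A, B))
print(matvec(AB, x), matvec(A, matvec(B, x)))   # one product vs. two steps
```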

Why Order Matters: $AB \neq BA$

Function composition is not commutative in general. Some pairs do commute: "first rotate, then scale uniformly" gives the same result as "first scale uniformly, then rotate," because a uniform scaling is a multiple of the identity. But "first rotate, then shear" is genuinely different from "first shear, then rotate."

Let $R$ be a 90° rotation and $H$ be a horizontal shear with $k=1$:

$$R = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}, \quad H = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$$

Then:

$$RH = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & -1 \\ 1 & 1 \end{bmatrix}$$

$$HR = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 1 & -1 \\ 1 & 0 \end{bmatrix}$$

$RH \neq HR$. Applying shear then rotating is not the same as rotating then shearing. The unit square ends up in different positions depending on the order.
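The two products can be reproduced in a few lines. This sketch uses a hand-written `matmul` helper (an assumption, not a library call) and follows $\mathbf{e}_1$ through both orders.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

R = [[0, -1],
     [1, 0]]    # 90 degree counterclockwise rotation
H = [[1, 1],
     [0, 1]]    # horizontal shear with k = 1

RH = matmul(R, H)   # shear first, then rotate
HR = matmul(H, R)   # rotate first, then shear

print(RH)
print(HR)

e1 = [1, 0]
print(matvec(RH, e1), matvec(HR, e1))
```

Under $RH$ the shear leaves $\mathbf{e}_1$ alone and the rotation sends it to $[0, 1]^\top$; under $HR$ the rotation acts first and the shear then moves the result to $[1, 1]^\top$.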

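[Interactive figure: two panels comparing compositions of preset matrices $A$ and $B$. Left: $AB\mathbf{x} = A(B\mathbf{x})$, which applies $B$ first, then $A$. Right: $BA\mathbf{x} = B(A\mathbf{x})$, which applies $A$ first, then $B$.]

With the preset shown ($A$ = rotate 45°, $B$ = uniform scale 2×), both products are approximately $\begin{bmatrix} 1.41 & -1.41 \\ 1.41 & 1.41 \end{bmatrix}$: a uniform scaling commutes with a rotation, so the two squares coincide. For pairs that do not commute, such as a rotation and a shear, the two transformed squares land in different positions, and the order of composition matters.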

When writing $(AB)\mathbf{x}$, remember: $B$ acts first, $A$ acts second. The matrix on the right acts first. This right-to-left reading order trips up many newcomers; it is a consequence of function composition notation.
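The right-to-left convention can be tested with a rotation and a scaling. In the sketch below, `S_uniform` scales both axes by 2 and `S_stretch` (a hypothetical non-uniform scaling) stretches only the $x$-axis; the helper names are assumptions.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def close(A, B, tol=1e-12):
    """Entrywise comparison of two matrices up to a small tolerance."""
    return all(abs(a - b) <= tol for ra, rb in zip(A, B) for a, b in zip(ra, rb))

c, s = math.cos(math.radians(45)), math.sin(math.radians(45))
R = [[c, -s], [s, c]]
S_uniform = [[2.0, 0.0], [0.0, 2.0]]
S_stretch = [[2.0, 0.0], [0.0, 1.0]]

uniform_commutes = close(matmul(R, S_uniform), matmul(S_uniform, R))
stretch_commutes = close(matmul(R, S_stretch), matmul(S_stretch, R))
print(uniform_commutes, stretch_commutes)

# (R S_stretch) x: the right-hand matrix S_stretch acts first, then R.
x = [0.0, 1.0]
stretch_then_rotate = matvec(matmul(R, S_stretch), x)
rotate_then_stretch = matvec(matmul(S_stretch, R), x)
print(stretch_then_rotate, rotate_then_stretch)
```

The uniform scaling commutes with the rotation, while the non-uniform one does not: stretching $[0, 1]^\top$ along $x$ does nothing before the rotation, but doubles its $x$-component after it.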

Special Matrices

The Identity Matrix

The identity matrix $I$ is the linear map that changes nothing:

$$I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

$I\mathbf{x} = \mathbf{x}$ for every $\mathbf{x}$. The columns are exactly the standard basis vectors: $\mathbf{e}_1$ stays at $\mathbf{e}_1$ and $\mathbf{e}_2$ stays at $\mathbf{e}_2$. For any matrix $A$, we have $AI = IA = A$. The identity is the matrix analogue of multiplying by 1.

The Zero Matrix

The zero matrix $O$ sends every vector to the zero vector:

$$O = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}, \quad O\mathbf{x} = \mathbf{0}$$

Every column is zero: both basis vectors collapse to the origin. This is the most destructive linear map, since it loses all information. For any matrix $A$, we have $OA = AO = O$.
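Both special matrices can be verified against an arbitrary $2 \times 2$ example. The matrix `A`, the vector `x`, and the helper functions below are illustrative assumptions.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

I = [[1, 0], [0, 1]]   # identity
O = [[0, 0], [0, 0]]   # zero matrix
A = [[2, 1], [5, -3]]  # arbitrary example
x = [4, -1]

print(matvec(I, x), matvec(O, x))
print(matmul(A, I) == A, matmul(I, A) == A)
print(matmul(A, O) == O, matmul(O, A) == O)
```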

Putting It Together

The vocabulary of this post gives you the tools to read modern ML papers at a deeper level:

  • When a paper writes $W\mathbf{x}$, it means a linear map applied to $\mathbf{x}$: $W$ rotates, scales, and shears the input into a new space.
  • When attention computes $QK^\top$, it is taking a matrix product whose $(i, j)$ entry is the dot product of query $i$ with key $j$, which measures how well they align.
  • When a network stacks layers $f_3(f_2(f_1(\mathbf{x})))$, the learned weight matrices compose like $W_3 W_2 W_1$ (exactly so when no nonlinearity sits between the layers); the rightmost acts first.
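The last point can be sketched with three small weight matrices. The values are arbitrary, and the ReLU comparison is an added assumption meant only to show that the collapse into a single matrix relies on every layer being purely linear.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def relu(v):
    return [max(0, vi) for vi in v]

W1 = [[1, 2], [0, 1]]
W2 = [[0, -1], [1, 0]]
W3 = [[2, 0], [0, 3]]
x = [1, -1]

# Purely linear layers: apply W1, then W2, then W3.
layered = matvec(W3, matvec(W2, matvec(W1, x)))

# The same map as a single matrix W3 W2 W1 (rightmost acts first).
combined = matmul(W3, matmul(W2, W1))
collapsed = matvec(combined, x)

# With a nonlinearity between layers, the stack no longer matches that single matrix.
with_relu = matvec(W3, relu(matvec(W2, relu(matvec(W1, x)))))

print(layered, collapsed, with_relu)
```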

In the next post, we will ask: when can a linear map be undone? That is the question of invertibility, and answering it will lead us to determinants.

Cite this work


Roy, Swastik. (2026). Matrices as Linear Maps. S. Roy. https://swastikroy.me/blog/linear-algebra-matrices
