S. Roy

Blog Post

Positive Definite Matrices and the Geometry of Optimization

Positive definite matrices define 'bowl-shaped' quadratic forms with a unique minimum. They show up everywhere optimization problems have unique solutions — from least squares to neural network loss landscapes.

Views: 5 min readCite

In Post 10 we saw that a symmetric matrix's eigenvalues determine its behavior. When all eigenvalues are positive, something special happens: the matrix defines a bowl-shaped energy landscape with a unique minimum. This is the world of positive definite matrices.


The Definition

A symmetric matrix ARn×nA \in \mathbb{R}^{n \times n} is positive definite (PD) if:

xAx>0for all x0\mathbf{x}^\top A \mathbf{x} > 0 \quad \text{for all } \mathbf{x} \neq \mathbf{0}

The expression f(x)=xAxf(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x} is called a quadratic form. In 2D with A=(abbd)A = \begin{pmatrix} a & b \\ b & d \end{pmatrix}:

f(x,y)=ax2+2bxy+dy2f(x, y) = ax^2 + 2bxy + dy^2

This is a surface over the (x,y)(x, y) plane. Whether it's bowl-shaped, flat, or saddle-shaped depends entirely on AA.

Quadratic Form f(x,y) = ax² + 2bxy + dy²

Positive Definiteλ₁=2.31, λ₂=1.19, det=2.75
Blue = low, Red = high. White dot = minimum (PD only).

Equivalent Conditions

There are several equivalent ways to check if AA is positive definite:

1. Eigenvalue condition: All eigenvalues λi>0\lambda_i > 0.

From the spectral theorem A=QΛQA = Q\Lambda Q^\top, the quadratic form becomes: xAx=yΛy=iλiyi2\mathbf{x}^\top A \mathbf{x} = \mathbf{y}^\top \Lambda \mathbf{y} = \sum_i \lambda_i y_i^2 where y=Qx\mathbf{y} = Q^\top \mathbf{x}. This is positive iff all λi>0\lambda_i > 0.

2. Sylvester's criterion: All leading principal minors are positive.

For a 2×22 \times 2 matrix (abbd)\begin{pmatrix} a & b \\ b & d \end{pmatrix}:

  • a>0a > 0
  • adb2>0ad - b^2 > 0 (the determinant)

3. Cholesky factorization: A=LLA = L L^\top for some invertible lower-triangular LL.

If such an LL exists (with positive diagonal), AA is PD.

4. Square root: A=BBA = B^\top B for some invertible BB.

This follows from the spectral theorem: let B=Λ1/2QB = \Lambda^{1/2} Q^\top.


The Geometry of Quadratic Forms

The level sets of f(x)=xAx=cf(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x} = c are:

  • PD: ellipses centered at the origin — each axis aligned with an eigenvector, with length 1/λi1/\sqrt{\lambda_i}
  • PSD (some λi=0\lambda_i = 0): degenerate ellipses (flat in some directions)
  • Indefinite (mixed signs): hyperbolas — no minimum, but a saddle point

Level Sets and Eigenvectors

q₁ (λ=3)q₂ (λ=1)Positive Definite — Ellipse level sets

The condition number κ(A)=λmax/λmin\kappa(A) = \lambda_{\max} / \lambda_{\min} describes the eccentricity of the ellipse. When κ1\kappa \gg 1, the ellipse is very elongated — gradient descent has to bounce back and forth in the narrow direction, making optimization slow. This is why preconditioning matters.


Connection to Convex Optimization

In optimization, we often minimize a function f:RnRf: \mathbb{R}^n \to \mathbb{R}. The second derivative test generalizes to:

2f(x)0f is strictly convex at x\nabla^2 f(\mathbf{x}) \succ 0 \quad \Longrightarrow \quad f \text{ is strictly convex at } \mathbf{x}

where 2f\nabla^2 f is the Hessian matrix of second partial derivatives.

If 2f(x)0\nabla^2 f(\mathbf{x}) \succ 0 everywhere, then ff has a unique global minimum — and gradient descent converges to it.

For linear regression with loss f(w)=Xwy2f(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2: 2f=2XX\nabla^2 f = 2 X^\top X

XXX^\top X is always positive semidefinite, and positive definite if XX has full column rank. This is why the normal equations have a unique solution when XXX^\top X is invertible.

The Newton Step

Newton's method uses the Hessian directly: wt+1=wt(2f(wt))1f(wt)\mathbf{w}_{t+1} = \mathbf{w}_t - (\nabla^2 f(\mathbf{w}_t))^{-1} \nabla f(\mathbf{w}_t)

When the Hessian is PD, this step accounts for the curvature — it converges much faster than gradient descent. The Hessian being PD ensures the Newton step is a descent direction.


Cholesky Decomposition

For a PD matrix AA, the Cholesky decomposition computes A=LLA = LL^\top where LL is lower triangular with positive diagonal entries.

The algorithm proceeds column by column:

L11=A11L_{11} = \sqrt{A_{11}} Li1=Ai1L11for i>1L_{i1} = \frac{A_{i1}}{L_{11}} \quad \text{for } i > 1 Ljj=Ajjk=1j1Ljk2L_{jj} = \sqrt{A_{jj} - \sum_{k=1}^{j-1} L_{jk}^2} Lij=1Ljj(Aijk=1j1LikLjk)for i>jL_{ij} = \frac{1}{L_{jj}}\left(A_{ij} - \sum_{k=1}^{j-1} L_{ik} L_{jk}\right) \quad \text{for } i > j

If the algorithm hits a negative value under a square root, the matrix is not positive definite — this is how Cholesky serves as a PD test.

Cholesky Decomposition: A = LLᵀ

A4.00002.003.0001.000.502.00=L2.00001.001.4100.500.001.32×Lᵀ2.00000.001.4100.000.001.32
Hover cells to highlight. L is lower triangular. Zero entries in Lᵀ shown in gray.

Cholesky is roughly twice as fast as LU decomposition for PD matrices and is the standard solver for symmetric PD systems (normal equations, covariance matrix inversions, etc.).


Positive Semidefinite (PSD)

A matrix is positive semidefinite if xAx0\mathbf{x}^\top A \mathbf{x} \geq 0 (with equality possible for x0\mathbf{x} \neq 0).

PSD matrices have λi0\lambda_i \geq 0. They arise naturally as:

  • Covariance matrices: Σ=1nXX\Sigma = \frac{1}{n} X^\top X (PSD; PD if XX has full rank)
  • Gram matrices: Gij=xixjG_{ij} = \mathbf{x}_i \cdot \mathbf{x}_j (always PSD)
  • Kernel matrices: Kij=k(xi,xj)K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) for valid kernels

The set of n×nn \times n PSD matrices forms a convex cone — a closed, convex set in the space of matrices. This makes PSD constraints tractable in semidefinite programming (SDP).


Summary

ClassEigenvaluesQuadratic formGeometry
Positive definiteAll >0> 0Always >0> 0Elliptic bowl
Positive semidefiniteAll 0\geq 0Always 0\geq 0Flat bowl (degenerate)
IndefiniteMixed signsCan be <0< 0Saddle
Negative definiteAll <0< 0Always <0< 0Inverted bowl

Positive definiteness is the matrix condition that guarantees unique minima, stable decompositions, and well-conditioned optimization. Whenever a problem is easy to solve, there's usually a PD matrix hiding behind it.

In Post 12 we'll apply all this to neural networks — where weight matrices, attention patterns, and gradient updates are all just linear algebra in action.

Cite this work

Generated from article front matter.

Roy, Swastik. (2026). Positive Definite Matrices and the Geometry of Optimization. S. Roy. https://swastikroy.me/blog/linear-algebra-positive-definite

Export PDF opens your browser’s print dialog — choose “Save as PDF” for a Zenodo-ready file.