Blog Post
Positive Definite Matrices and the Geometry of Optimization
Positive definite matrices define 'bowl-shaped' quadratic forms with a unique minimum. They show up everywhere optimization problems have unique solutions — from least squares to neural network loss landscapes.
Views: –5 min readCite
In Post 10 we saw that a symmetric matrix's eigenvalues determine its behavior. When all eigenvalues are positive, something special happens: the matrix defines a bowl-shaped energy landscape with a unique minimum. This is the world of positive definite matrices.
The Definition
A symmetric matrix is positive definite (PD) if:
The expression is called a quadratic form. In 2D with :
This is a surface over the plane. Whether it's bowl-shaped, flat, or saddle-shaped depends entirely on .
Quadratic Form f(x,y) = ax² + 2bxy + dy²
Equivalent Conditions
There are several equivalent ways to check if is positive definite:
1. Eigenvalue condition: All eigenvalues .
From the spectral theorem , the quadratic form becomes: where . This is positive iff all .
2. Sylvester's criterion: All leading principal minors are positive.
For a matrix :
- (the determinant)
3. Cholesky factorization: for some invertible lower-triangular .
If such an exists (with positive diagonal), is PD.
4. Square root: for some invertible .
This follows from the spectral theorem: let .
The Geometry of Quadratic Forms
The level sets of are:
- PD: ellipses centered at the origin — each axis aligned with an eigenvector, with length
- PSD (some ): degenerate ellipses (flat in some directions)
- Indefinite (mixed signs): hyperbolas — no minimum, but a saddle point
Level Sets and Eigenvectors
The condition number describes the eccentricity of the ellipse. When , the ellipse is very elongated — gradient descent has to bounce back and forth in the narrow direction, making optimization slow. This is why preconditioning matters.
Connection to Convex Optimization
In optimization, we often minimize a function . The second derivative test generalizes to:
where is the Hessian matrix of second partial derivatives.
If everywhere, then has a unique global minimum — and gradient descent converges to it.
For linear regression with loss :
is always positive semidefinite, and positive definite if has full column rank. This is why the normal equations have a unique solution when is invertible.
The Newton Step
Newton's method uses the Hessian directly:
When the Hessian is PD, this step accounts for the curvature — it converges much faster than gradient descent. The Hessian being PD ensures the Newton step is a descent direction.
Cholesky Decomposition
For a PD matrix , the Cholesky decomposition computes where is lower triangular with positive diagonal entries.
The algorithm proceeds column by column:
If the algorithm hits a negative value under a square root, the matrix is not positive definite — this is how Cholesky serves as a PD test.
Cholesky Decomposition: A = LLᵀ
Cholesky is roughly twice as fast as LU decomposition for PD matrices and is the standard solver for symmetric PD systems (normal equations, covariance matrix inversions, etc.).
Positive Semidefinite (PSD)
A matrix is positive semidefinite if (with equality possible for ).
PSD matrices have . They arise naturally as:
- Covariance matrices: (PSD; PD if has full rank)
- Gram matrices: (always PSD)
- Kernel matrices: for valid kernels
The set of PSD matrices forms a convex cone — a closed, convex set in the space of matrices. This makes PSD constraints tractable in semidefinite programming (SDP).
Summary
| Class | Eigenvalues | Quadratic form | Geometry |
|---|---|---|---|
| Positive definite | All | Always | Elliptic bowl |
| Positive semidefinite | All | Always | Flat bowl (degenerate) |
| Indefinite | Mixed signs | Can be | Saddle |
| Negative definite | All | Always | Inverted bowl |
Positive definiteness is the matrix condition that guarantees unique minima, stable decompositions, and well-conditioned optimization. Whenever a problem is easy to solve, there's usually a PD matrix hiding behind it.
In Post 12 we'll apply all this to neural networks — where weight matrices, attention patterns, and gradient updates are all just linear algebra in action.