S. Roy

Blog Post

DDPM: The Diffusion Process, Forward and Reverse

DDPM defines a fixed forward process that gradually destroys an image into noise, then trains a neural network to reverse it. The math is tractable because each step is Gaussian.


The single noise scale from denoising score matching is enough to estimate a score, but it is not enough to sample from scratch: a denoiser trained at one $\sigma$ knows how to clean up images at that one noise level, whether they are nearly clean or nearly pure noise, but not how to walk the whole distance between the two. DDPM (Ho et al., 2020) fixes this by stringing together a thousand noise scales into a single process: a fixed, untrainable forward process that destroys an image one small step at a time, and a learned reverse process that rebuilds it.

The forward process is a Markov chain that, at each step, scales the current image down slightly and adds a little Gaussian noise. The amount added at step $t$ is set by a schedule $\beta_1, \dots, \beta_T$.

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t I\big)$$

The $\sqrt{1 - \beta_t}$ scaling is what keeps the chain from exploding: it shrinks the signal by exactly the amount needed so that the variance stays bounded as noise accumulates (a unit-variance input stays unit-variance, since $(1 - \beta_t) + \beta_t = 1$), and after $T = 1000$ steps the image has been ground down into something practically indistinguishable from a draw from $\mathcal{N}(0, I)$.
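The variance argument can be checked numerically. The sketch below is a minimal illustration in plain Python, not anyone's reference implementation: it assumes the linear schedule from $10^{-4}$ to $0.02$ that Ho et al. use, and the helper names `linear_betas` and `track_forward_moments` are made up for this example. It tracks only two scalars through the chain, the coefficient on the original signal and the per-pixel variance.

```python
def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def track_forward_moments(betas, var0):
    """Push the signal scale and per-pixel variance through the forward chain.

    If x_{t-1} has variance v, then
        x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * noise
    has variance (1 - beta) * v + beta, and the coefficient on x_0
    shrinks by a factor sqrt(1 - beta).
    """
    signal_scale, var = 1.0, var0
    for beta in betas:
        signal_scale *= (1.0 - beta) ** 0.5
        var = (1.0 - beta) * var + beta
    return signal_scale, var


betas = linear_betas()
for var0 in (0.1, 1.0, 5.0):
    scale, var = track_forward_moments(betas, var0)
    print(f"start var {var0}: final var {var:.5f}, remaining signal scale {scale:.5f}")
```

Whatever variance the data starts with, the recursion pulls it toward 1 while the signal coefficient decays toward 0, which is the sense in which the chain ends near $\mathcal{N}(0, I)$.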

Running that chain one step at a time during training would be ruinously slow, so the crucial move is to collapse it. Because each step is a linear-Gaussian map, the composition of $t$ of them is itself Gaussian, and defining $\alpha_t = 1 - \beta_t$ together with the cumulative product $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$ gives the marginal of $x_t$ given the original $x_0$ in closed form.

$$q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\, \sqrt{\bar\alpha_t}\, x_0,\; (1 - \bar\alpha_t) I\big)$$

This says you can jump to any timestep in a single shot (sample noise once and form $x_t$ directly from $x_0$) without simulating the $t - 1$ intervening steps.

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

That one identity is what makes training tractable at all: each gradient step picks a random $t$, builds $x_t$ from a clean image and a single noise draw, and never touches the chain.
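As a rough sketch of the one-shot jump, the plain-Python snippet below composes the single-step Gaussians and compares the result with the closed form, then draws one $x_t$ directly. The linear schedule, the function names, and the three-pixel toy "image" are all assumptions made for illustration; a real implementation would use an array library.

```python
import math
import random


def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (assumed schedule)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cumulative_alpha_bar(betas):
    """alpha_bar[t - 1] = prod_{s <= t} (1 - beta_s), for t = 1..T."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def chain_moments(betas, t):
    """Mean coefficient and variance of x_t given x_0, composing t single steps."""
    coef, var = 1.0, 0.0
    for beta in betas[:t]:
        coef *= math.sqrt(1.0 - beta)
        var = (1.0 - beta) * var + beta
    return coef, var


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed. Returns (x_t, eps)."""
    a_bar = alpha_bar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps


betas = linear_betas()
alpha_bar = cumulative_alpha_bar(betas)

t = 300
coef, var = chain_moments(betas, t)
print(coef, math.sqrt(alpha_bar[t - 1]))  # mean coefficients: chain vs closed form
print(var, 1.0 - alpha_bar[t - 1])        # variances: chain vs closed form

rng = random.Random(0)
x0 = [0.5, -1.0, 0.25]                    # hypothetical three-pixel image
x_t, eps = q_sample(x0, t, alpha_bar, rng)
```

Returning `eps` alongside `x_t` is a convenience: the training objective needs the exact noise draw that produced each noisy sample.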

The reverse process is the one the model has to learn, because going backward (denoising) is not free. It is also a Markov chain, parameterized as a sequence of Gaussians whose means, and in the general form covariances, come from the network.

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\big(x_{t-1};\, \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\big)$$

The reason a Gaussian is even the right form for the reverse step is that, in the limit of small $\beta_t$, the true reverse conditional is itself approximately Gaussian, so the network is not forced into a mismatched family. Ho et al. go further and fix the covariance to a per-timestep constant $\sigma_t^2 I$ rather than learning it, so the network is only asked to find the right mean.

The target for that mean comes from a quantity that is tractable: the forward posterior $q(x_{t-1} \mid x_t, x_0)$, the distribution over the previous state given both the current state and the original image. Conditioning on $x_0$ turns the otherwise-unknown reverse step into a Gaussian you can write down exactly, with a mean $\tilde\mu_t(x_t, x_0)$ that is a weighted combination of $x_t$ and $x_0$. The network's job is to match this posterior mean, but at sampling time it does not have $x_0$, so it must predict whatever is needed to reconstruct the mean from $x_t$ alone.
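For reference, the closed form of that posterior, as given in Ho et al. (2020) and rewritten in this post's notation, is below. It uses the convention $\bar\alpha_0 = 1$, which this post has not stated and is assumed here.

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\big(x_{t-1};\, \tilde\mu_t(x_t, x_0),\; \tilde\beta_t I\big)$$

$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\, \beta_t}{1 - \bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\, (1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\, x_t, \qquad \tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\, \beta_t$$

Both weights in $\tilde\mu_t$ are known from the schedule alone, so the only unknown at sampling time is $x_0$.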

Here DDPM makes the choice that defines it. Rather than predicting $x_0$ or the mean directly, the network predicts the noise $\epsilon_\theta(x_t, t)$ that was added to produce $x_t$, and because $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$, knowing $\epsilon$ is equivalent to knowing $x_0$. Substituting the predicted noise into the posterior mean gives the reverse-step mean in terms of $\epsilon_\theta$.

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t)\right)$$
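The substitution can be sanity-checked numerically. The sketch below assumes the standard closed form of the forward posterior mean from Ho et al. (with $\bar\alpha_0 = 1$), a short ten-step linear schedule, and a single scalar "pixel"; the function names are invented for this example. When the noise passed in is the true noise, the two expressions for the mean should agree.

```python
import math


def posterior_mean(x_t, x0, t, betas, alpha_bar):
    """Forward-posterior mean mu_tilde_t(x_t, x_0); t is 1-indexed, alpha_bar_0 := 1."""
    beta, a_bar = betas[t - 1], alpha_bar[t - 1]
    a_bar_prev = alpha_bar[t - 2] if t > 1 else 1.0
    coef_x0 = math.sqrt(a_bar_prev) * beta / (1.0 - a_bar)
    coef_xt = math.sqrt(1.0 - beta) * (1.0 - a_bar_prev) / (1.0 - a_bar)
    return coef_x0 * x0 + coef_xt * x_t


def mean_from_noise(x_t, eps, t, betas, alpha_bar):
    """Reverse-step mean written in terms of the (predicted) noise."""
    beta, a_bar = betas[t - 1], alpha_bar[t - 1]
    return (x_t - beta / math.sqrt(1.0 - a_bar) * eps) / math.sqrt(1.0 - beta)


# Toy setup (assumed): a short linear schedule and one scalar pixel.
T = 10
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
alpha_bar, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bar.append(running)

x0, eps, t = 0.7, -1.3, 6
a_bar = alpha_bar[t - 1]
x_t = math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps

# Knowing eps recovers x0 ...
x0_from_eps = (x_t - math.sqrt(1.0 - a_bar) * eps) / math.sqrt(a_bar)
print(x0_from_eps, x0)

# ... and the two forms of the mean coincide when eps is the true noise.
print(posterior_mean(x_t, x0, t, betas, alpha_bar),
      mean_from_noise(x_t, eps, t, betas, alpha_bar))
```

With a trained network the noise is only an estimate, so the second expression becomes an approximation to the posterior mean rather than an identity.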

The variational bound that nominally governs training weights each timestep differently, but Ho et al. found that throwing those weights away, and simply regressing the predicted noise onto the true noise with a plain squared error, gives better samples in practice and could not be simpler.

$$L_\text{simple} = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\, \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2 \,\right]$$
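A minimal sketch of how one Monte Carlo estimate of this objective might be formed is below. It is plain Python over lists of floats, with no gradients and no real network: `eps_model` is a hypothetical callable standing in for the U-Net, `zero_model` is a deliberately useless placeholder, and the linear schedule is assumed.

```python
import math
import random


def simple_loss_term(x0, t, alpha_bar, eps_model, rng):
    """One sample of the squared error for a clean image x0 at timestep t (1-indexed)."""
    a_bar = alpha_bar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]                 # the true noise
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, eps)]                        # one-shot forward jump
    eps_hat = eps_model(x_t, t)                             # predicted noise
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))


def training_loss(batch, alpha_bar, eps_model, rng):
    """Average the squared error over a batch, drawing a uniform t per image."""
    T = len(alpha_bar)
    total = 0.0
    for x0 in batch:
        t = rng.randint(1, T)                               # uniform over 1..T inclusive
        total += simple_loss_term(x0, t, alpha_bar, eps_model, rng)
    return total / len(batch)


def zero_model(x_t, t):
    """Hypothetical stand-in for the U-Net: always predicts zero noise."""
    return [0.0 for _ in x_t]


# Assumed linear schedule, T = 1000.
T = 1000
alpha_bar, running = [], 1.0
for i in range(T):
    running *= 1.0 - (1e-4 + i * (0.02 - 1e-4) / (T - 1))
    alpha_bar.append(running)

rng = random.Random(0)
batch = [[0.5, -1.0, 0.25], [0.0, 0.3, -0.7]]               # toy three-pixel images
print(training_loss(batch, alpha_bar, zero_model, rng))
```

In an actual training loop this scalar would be differentiated with respect to the network's parameters; the structure of the estimate (random $t$, one noise draw, squared error on the noise) is the part the sketch is meant to show.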

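This is denoising score matching from Part 1, now indexed by $t$ instead of a single $\sigma$: at every noise level the network learns to name the noise, which is the same as learning the score at that level up to a known scale, since $\nabla_{x_t} \log q(x_t \mid x_0) = -\epsilon / \sqrt{1 - \bar\alpha_t}$.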

What computes $\epsilon_\theta$ is a U-Net. The architecture matters because denoising is a problem with two scales at once: the noise to remove is high-frequency and local, but deciding what the underlying image is requires global context, and a U-Net's encoder–decoder structure with skip connections is built to carry both. The contracting path summarizes the image into coarse semantics, the expanding path reconstructs spatial detail, and the skips splice fine resolution back in so it is not lost in the bottleneck. Ho et al. add residual blocks throughout and self-attention at the lower resolutions, where the feature maps are small enough to afford it and global interactions matter most.

The timestep $t$ is turned into a sinusoidal embedding, projected through a small MLP, and added into every residual block, so the same weights behave differently depending on how much noise they are being asked to remove. (The adaptive group normalization conditioning often associated with diffusion U-Nets is not from DDPM; it was introduced a year later by Dhariwal and Nichol (2021), and DDPM's timestep conditioning is the simpler additive embedding.)
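The sinusoidal embedding of the timestep can be sketched in a few lines. The version below follows the common Transformer-style recipe; the `max_period` of 10000, the exact frequency spacing, and the sin-then-cos ordering are assumptions here and vary between implementations, so treat it as an illustration rather than the DDPM code.

```python
import math


def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep t into a vector of even length dim."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies from 1 down toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


emb_early = timestep_embedding(10, dim=8)
emb_late = timestep_embedding(900, dim=8)
print(emb_early)
print(emb_late)
```

In the network this vector would then pass through the small MLP and be added to the feature maps of each residual block; that part is omitted here.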

The schedule $\beta_1, \dots, \beta_T$ is the last design choice, and it is not innocuous. Ho et al. use a linear schedule, ramping $\beta$ from $10^{-4}$ to $0.02$ over the thousand steps. Nichol and Dhariwal (2021) later showed this destroys the signal too aggressively at the start: by the middle of the chain the image is already almost pure noise, so the late timesteps carry little information and the model wastes capacity on them. Their cosine schedule fixes the pacing by defining the cumulative $\bar\alpha_t$ directly as a slow cosine decay.

$$\bar\alpha_t = \frac{\cos^2\!\big(\tfrac{t/T + 0.008}{1.008} \cdot \tfrac{\pi}{2}\big)}{\cos^2\!\big(\tfrac{0.008}{1.008} \cdot \tfrac{\pi}{2}\big)}$$

Keeping more signal alive through the middle of the process gives the network a smoother sequence of denoising problems and measurably better samples, and most schedules since have been variations on this shape.
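To make the difference in pacing concrete, the sketch below computes $\bar\alpha_t$ under both schedules and compares them at the midpoint of the chain. The function names are illustrative. Recovering $\beta_t = 1 - \bar\alpha_t / \bar\alpha_{t-1}$ and clipping it at 0.999 follows Nichol and Dhariwal's description; everything else is a plain-Python assumption.

```python
import math


def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) under the linear schedule."""
    out, running = [], 1.0
    for i in range(T):
        beta = beta_start + i * (beta_end - beta_start) / (T - 1)
        running *= 1.0 - beta
        out.append(running)
    return out


def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar_t defined directly by the cosine formula, for t = 1..T."""
    def f(t):
        return math.cos((t / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]


def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Recover per-step betas from alpha_bar, clipped to avoid a singular last step."""
    betas, prev = [], 1.0
    for a_bar in alpha_bar:
        betas.append(min(1.0 - a_bar / prev, max_beta))
        prev = a_bar
    return betas


lin = linear_alpha_bar()
cos = cosine_alpha_bar()
mid = 500
print(f"alpha_bar at t={mid}: linear {lin[mid - 1]:.4f}, cosine {cos[mid - 1]:.4f}")

cos_betas = betas_from_alpha_bar(cos)
print(f"largest cosine beta after clipping: {max(cos_betas):.3f}")
```

Under these assumptions the cosine schedule's $\bar\alpha_t$ at the midpoint comes out several times larger than the linear one's, which is the "more signal alive through the middle" effect described above.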

The result is a model that generates beautifully and samples miserably. Drawing one image means starting from $x_T \sim \mathcal{N}(0, I)$ and running the reverse chain all the way down, one learned Gaussian step at a time, which is one full U-Net forward pass for each of the thousand steps. For a 256×256 image that can run to minutes per sample even on a good GPU, and the cost is structural: it comes from the chain being Markovian, where each step is only allowed to look one step back. The next part breaks exactly that assumption and produces comparable images in around fifty steps instead of a thousand. That is DDIM.
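The DDPM sampling loop described above can be sketched as follows, mainly to show where the cost comes from. This is a plain-Python illustration under several assumptions: `eps_model` is a hypothetical callable standing in for the trained U-Net, `CountingZeroModel` is a dummy that only counts calls, the reverse variance is set to $\sigma_t^2 = \beta_t$ (one of the fixed choices Ho et al. consider), and the demo uses a short 50-step linear schedule.

```python
import math
import random


def ddpm_sample(eps_model, betas, dim, rng):
    """Ancestral sampling: start from pure noise and apply T learned reverse steps."""
    T = len(betas)
    alpha_bar, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        alpha_bar.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]            # x_T ~ N(0, I)
    for t in range(T, 0, -1):                                 # t = T, ..., 1
        beta, a_bar = betas[t - 1], alpha_bar[t - 1]
        eps_hat = eps_model(x, t)                             # one network call per step
        mean = [(xi - beta / math.sqrt(1.0 - a_bar) * e) / math.sqrt(1.0 - beta)
                for xi, e in zip(x, eps_hat)]
        if t > 1:
            sigma = math.sqrt(beta)                           # assumed sigma_t^2 = beta_t
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean                                          # no noise on the final step
    return x


class CountingZeroModel:
    """Hypothetical stand-in for the U-Net: predicts zero noise and counts calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x_t, t):
        self.calls += 1
        return [0.0] * len(x_t)


T = 50
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
model = CountingZeroModel()
sample = ddpm_sample(model, betas, dim=4, rng=random.Random(0))
print(f"network calls: {model.calls}, sample length: {len(sample)}")
```

The loop cannot be parallelized across timesteps because each step consumes the previous step's output, so the number of network calls equals the number of steps in the schedule.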

Cite this work


Roy, Swastik. (2024). DDPM: The Diffusion Process, Forward and Reverse. S. Roy. https://swastikroy.me/blog/diffusion-ddpm
