DDIM: Deterministic Sampling in Fewer Steps
DDPM typically needs 1000 steps to generate a sample. DDIM reframes the reverse process as an ODE and gets comparable quality in about 50. The model weights are identical; only the sampling procedure changes.
The thousand-step sampling cost of DDPM looks like a property of the trained model, but it is not: it is a property of how that model is used at inference. DDIM (Song et al., 2021) gets images of comparable quality in a twentieth of the steps without retraining a single weight. The opening for this is hiding in plain sight in the training objective.
Look at what the loss actually depended on. The simplified DDPM objective only ever saw pairs $(x_t, \epsilon)$, built from the closed-form marginal $q(x_t \mid x_0)$.
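For reference, here is that objective written out in the usual DDPM notation. The notation is an assumption on my part: $\beta_t$ is the per-step noise variance and $\bar\alpha_t = \prod_{s \le t} (1 - \beta_s)$ is the cumulative signal level.

```latex
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right)
\quad\Longleftrightarrow\quad
x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}
\left[\, \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \,\right]
```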
Nowhere in that expectation does the Markov chain appear — only the marginals do. The trained network knows how to denoise at every level, but the reverse process that strings those denoisings together is a separate choice, made at inference time, and DDPM's particular Markovian reverse chain is just one option among many.
DDIM exploits this by constructing a different, non-Markovian forward process. It defines a family of joint distributions $q_\sigma(x_{1:T} \mid x_0)$, indexed by a vector of parameters $\sigma = (\sigma_1, \dots, \sigma_T)$, every one of which is engineered to have exactly the same marginals $q(x_t \mid x_0)$ as DDPM. Identical marginals are the whole point: since the network was trained only against those marginals, the same weights are optimal for the entire family, and $\sigma$ becomes a free dial that controls how much stochasticity the reverse process injects at each step, without invalidating the training.
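The matching-marginals claim can be checked numerically. The sketch below assumes the conditional $q_\sigma(x_{t-1} \mid x_t, x_0)$ as given in the DDIM paper, scalar data, and made-up values for the two cumulative signal levels; the function name and all constants are illustrative, not taken from any library.

```python
import numpy as np

def sample_prev_given_xt_x0(x_t, x0, abar_t, abar_prev, sigma, rng):
    """Draw x_{t-1} from the non-Markovian conditional q_sigma(x_{t-1} | x_t, x_0)."""
    # Noise direction implied by x_t and x_0 under the marginal q(x_t | x_0).
    direction = (x_t - np.sqrt(abar_t) * x0) / np.sqrt(1.0 - abar_t)
    mean = np.sqrt(abar_prev) * x0 + np.sqrt(1.0 - abar_prev - sigma**2) * direction
    return mean + sigma * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
x0 = 1.5                          # a single scalar "image" (illustrative)
abar_t, abar_prev = 0.5, 0.7      # made-up cumulative signal levels at t and t-1
n = 200_000

# Start from the DDPM marginal at step t.
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * rng.standard_normal(n)

results = {}
for sigma in (0.0, 0.2, 0.4):     # any sigma with sigma**2 <= 1 - abar_prev
    x_prev = sample_prev_given_xt_x0(x_t, x0, abar_t, abar_prev, sigma, rng)
    results[sigma] = (x_prev.mean(), x_prev.var())
    print(f"sigma={sigma:.1f}  mean={x_prev.mean():.3f}  var={x_prev.var():.3f}")

# Every sigma should land on the same DDPM marginal at step t-1.
print("target mean:", np.sqrt(abar_prev) * x0, " target var:", 1.0 - abar_prev)
```

Whatever value of `sigma` is chosen, the empirical mean and variance of `x_prev` should agree with the DDPM marginal at the earlier step, up to Monte Carlo error.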
Turning the dial produces a concrete reverse step. At each step the network predicts the noise $\epsilon_\theta(x_t, t)$, from which a prediction of the clean image $\hat{x}_0$ follows immediately by inverting the marginal; DDIM then re-noises that prediction toward the next timestep and, optionally, adds fresh randomness scaled by $\sigma_t$.
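Written out, the update from the DDIM paper has the following shape. The notation is assumed rather than taken from this post: $\bar\alpha_t$ is the cumulative signal level from the DDPM marginal, and $\epsilon_t \sim \mathcal{N}(0, I)$ is fresh noise.

```latex
x_{t-1} =
\sqrt{\bar\alpha_{t-1}}
\underbrace{\left(\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\right)}_{\text{predicted } x_0}
\;+\;
\underbrace{\sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\;\epsilon_\theta(x_t, t)}_{\text{direction pointing to } x_t}
\;+\;
\underbrace{\sigma_t\,\epsilon_t}_{\text{random noise}}
```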
The three terms read left to right as a recipe: estimate where the clean image is, point back toward $x_t$ along the predicted noise direction, and sprinkle in $\sigma_t$ of new noise. Choosing $\sigma_t$ to match DDPM's posterior variance recovers the original stochastic sampler exactly, so DDIM is a strict generalization, not a different model.
Set $\sigma_t = 0$ for every step and the last term vanishes, leaving a sampler with no randomness at all: $x_{t-1}$ is a deterministic function of $x_t$ and the network's output. This is the DDIM sampler proper, and determinism changes the character of the process. There is no longer a stochastic walk that explores; there is a fixed trajectory, computed by a function, from a noise sample to an image.
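A single reverse step can be sketched in a few lines of NumPy. This is an illustration under assumptions, not a reference implementation: `ddim_step` is a hypothetical helper, the noise prediction is a random stand-in rather than a real network output, and `eta` follows the DDIM paper's convention of scaling $\sigma_t$ between 0 (deterministic) and 1 (DDPM posterior standard deviation).

```python
import numpy as np

def ddim_step(x_t, eps, abar_t, abar_prev, eta=0.0, rng=None):
    """One reverse step x_t -> x_{t-1}.

    eps is the noise prediction at (x_t, t); abar_t and abar_prev are the
    cumulative signal levels at t and t-1. eta=0 gives deterministic DDIM,
    eta=1 uses the DDPM posterior standard deviation.
    """
    sigma = eta * np.sqrt((1.0 - abar_prev) / (1.0 - abar_t)
                          * (1.0 - abar_t / abar_prev))
    # Term 1: clean-image estimate, obtained by inverting the marginal.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
    # Term 2: step back along the predicted noise direction.
    direction = np.sqrt(1.0 - abar_prev - sigma**2) * eps
    # Term 3: optional fresh noise, absent when sigma is zero.
    noise = 0.0
    if sigma > 0:
        rng = np.random.default_rng() if rng is None else rng
        noise = sigma * rng.standard_normal(np.shape(x_t))
    return np.sqrt(abar_prev) * x0_pred + direction + noise

rng = np.random.default_rng(0)
x_t = rng.standard_normal(4)
eps = rng.standard_normal(4)      # stand-in for a network output
abar_t, abar_prev = 0.5, 0.7      # made-up signal levels

a = ddim_step(x_t, eps, abar_t, abar_prev, eta=0.0)
b = ddim_step(x_t, eps, abar_t, abar_prev, eta=0.0)
c = ddim_step(x_t, eps, abar_t, abar_prev, eta=1.0, rng=np.random.default_rng(1))
d = ddim_step(x_t, eps, abar_t, abar_prev, eta=1.0, rng=np.random.default_rng(2))
print("eta=0 repeatable:", np.array_equal(a, b))
print("eta=1 repeatable across seeds:", np.array_equal(c, d))
```

With `eta=0.0` two calls on the same inputs agree exactly, while with `eta=1.0` the result depends on the random generator passed in.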
That trajectory has a clean continuous-time identity. The deterministic DDIM update can be read as an Euler-type discretization of the probability flow ODE, and the two coincide as the step size shrinks. That ODE is the ordinary differential equation whose solutions carry the same time-evolving marginals as the stochastic diffusion, but along smooth, non-random paths. (The probability-flow ODE and the broader score-SDE view of diffusion come from a separate, concurrent Song et al. paper, Score-Based Generative Modeling through SDEs, distinct from the DDIM paper above.)
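For the variance-preserving diffusion that DDPM discretizes, the probability flow ODE takes the form below. The continuous-time notation is an assumption here: $\beta(t)$ is the noise rate, $p_t$ is the marginal density at time $t$, and $\bar\alpha_t$ is the cumulative signal level.

```latex
\frac{\mathrm{d}x}{\mathrm{d}t}
= -\tfrac{1}{2}\,\beta(t)\,\bigl[\,x + \nabla_x \log p_t(x)\,\bigr],
\qquad
\nabla_{x_t} \log p_t(x_t) \;\approx\; -\,\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}}
```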
The score $\nabla_x \log p_t(x)$ is what the network estimates, up to a known rescaling of its noise prediction. So the trained denoiser is the right-hand side of an ODE that maps noise to data, and sampling is numerical integration of that ODE from $t = T$ down to $t = 0$.
Reading sampling as ODE integration is what unlocks the speed. A smooth ODE does not need a thousand tiny steps to integrate accurately; it needs enough steps to track the curvature of its trajectory. Because the probability-flow trajectories are smooth, you can take a coarse subset of timesteps, say 50 or 20 out of the original 1000, and skip the rest, treating each retained step as a larger integration stride. At 50 steps the samples are close to indistinguishable from full 1000-step DDPM; at 20 there is visible degradation, but it is often an acceptable trade. Coarse integration of this kind is what nearly every production diffusion system runs at inference. The same network, integrated more coarsely, is the entire saving.
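The timestep-skipping recipe can be sketched with a toy model. Everything here is an assumption made for illustration: a linear beta schedule, one-dimensional Gaussian data for which the ideal noise prediction has a closed form (`toy_eps` stands in for a trained network), and evenly spaced timestep subsets. Real samplers differ in schedule and spacing details.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear schedule, DDPM-style
abar = np.cumprod(1.0 - betas)           # cumulative signal level for t = 0..T-1

MU, S = 2.0, 1.0   # toy data distribution N(MU, S^2), chosen for illustration

def toy_eps(x, a):
    """Ideal noise prediction when the data is N(MU, S^2); replaces a trained network."""
    return np.sqrt(1.0 - a) * (x - np.sqrt(a) * MU) / (a * S**2 + 1.0 - a)

def ddim_sample(x_T, num_steps):
    """Deterministic DDIM (sigma = 0) over an evenly spaced subset of the T timesteps."""
    steps = np.linspace(0, T - 1, num_steps).round().astype(int)[::-1]
    x = x_T
    for i, t in enumerate(steps):
        a_t = abar[t]
        # The final step targets the clean-data level, where the signal level is 1.
        a_prev = abar[steps[i + 1]] if i + 1 < len(steps) else 1.0
        eps = toy_eps(x, a_t)
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x

x_T = np.random.default_rng(0).standard_normal(5)
x_1000 = ddim_sample(x_T, 1000)
x_50 = ddim_sample(x_T, 50)
x_20 = ddim_sample(x_T, 20)
err_50 = np.abs(x_50 - x_1000).max()
err_20 = np.abs(x_20 - x_1000).max()
print("max |x_50 - x_1000|:", err_50)
print("max |x_20 - x_1000|:", err_20)
```

In this toy setting the 50-step result stays close to the 1000-step one and the 20-step result drifts further. A one-dimensional Gaussian says nothing about perceptual image quality, so treat it only as a picture of how stride size trades against integration error.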
Determinism buys a second thing that stochastic sampling cannot give: a stable correspondence between noise and image. With $\sigma_t = 0$ the map from the initial $x_T$ to the final $x_0$ is a fixed function, so the same initial noise always yields the same picture, and, more usefully, nearby noise vectors yield related pictures. Linearly interpolating between two noise vectors and decoding each produces a smooth morph between the two images: semantic interpolation done entirely in noise space. And because the ODE runs in both directions, you can integrate it forward from a real image to recover, approximately, the noise that would generate it. This is DDIM inversion, the entry point for editing a given image by manipulating its latent code rather than generating from scratch.
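Inversion and interpolation can both be sketched on the same kind of toy model. The assumptions are again loud ones: Gaussian toy data with a closed-form noise predictor in place of a network, a linear beta schedule, and hypothetical helper names (`move`, `decode`, `invert`). The inversion shown is the common approximation that reuses the noise prediction from the current level when stepping to the noisier one, so the round trip is close but not exact.

```python
import numpy as np

T = 1000
abar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))   # linear DDPM-style schedule
MU, S = 2.0, 1.0                                      # toy data N(MU, S^2), illustrative

def toy_eps(x, a):
    """Ideal noise prediction for Gaussian data; stands in for a trained network."""
    return np.sqrt(1.0 - a) * (x - np.sqrt(a) * MU) / (a * S**2 + 1.0 - a)

def move(x, a_from, a_to):
    """Deterministic DDIM update between two signal levels, usable in either direction."""
    eps = toy_eps(x, a_from)
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

# Signal levels from clean data (level 1.0) to the noisiest timestep.
levels = np.concatenate(([1.0], abar))

def decode(x_T):
    """Noise -> data: walk the levels from noisiest to clean."""
    x = x_T
    for i in range(len(levels) - 1, 0, -1):
        x = move(x, levels[i], levels[i - 1])
    return x

def invert(x_0):
    """Data -> noise: walk the same levels in the opposite direction."""
    x = x_0
    for i in range(len(levels) - 1):
        x = move(x, levels[i], levels[i + 1])
    return x

x_real = np.array([1.0, 2.5, 3.2])       # pretend these are real data points
latent = invert(x_real)                  # approximate noise code for x_real
x_rec = decode(latent)
print("max reconstruction error:", np.abs(x_rec - x_real).max())

# Interpolate between two noise vectors and decode each point on the path.
z_a, z_b = np.random.default_rng(0).standard_normal((2, 3))
weights = np.linspace(0.0, 1.0, 5)
path = [decode((1.0 - w) * z_a + w * z_b) for w in weights]
```

Because this toy decoder is an affine map, the decoded path is itself a straight line between the two endpoints. With a real network the map is nonlinear, which is where the semantic morphing comes from.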
Fifty steps is a large improvement over a thousand, but every one of those steps still runs a U-Net over the full-resolution image, and at 512×512 that resolution is itself the dominant cost. The next part attacks the pixels rather than the steps: compress the image into a small latent space first, run the entire diffusion process there, and decode only at the end — that is latent diffusion.