Classifier-Free Guidance: Steering Diffusion with a Signal
Conditioning a diffusion model on text gives you text-to-image generation. Classifier-free guidance makes that conditioning much stronger — at the cost of some diversity.
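As a rough illustration of how the guidance is applied at sampling time, the sketch below combines a conditional and an unconditional noise prediction. The function name `cfg_combine` and the convention that a scale of 1.0 recovers the plain conditional prediction are assumptions made for this example; some papers parameterize the scale differently.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one.

    guidance_scale = 0.0 gives the unconditional prediction,
    guidance_scale = 1.0 gives the conditional prediction,
    guidance_scale > 1.0 pushes past it (stronger conditioning, less diversity).
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy two-element "noise predictions" standing in for full image tensors.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]
guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=3.0)
```

In a real sampler the two predictions would come from the same network, called once with the text prompt and once with an empty prompt.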
DDPM typically needs about 1000 steps to generate a sample. DDIM makes the reverse process deterministic, which can be read as discretizing an ODE, and reaches comparable quality in roughly 50 steps. The model weights are identical; only the sampling procedure changes.
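A minimal sketch of one deterministic DDIM update (the eta = 0 case), written for scalars to keep it readable. The names `ddim_step`, `alpha_bar_t`, and `alpha_bar_prev` are chosen for this example, and the evenly strided timestep list is just one common way to pick 50 of the 1000 training timesteps.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update from timestep t to an earlier timestep."""
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-noise that estimate to the earlier (less noisy) timestep.
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


# Visit every 20th training timestep: 50 sampling steps instead of 1000.
timesteps = list(range(999, -1, -20))
```

Because the update only needs the cumulative noise level at the two timesteps, the same trained noise-prediction network can be reused without retraining.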
DDPM defines a fixed forward process that gradually destroys an image into noise, then trains a neural network to reverse it. The math is tractable because each step is Gaussian.
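The Gaussian structure means a noisy sample at any timestep can be drawn in one shot rather than by iterating. The sketch below assumes a linear beta schedule from 1e-4 to 0.02 over 1000 steps, a common choice; the helper names are illustrative.

```python
import math
import random


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t given x_0 directly: sqrt(abar) * x0 + sqrt(1 - abar) * eps."""
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps


alpha_bars = cumulative_alphas(linear_betas())
x_t, eps = q_sample(x0=1.0, t=500, alpha_bars=alpha_bars, rng=random.Random(0))
```

The network is then trained to predict `eps` from `x_t` and `t`, which is what makes the reverse process learnable.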
DDPM, DDIM, and latent diffusion all use a U-Net backbone. DiT replaces it with a transformer and finds that sample quality improves steadily as model size and compute grow, much as it does for language models.
Diffusion models learn to reverse a specific noise process. Flow matching learns to transport any source distribution to any target distribution, commonly along straight paths, which tends to give simpler math, faster sampling, and a cleaner training signal.
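For the straight-path case, the training target is easy to write down. This is a sketch under the assumption of linear interpolation between a source sample `x0` and a target sample `x1`; the function names are made up for the example.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity along a straight path is constant: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(v_pred, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, target)) / len(target)


x0 = [0.0, 2.0]   # sample from the source distribution (e.g. noise)
x1 = [4.0, 2.0]   # sample from the target distribution (e.g. data)
x_t = interpolate(x0, x1, t=0.25)
```

A model trained this way takes `x_t` and `t` as input and regresses the velocity; sampling then integrates that velocity field from the source to the target.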
Running DDPM in pixel space at 512×512 is expensive. Latent diffusion first compresses the image into a small latent space, runs the diffusion process there, and decodes the result back to pixels. The quality is comparable at a fraction of the compute.
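To get a feel for the saving, the arithmetic below compares how many values the denoiser processes per step. The 8× spatial downsampling and 4 latent channels are assumptions taken from a common latent diffusion configuration, not from the text above.

```python
def num_values(height, width, channels):
    """Number of scalar values in an image or latent tensor."""
    return height * width * channels


pixel_values = num_values(512, 512, 3)

downsample = 8        # assumed autoencoder downsampling factor
latent_channels = 4   # assumed number of latent channels
latent_values = num_values(512 // downsample, 512 // downsample, latent_channels)

ratio = pixel_values / latent_values
```

Actual compute does not scale exactly with this ratio, since it depends on the architecture, but it shows why the denoising loop becomes much cheaper.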
Diffusion models learn to reverse a noise process. The key insight is that you don't need to know the data distribution — you only need to learn its score function, the gradient of the log-density.
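As a concrete case, the score of a one-dimensional Gaussian has a closed form, and a finite-difference check confirms it is the derivative of the log-density. The function names here are illustrative only.

```python
import math


def gaussian_log_density(x, mu, sigma):
    """log N(x; mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def gaussian_score(x, mu, sigma):
    """Analytic score: d/dx log N(x; mu, sigma^2) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2


def numerical_score(x, mu, sigma, h=1e-5):
    """Central finite difference of the log-density, for comparison."""
    upper = gaussian_log_density(x + h, mu, sigma)
    lower = gaussian_log_density(x - h, mu, sigma)
    return (upper - lower) / (2.0 * h)


analytic = gaussian_score(1.5, mu=0.0, sigma=2.0)
numeric = numerical_score(1.5, mu=0.0, sigma=2.0)
```

The score always points toward regions of higher density, which is why following it from noise leads back toward the data.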