Modified: August 31, 2022
diffusion model
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Diffusion models for image generation were independently invented at least twice:
- in a discrete-time variational inference framework (developed by Jascha Sohl-Dickstein and others)
- in a continuous-time SDE 'score-matching' framework (developed by Stefano Ermon, Yang Song, etc.)
and these are now considered to be different perspectives on the same model family. I'll start here by trying to explain the Bayesian/VI take.
Denoising diffusion probabilistic models
(reference: Ho, Jain, Abbeel 2020)
Consider the following generative model for images:
- Start by sampling Gaussian noise $x_T \sim \mathcal{N}(0, T\sigma^2 I)$, for some positive $\sigma^2$ (which will turn out to be a per-step noise variance). We could simply use a standard normal, but taking variance $T\sigma^2$ here helps avoid the rescaling further down that makes the expressions in the original paper more complicated.
- For $t = T, T-1, \ldots, 1$, let $x_{t-1} \sim \mathcal{N}\big(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\big)$ for some parameterized functions $\mu_\theta, \Sigma_\theta$ (we'll discuss some natural choices for these below).
This defines a joint distribution
$$p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^T p_\theta(x_{t-1} \mid x_t),$$
and in particular a marginal distribution $p_\theta(x_0)$ on the final image $x_0$, where $x_{1:T}$ are taken as latent variables. We can view this as a deep latent Gaussian model (or hierarchical VAE). Given parameters $\theta$ it is straightforward to sample from this process to generate images. But how can we train such a system?
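Before getting to training: here's a minimal sketch of the sampling process just described, assuming we already have a trained mean model. The `mu_model` callable, image shape, and hyperparameters are hypothetical placeholders, and the per-step variance anticipates the forward-posterior variance derived below.

```python
import jax
import jax.numpy as jnp

def sample_image(mu_model, key, T=1000, sigma=0.1, shape=(32, 32, 3)):
    """Ancestral sampling from the reverse (generative) process.

    mu_model(x, t) -> predicted mean of x_{t-1}; a stand-in for a trained
    network. The per-step variance is the forward-posterior variance
    (t-1)/t * sigma^2 derived below, so the final step adds no noise.
    """
    key, subkey = jax.random.split(key)
    # Start from Gaussian noise with variance T * sigma^2.
    x = jnp.sqrt(T) * sigma * jax.random.normal(subkey, shape)
    for t in range(T, 0, -1):
        key, subkey = jax.random.split(key)
        std = jnp.sqrt((t - 1) / t) * sigma
        x = mu_model(x, t) + std * jax.random.normal(subkey, shape)
    return x
```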
Generally we'd think of using variational inference with a multilevel 'encoder' or approximating posterior $q(x_{1:T} \mid x_0)$. The key trick is that we can simplify things by just fixing this to be the iterative application of Gaussian noise:
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^T q(x_t \mid x_{t-1}),$$
where each $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ x_{t-1},\ \sigma^2 I\big)$ simply adds noise to the previous step. This gives the ELBO
$$\log p_\theta(x_0) \ge \mathbb{E}_q\big[\log p_\theta(x_0 \mid x_1)\big] - \sum_{t=2}^T \mathbb{E}_q\Big[D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)\Big] - D_{KL}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big).$$
Assuming a fixed prior variance $T\sigma^2$ (so that $p(x_T)$ has no trainable parameters), the term $D_{KL}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)$ is constant and can be ignored, leaving us with a KL divergence term for each timestep $t$,
$$\mathbb{E}_{q(x_t \mid x_0)}\Big[D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)\Big],$$
comparing the generative distribution $p_\theta(x_{t-1} \mid x_t)$ with the 'forward posterior' $q(x_{t-1} \mid x_t, x_0)$, in expectation over $q(x_t \mid x_0)$. Since the forward process is a simple Gaussian diffusion, this last expression has the closed form
$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ x_0,\ t\sigma^2 I\big),$$
and the posterior $q(x_{t-1} \mid x_t, x_0)$ can be derived in closed form (by observing that it is proportional to $q(x_t \mid x_{t-1})\,q(x_{t-1} \mid x_0)$ and applying a multivariate Gaussian identity) as
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\Big(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tfrac{t-1}{t}\sigma^2 I\Big)$$
with mean
$$\tilde\mu_t(x_t, x_0) = \frac{t-1}{t}\,x_t + \frac{1}{t}\,x_0.$$
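As a quick numerical sanity check of this closed form, we can compare it against exact Gaussian conditioning in one dimension (the specific numbers below are arbitrary):

```python
import jax.numpy as jnp

# 1-D sanity check of the forward posterior q(x_{t-1} | x_t, x_0).
# Given x_0, the random walk implies
#   x_{t-1} ~ N(x_0, (t-1) sigma^2)  and  x_t | x_{t-1} ~ N(x_{t-1}, sigma^2).
sigma, t, x0, xt = 0.1, 7, 0.3, 0.8

# Exact Gaussian conditioning: precision-weighted combination of the
# "prior" N(x_0, (t-1) sigma^2) and the "likelihood" N(x_t; x_{t-1}, sigma^2).
precision = 1.0 / ((t - 1) * sigma**2) + 1.0 / sigma**2
post_var = 1.0 / precision
post_mean = post_var * (x0 / ((t - 1) * sigma**2) + xt / sigma**2)

# Claimed closed form: mean (t-1)/t * x_t + x_0/t, variance (t-1)/t * sigma^2.
assert jnp.isclose(post_mean, (t - 1) / t * xt + x0 / t)
assert jnp.isclose(post_var, (t - 1) / t * sigma**2)
```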
This suggests that we should take our generative variance to equal that of the target we're comparing against; that is, take $\Sigma_\theta(x_t, t)$ to be the constant $\tilde\sigma_t^2 I$, where $\tilde\sigma_t^2 = \frac{t-1}{t}\sigma^2$. This choice simplifies the KL divergence of multivariate normals (only the means now differ), so that we derive
$$D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) = \frac{1}{2\tilde\sigma_t^2}\left\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2.$$
How should we choose $\mu_\theta$? If we reparameterize to write $x_t = x_0 + \sqrt{t}\,\sigma\epsilon$ for $\epsilon \sim \mathcal{N}(0, I)$, then our target mean becomes
$$\tilde\mu_t(x_t, x_0) = x_t - \frac{1}{t}\left(x_t - x_0\right) = x_t - \frac{\sigma}{\sqrt{t}}\,\epsilon,$$
which suggests a generative model of the form
$$\mu_\theta(x_t, t) = x_t - \frac{\sigma}{\sqrt{t}}\,\epsilon_\theta(x_t, t).$$
Under this choice (and using the reparameterization just discussed) the divergence reduces to
$$\frac{1}{2\tilde\sigma_t^2}\cdot\frac{\sigma^2}{t}\left\|\epsilon - \epsilon_\theta\big(x_0 + \sqrt{t}\,\sigma\epsilon,\ t\big)\right\|^2 = \frac{1}{2(t-1)}\left\|\epsilon - \epsilon_\theta\big(x_0 + \sqrt{t}\,\sigma\epsilon,\ t\big)\right\|^2,$$
where $\epsilon_\theta$ is now clearly being trained as a 'denoising' model attempting to model a step towards the noise-free image $x_0$; note also that we pulled out a factor of $\frac{\sigma^2}{t}$ from the squared error. It turns out that we can also write the final term $\mathbb{E}_q\big[-\log p_\theta(x_0 \mid x_1)\big]$ in a form very similar to the other terms: taking $p_\theta(x_0 \mid x_1)$ to be Gaussian with mean $\mu_\theta(x_1, 1)$ and some fixed variance, the same reparameterization gives
$$-\log p_\theta(x_0 \mid x_1) = c\left\|\epsilon - \epsilon_\theta\big(x_0 + \sigma\epsilon,\ 1\big)\right\|^2 + \text{const}$$
for a constant $c$ depending on the decoder variance, so that every term in the objective is (up to weighting) an $\epsilon$-prediction squared error.
Training. Now we have a practical training procedure (sketched in code after this list). For each input image $x_0$, we:
- Sample a target timestep $t \in \{1, \ldots, T\}$ uniformly at random.
- Sample Gaussian white noise $\epsilon \sim \mathcal{N}(0, I)$.
- Compute the stochastic loss term $\mathcal{L}(\theta) = \left\|\epsilon - \epsilon_\theta\big(x_0 + \sqrt{t}\,\sigma\epsilon,\ t\big)\right\|^2$, which (up to scaling) compares the 'denoised' image to the original input.
- Take a gradient descent step $\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}(\theta)$, or the equivalent using your gradient-based optimizer of choice (Adam, etc.).
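A minimal sketch of this training step, assuming a generic `eps_model(params, x, t)` denoising network; the network, its `params`, and the hyperparameters are placeholders rather than any particular architecture.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, eps_model, x0, t, eps, sigma=0.1):
    """Stochastic loss ||eps - eps_theta(x_0 + sqrt(t)*sigma*eps, t)||^2."""
    xt = x0 + jnp.sqrt(t) * sigma * eps
    return jnp.sum((eps - eps_model(params, xt, t)) ** 2)

def train_step(params, eps_model, x0, key, T=1000, lr=1e-4, sigma=0.1):
    """One plain SGD step on a single image x0."""
    t_key, eps_key = jax.random.split(key)
    t = jax.random.randint(t_key, (), 1, T + 1)   # t ~ Uniform{1, ..., T}
    eps = jax.random.normal(eps_key, x0.shape)    # Gaussian white noise
    grads = jax.grad(loss_fn)(params, eps_model, x0, t, eps, sigma)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
```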
Scaling. In the above we've defined the forward process as a simple random walk, so that the scale of $x_t$ increases with $t$. As a practical matter (e.g., for numerical stability), we may wish to work with the normalized iterates $\bar{x}_t = x_t / \sqrt{1 + t\sigma^2}$, which have unit scale (assuming that $x_0$ has unit scale), as the paper (Ho, Jain, Abbeel 2020) does. This introduces some scaling factors in the loss, but apparently it works better to ignore them anyway?
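Concretely (a small derivation under the unit-scale assumption on $x_0$), the normalized iterates correspond to the $\sqrt{\bar\alpha_t}$ parameterization used in the original paper via
$$\bar{x}_t = \frac{x_0 + \sqrt{t}\,\sigma\epsilon}{\sqrt{1 + t\sigma^2}} = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \qquad \bar\alpha_t = \frac{1}{1 + t\sigma^2}.$$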
Per-timestep variance: Ho et al. use a variance schedule $\sigma_t^2$ that can differ across timesteps. This makes the math a bit uglier, but is probably helpful? See the paper for details. Later papers, like the classifier-free guidance paper (linked below), formulate this instead in continuous time as a nonuniform distribution over the timestep $t$, which seems cleaner to me.
Conditional diffusion
Suppose we want to generate images conditioned on a label or caption $y$. The obvious thing to me would be to learn a gradient model $\epsilon_\theta(x_t, t, y)$ that incorporates the conditioning information. But it seems that people actually do various other things that might or might not be equivalent/related?
Classifier guidance (Dhariwal and Nichol, 2021): Given an image classifier $p_\phi(y \mid x)$, we define the joint distribution $p_\theta(x)\,p_\phi(y \mid x)$, which by fixing $y$ becomes an unnormalized conditional distribution $\tilde p_\theta(x \mid y)$. The gradients of this log-density wrt $x$ are just the gradients of $\log p_\theta(x)$ (which we are estimating as $-\epsilon_\theta(x_t, t)/(\sqrt{t}\,\sigma)$) plus the gradient $\nabla_x \log p_\phi(y \mid x)$. Incorporating the latter term in the generation process (with a tunable weight $w$) pushes the model towards the appropriate conditional slice of image space.
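A sketch of how this plugs into the $\epsilon$-parameterization used above; the denoiser and classifier callables here are hypothetical placeholders (the classifier must return a scalar log-probability).

```python
import jax
import jax.numpy as jnp

def guided_eps(eps_model, classifier_logprob, x, t, y, w=1.0, sigma=0.1):
    """Classifier-guided noise prediction, in this page's notation.

    Since the score of q(x_t | x_0) is approximately -eps / (sqrt(t) * sigma),
    adding w * grad_x log p(y | x_t) to the score corresponds to subtracting
    w * sqrt(t) * sigma * grad from the predicted noise.
    """
    grad_logp = jax.grad(lambda xx: classifier_logprob(xx, t, y))(x)
    return eps_model(x, t) - w * jnp.sqrt(t) * sigma * grad_logp
```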
Classifier-free guidance (Ho and Salimans, 2021): instead of training a separate classifier model, we train an unconditional diffusion model $\epsilon_\theta(x_t, t)$ and a conditional model $\epsilon_\theta(x_t, t, y)$; these can in fact be modeled using the same network, trained by randomly dropping out the conditioning information to ensure that it also learns the unconditional model. Then we generate using the learned 'gradient'
$$\tilde\epsilon_\theta(x_t, t, y) = (1 + w)\,\epsilon_\theta(x_t, t, y) - w\,\epsilon_\theta(x_t, t),$$
which we interpret as classifier guidance using the classifier implicitly defined by the ratio of the conditional and unconditional models, $p(y \mid x) \propto p(x \mid y)/p(x)$ (note we can ignore the $p(y)$ factor since this is constant wrt $x$ and so does not contribute to the gradient).
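A minimal sketch of the guided prediction, assuming a single placeholder `eps_model` network that accepts `y=None` for the unconditional case (as when trained with label dropout):

```python
def classifier_free_eps(eps_model, x, t, y, w=1.0):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by a factor (1 + w)."""
    eps_cond = eps_model(x, t, y)
    eps_uncond = eps_model(x, t, None)
    return (1.0 + w) * eps_cond - w * eps_uncond
```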
Text encoding: for a text2image model, we represent the conditioning text by an embedding vector. This can come from a generic pretrained language model (Google's Imagen found that using a large capable language model gives more coherent, logical generations) or from a CLIP model trained to co-embed images and text.
High resolution
- latent diffusion
- super-resolution