In the realm of image synthesis, a groundbreaking approach has emerged through Denoising Diffusion Probabilistic Models. This technique, inspired by nonequilibrium thermodynamics, represents a significant leap forward, blending the complexity of image generation with the elegance of probabilistic modeling.
Diffusion models are a class of latent variable models that simulate a Markov chain process. They transform data into a noisy state and then learn to reverse this process, effectively reconstructing the original data from noise.
The forward process is a gradual addition of Gaussian noise, transforming the original data into a nearly pure noise state. Conversely, the reverse process aims to reconstruct the original data, requiring the model to learn intricate patterns and dependencies within the data.
A critical aspect is the variance of the noise added at each step (\( \beta_t \)). This can be preset or learned during training through a technique called reparameterization, enhancing the model's adaptability and accuracy.
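As a concrete reference point, the sketch below builds a fixed (non-learned) schedule in PyTorch, assuming the linear schedule used in the original paper (\( T = 1000 \) steps with \( \beta_t \) increasing from \( 10^{-4} \) to \( 0.02 \)); the variable names are illustrative only.

```python
import torch

# A minimal sketch of a fixed (non-learned) noise schedule, assuming the
# linear schedule from the original DDPM paper: T = 1000 steps with beta_t
# increasing linearly from 1e-4 to 0.02.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t for t = 1..T
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # cumulative product alpha_bar_t
```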
A key feature of the forward process in diffusion models is its ability to sample \( x_t \), the data at any timestep \( t \), in a closed form. This is articulated in Equation 4:
\( q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t)I) \)
Here, \( \alpha_t := 1 - \beta_t \) represents the proportion of the original signal retained at each step \( t \), and \( \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s \) is the cumulative product of the \( \alpha_s \) values up to time \( t \), indicating the overall proportion of the original signal retained after \( t \) steps. This equation demonstrates that \( x_t \) is distributed according to a Gaussian with a mean dependent on the original data \( x_0 \) and a variance reflecting the total noise added up to step \( t \). This highlights the methodical transformation of data into noise, crucial for the model's training.
The models exhibit the remarkable ability to sample data states at arbitrary timesteps in closed form. This efficient sampling method is derived from the sequential application of noise and is pivotal for both understanding and implementing the model.
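A minimal sketch of this closed-form sampling, reusing the `alpha_bars` tensor from the schedule above; `q_sample` is a hypothetical helper name, not part of any library.

```python
import torch

def q_sample(x0: torch.Tensor, t: torch.Tensor, alpha_bars: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) in closed form (Equation 4)."""
    noise = torch.randn_like(x0)                        # epsilon ~ N(0, I)
    shape = (-1,) + (1,) * (x0.dim() - 1)               # broadcast per-sample scalars
    sqrt_ab = alpha_bars[t].sqrt().view(shape)
    sqrt_one_minus_ab = (1.0 - alpha_bars[t]).sqrt().view(shape)
    return sqrt_ab * x0 + sqrt_one_minus_ab * noise, noise
```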
Training these models involves optimizing the usual variational bound on the negative log likelihood, a process that is central to learning the reverse diffusion process. The variational bound is initially defined as \( L \) in Equation 3:
\( L = E_q\left[ -\log p(x_T) - \sum_{t \geq 1} \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} \right] \)
Efficient training is achieved through variance reduction by rewriting \( L \) as a new form, represented in Equation 5:
\( L = E_q\left[ D_{KL}(q(x_T | x_0) \| p(x_T)) + \sum_{t > 1} D_{KL}(q(x_{t-1}|x_t, x_0) \| p_{\theta}(x_{t-1}|x_t)) - \log p_{\theta}(x_0|x_1) \right] \)
This form of the variational bound, detailed in Appendix A, focuses on minimizing KL divergence terms and the negative log likelihood, allowing for more efficient training with reduced variance.
Equation 5 in the training process uses KL divergence to compare the learned reverse process \( p_\theta(x_{t-1}|x_t) \) against the forward process posteriors, which are tractable when conditioned on \( x_0 \). The forward process posterior is given by Equation 7:
\( q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I) \)
where \( \tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t \) and \( \tilde{\beta}_t := \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t \). This formulation allows the KL divergences in Equation 5 to be computed efficiently using closed-form expressions.
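To make these expressions concrete, here is a sketch of computing the posterior mean and variance, reusing the `betas`, `alphas`, and `alpha_bars` tensors defined earlier; the helper name `q_posterior` and the convention \( \bar{\alpha}_0 = 1 \) at the first step are assumptions of this sketch.

```python
import torch

def q_posterior(x0, xt, t, betas, alphas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0), Equation 7."""
    # Treat alpha_bar_{t-1} as 1 when t is the first step.
    alpha_bar_prev = torch.where(t > 0, alpha_bars[t - 1], torch.ones_like(alpha_bars[t]))
    shape = (-1,) + (1,) * (x0.dim() - 1)
    coef_x0 = (alpha_bar_prev.sqrt() * betas[t] / (1.0 - alpha_bars[t])).view(shape)
    coef_xt = (alphas[t].sqrt() * (1.0 - alpha_bar_prev) / (1.0 - alpha_bars[t])).view(shape)
    mean = coef_x0 * x0 + coef_xt * xt
    var = ((1.0 - alpha_bar_prev) / (1.0 - alpha_bars[t]) * betas[t]).view(shape)
    return mean, var
```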
A notable feature in these models is the use of Rao-Blackwellized estimation for computing KL divergences efficiently. This approach avoids high-variance methods like Monte Carlo estimates, leveraging closed-form expressions for Gaussian distributions.
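For diagonal Gaussians, these KL terms reduce to a simple formula; below is a generic sketch (the function name is illustrative).

```python
import torch

def gaussian_kl(mean1, logvar1, mean2, logvar2):
    """Closed-form KL( N(mean1, exp(logvar1)) || N(mean2, exp(logvar2)) )
    for diagonal Gaussians, computed elementwise; sum over dimensions as needed."""
    return 0.5 * (
        logvar2 - logvar1
        + (logvar1 - logvar2).exp()
        + (mean1 - mean2) ** 2 * (-logvar2).exp()
        - 1.0
    )
```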
The training process is described by Algorithm 1: repeat until convergence, drawing a data point \( x_0 \sim q(x_0) \), a timestep \( t \sim \text{Uniform}(\{1, \dots, T\}) \), and noise \( \epsilon \sim \mathcal{N}(0, I) \), then taking a gradient descent step on
\( \nabla_\theta \left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\right) \right\|^2 \)
This algorithm outlines the steps for training the model, where gradient descent is used to optimize the model's ability to reverse the diffusion process.
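A hedged PyTorch sketch of one step of Algorithm 1 is shown below; `eps_model` stands in for any noise-prediction network (e.g. a U-Net) and, like the other names, is an assumption of this sketch rather than a fixed API.

```python
import torch
import torch.nn.functional as F

def train_step(eps_model, optimizer, x0, alpha_bars):
    """One gradient step of Algorithm 1 on a batch x0."""
    alpha_bars = alpha_bars.to(x0.device)
    T = alpha_bars.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)   # t ~ Uniform({1..T})
    noise = torch.randn_like(x0)                                 # eps ~ N(0, I)
    shape = (-1,) + (1,) * (x0.dim() - 1)
    x_t = alpha_bars[t].sqrt().view(shape) * x0 \
        + (1.0 - alpha_bars[t]).sqrt().view(shape) * noise       # closed-form q(x_t | x_0)
    loss = F.mse_loss(eps_model(x_t, t), noise)                  # || eps - eps_theta(x_t, t) ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```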
After training, the model generates new samples through Algorithm 2: starting from pure noise \( x_T \sim \mathcal{N}(0, I) \), it iterates from \( t = T \) down to \( t = 1 \), computing
\( x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z \)
where \( z \sim \mathcal{N}(0, I) \) for \( t > 1 \) and \( z = 0 \) at the final step, and \( \sigma_t^2 \) is typically set to \( \beta_t \) or \( \tilde{\beta}_t \). The algorithm thus progressively removes the predicted noise, reversing the diffusion process to turn pure noise back into data.
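Below is a sketch of Algorithm 2 in the same style, assuming \( \sigma_t^2 = \beta_t \) and reusing the illustrative `eps_model` and schedule tensors from above.

```python
import torch

@torch.no_grad()
def sample(eps_model, shape, betas, alphas, alpha_bars):
    """Generate samples by reversing the diffusion process (Algorithm 2)."""
    T = len(betas)
    x = torch.randn(shape)                                    # x_T ~ N(0, I)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)                           # predicted noise eps_theta(x_t, t)
        x = (x - (1.0 - alphas[t]) / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt() \
            + betas[t].sqrt() * z                             # posterior mean plus sigma_t * z
    return x
```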
In the training of diffusion models, Stochastic Gradient Langevin Dynamics (SGLD) can, in principle, be employed. SGLD is an optimization method that combines stochastic gradient descent with Langevin dynamics, injecting Gaussian noise into each parameter update. This noise helps the optimizer explore the parameter space more thoroughly, potentially escaping local minima and encouraging more robust convergence.
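For reference, a generic SGLD update (not specific to the DDPM training objective) looks roughly like the sketch below, with the noise variance tied to the step size in the standard Langevin fashion.

```python
import torch

def sgld_step(params, loss, lr):
    """One SGLD update: a stochastic gradient step plus Gaussian noise
    whose variance is coupled to the step size (2 * lr)."""
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            noise = torch.randn_like(p) * (2.0 * lr) ** 0.5
            p.add_(-lr * g + noise)
```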
Langevin dynamics, in the context of diffusion models, more commonly refers to the sampling process after the model has been trained: each step of Algorithm 2 resembles a Langevin update in which \( \epsilon_\theta \) plays the role of a learned gradient of the data density, guiding the progressive removal of noise as a sample is generated.
Recent results show that training on a simplified, reweighted variant of the variational bound can improve sample quality. This objective focuses directly on the noise-prediction task, enhancing the model's ability to generate high-fidelity samples.
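Concretely, the simplified objective drops the per-term weights of the variational bound and reduces to a mean-squared error on the predicted noise:
\( L_{\text{simple}}(\theta) := E_{t, x_0, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\right) \right\|^2 \right] \)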
The architecture of the denoising network significantly impacts performance. The original work uses a U-Net backbone with self-attention, sharing parameters across timesteps and conditioning on \( t \) through sinusoidal time embeddings; such choices in model design and parameterization are crucial for managing the diffusion process stably and effectively.
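A minimal sketch of such a sinusoidal timestep embedding, as in Transformer position encodings; the function name and the even-dimension assumption are mine.

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of integer timesteps t, shape (B,) -> (B, dim).
    Assumes dim is even."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```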
When tested on benchmarks such as CIFAR10, these models achieve excellent sample quality; the original work reports an FID of 3.17 on unconditional CIFAR10, better than many contemporary generative models. However, despite this sample quality, their log likelihoods are not competitive with those of other likelihood-based models.
While promising, these models raise ethical concerns, especially in their potential misuse for generating deceptive imagery. However, they also offer immense benefits in fields like data compression and creative industries.
Looking ahead, these models present vast potential for improvements and novel applications. Their ability to learn complex data distributions opens up avenues in various domains, from advanced image synthesis to innovative artistic creations.
In conclusion, Denoising Diffusion Probabilistic Models stand at the forefront of a new wave of image synthesis, marrying advanced probabilistic modeling with practical effectiveness. As research progresses, they are poised to redefine the boundaries of what's possible in generative modeling.
Created 2023-12-09T09:13:40-08:00