Part A: The Power of Diffusion Models

Setup

To gain familiarity with diffusion models, we will use the DeepFloyd IF diffusion model, trained by Stability AI. Let's look at the model's output for a series of text prompts at different numbers of inference steps, using a random seed of 180.

"An oil painting of a snowy mountain village"

Image 1
inference steps = 10
Image 2
inference steps = 20
Image 3
inference steps = 30

"A man wearing a hat"

Image 1
inference steps = 10
Image 2
inference steps = 20
Image 3
inference steps = 30

"A rocket ship"

Image 1
inference steps = 10
Image 2
inference steps = 20
Image 3
inference steps = 30

We see that more detailed and specific prompts, such as "an oil painting of a snowy mountain village", produce higher-quality images that stay consistent across different numbers of inference steps. We also see that fewer inference steps, despite being faster, come at the cost of reduced image quality.

Sampling Loops

In this section we will use DeepFloyd's denoisers to implement sampling loops that generate high-quality images. To sample an image with these models, we start with pure noise at some timestep \(T\), sampled from a Gaussian distribution, giving us \(x_T\). Using a diffusion model like DeepFloyd, we can reverse this process by predicting and removing the noise at each timestep \(t\) until we get a clean image \(x_0\). We can modify these sampling loops for a variety of tasks, like inpainting or creating optical illusions.

Implementing the Forward Process

The forward process is a fundamental part of diffusion models. It takes a clean image \(x_0\) and adds noise to it as follows: $$x_t = \sqrt{\bar{\alpha_t}}x_0 + \sqrt{1-\bar{\alpha_t}}\epsilon, \quad \text{where } \epsilon \sim N(0,I)$$ For our forward pass we use the alphas_cumprod variable, which contains the noise coefficients \(\bar{\alpha_t}\) from the DeepFloyd model.
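As a rough sketch of this step (assuming PyTorch tensors and an `alphas_cumprod` tensor indexed by timestep; the function name is our own):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Add noise to a clean image `im` to produce x_t at timestep t.

    im: clean image tensor of shape (B, C, H, W)
    t: integer timestep
    alphas_cumprod: 1-D tensor of cumulative alpha products, indexed by timestep
    """
    alpha_bar = alphas_cumprod[t]
    epsilon = torch.randn_like(im)  # epsilon ~ N(0, I)
    x_t = torch.sqrt(alpha_bar) * im + torch.sqrt(1 - alpha_bar) * epsilon
    return x_t, epsilon
```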

Below are the results at different noise levels for the Campanile:

Image 1
Campanile @ t = 250
Image 2
Campanile @ t = 500
Image 3
Campanile @ t = 750
Image 4
Campanile

Classical Denoising

Let's attempt to denoise our images using classical methods like Gaussian blur filtering.
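A minimal sketch of this baseline using torchvision's Gaussian blur (the kernel size and sigma here are illustrative choices, not necessarily the exact values we used):

```python
import torchvision.transforms.functional as TF

def gaussian_denoise(noisy_im, kernel_size=5, sigma=2.0):
    """Try to suppress the added noise with a simple Gaussian blur."""
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```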

Here are the results below:

Image 1
Noisy @ t = 250
Image 2
Noisy @ t = 500
Image 3
Noisy @ t = 750
Image 1
Gaussian Blur @ t = 250
Image 2
Gaussian Blur @ t = 500
Image 3
Gaussian Blur @ t = 750

One-Step Denoising

We see that classical denoising was ineffective at removing the noise. Let's instead use the pretrained diffusion model to remove the noise in a single step (one-step denoising).
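Conceptually, one-step denoising inverts the forward equation, using the UNet's noise estimate in place of the true noise. A sketch, assuming we already have the model's noise estimate (helper names are our own):

```python
import torch

def one_step_denoise(x_t, t, noise_estimate, alphas_cumprod):
    """Estimate the clean image x_0 in a single step.

    Rearranges x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps to solve for x_0,
    substituting the UNet's noise estimate for the true eps.
    """
    alpha_bar = alphas_cumprod[t]
    x0_hat = (x_t - torch.sqrt(1 - alpha_bar) * noise_estimate) / torch.sqrt(alpha_bar)
    return x0_hat
```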

Image 1
Gaussian Blur @ t = 250
Image 2
Gaussian Blur @ t = 500
Image 3
Gaussian Blur @ t = 750
Image 1
One-Step Denoising @ t = 250
Image 2
One-Step Denoising @ t = 500
Image 3
One-Step Denoising @ t = 750

Iterative Denoising

We saw that one-step denoising was effective compared to Gaussian blur filtering. However, diffusion models are designed to denoise iteratively. We can iterate over a series of timesteps, starting at the largest \(t\) (corresponding to the noisiest image) and ending at the smallest \(t\) (corresponding to the clean image). At each timestep we apply the following formula: $$x_{t'} = \frac{\sqrt{\bar{\alpha_{t'}}}\beta_t}{1-\bar{\alpha_t}}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha_{t'}})}{1-\bar{\alpha_t}}x_t + v_\sigma$$

where:

- \(x_t\) is the image at timestep \(t\)
- \(x_{t'}\) is the noisy image at timestep \(t' < t\) (a less noisy timestep)
- \(\bar{\alpha_t}\) is the cumulative product of the \(\alpha\) values (alphas_cumprod)
- \(\alpha_t = \bar{\alpha_t} / \bar{\alpha_{t'}}\)
- \(\beta_t = 1 - \alpha_t\)
- \(x_0\) is our current estimate of the clean image, computed with the one-step formula above
- \(v_\sigma\) is random noise predicted by the model
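A sketch of a single update of this loop, reusing the `alphas_cumprod` coefficients from before (the variance term `v_sigma` is supplied by the model; names here are illustrative):

```python
import torch

def iterative_denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod, v_sigma):
    """One update of the iterative denoising loop, moving from timestep t to t' < t.

    x0_hat is the current clean-image estimate (from the one-step formula),
    v_sigma is the random-noise (predicted variance) term.
    """
    alpha_bar_t = alphas_cumprod[t]
    alpha_bar_tp = alphas_cumprod[t_prime]
    alpha_t = alpha_bar_t / alpha_bar_tp
    beta_t = 1 - alpha_t

    x_tp = (torch.sqrt(alpha_bar_tp) * beta_t / (1 - alpha_bar_t)) * x0_hat \
         + (torch.sqrt(alpha_t) * (1 - alpha_bar_tp) / (1 - alpha_bar_t)) * x_t \
         + v_sigma
    return x_tp
```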

Here are the results below:

Image 1
Noisy @ t = 90
Image 2
Noisy @ t = 240
Image 3
Noisy @ t = 390
Image 4
Noisy @ t = 540
Image 5
Noisy @ t = 690
Image 1
Original
Image 2
Iteratively Denoised
Image 3
One-Step Denoised
Image 4
Gaussian Blurred

Diffusion Model Sampling

Previously, we used the diffusion model to denoise images. However, we can also use it to generate images from scratch.

Here are some samples:

Image 1
Sample 1
Image 2
Sample 2
Image 3
Sample 3
Image 4
Sample 4
Image 5
Sample 5

Classifier-Free Guidance (CFG)

To improve the quality of the generated images at the expense of image diversity, we can use classifier-free guidance (CFG). We compute both a conditional and an unconditional noise estimate, denoted \(\epsilon_c\) and \(\epsilon_u\). We then use the following noise estimate: $$\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)$$ Here \(\gamma\) controls the strength of CFG; when \(\gamma > 1\), we get higher-quality images.
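A sketch of how the two estimates are combined (the UNet call signature here is simplified for illustration):

```python
def cfg_noise_estimate(unet, x_t, t, cond_emb, uncond_emb, gamma=7.0):
    """Classifier-free guidance: blend conditional and unconditional noise estimates.

    cond_emb / uncond_emb are the text embeddings for the prompt and the empty prompt;
    gamma > 1 pushes samples toward the conditional estimate.
    """
    eps_c = unet(x_t, t, cond_emb)   # conditional noise estimate
    eps_u = unet(x_t, t, uncond_emb) # unconditional noise estimate
    return eps_u + gamma * (eps_c - eps_u)
```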

Below are generated images using CFG:

Image 1
Sample 1
Image 2
Sample 2
Image 3
Sample 3
Image 4
Sample 4
Image 5
Sample 5

Image-to-Image Translation

We can use varying noise levels to make edits to existing images: the more noise we add to an image, the larger the edits will be. We're going to take our test images, noise them, and force them back onto the natural image manifold without any conditioning. This results in an image similar to the original, with the similarity depending on the noise level we start at. This approach follows the SDEdit algorithm.
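A sketch of the procedure, reusing the forward process and iterative denoising loop from earlier (function names are illustrative):

```python
def sdedit(im, i_start, timesteps, alphas_cumprod, forward, denoise_loop):
    """SDEdit-style edit: noise a real image part-way, then denoise it back.

    timesteps is ordered from noisiest (index 0) to cleanest; a larger i_start
    means less added noise and therefore a more faithful reconstruction.
    """
    t = timesteps[i_start]
    x_t, _ = forward(im, t, alphas_cumprod)  # push the image off the manifold
    return denoise_loop(x_t, i_start)        # project it back with the diffusion model
```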

Here are the results below:

Image 1
i_start=1
Image 2
i_start=3
Image 3
i_start=5
Image 4
i_start=7
Image 5
i_start=10
Image 6
i_start=20
Image 7
Campanile
Image 1
i_start=1
Image 2
i_start=3
Image 3
i_start=5
Image 4
i_start=7
Image 5
i_start=10
Image 6
i_start=20
Image 7
Golden Gate Bridge
Image 1
i_start=1
Image 2
i_start=3
Image 3
i_start=5
Image 4
i_start=7
Image 5
i_start=10
Image 6
i_start=20
Image 7
Taj Mahal

Editing Hand-Drawn Images and Web Images

The procedure above works especially well if we start with nonrealistic images and then project them onto the natural image manifold. Let's try it with hand-drawn and web images.

Here are the results:

Image 1
i_start=1
Image 2
i_start=3
Image 3
i_start=5
Image 4
i_start=7
Image 5
i_start=10
Image 6
i_start=20
Image 7
The Starry Night
Image 1
i_start=1
Image 2
i_start=3
Image 3
i_start=5
Image 4
i_start=7
Image 5
i_start=10
Image 6
i_start=20
Image 7
Tree
Image 1
i_start=1
Image 2
i_start=3
Image 3
i_start=5
Image 4
i_start=7
Image 5
i_start=10
Image 6
i_start=20
Image 7
Flower

Inpainting

We can use diffusion models to implement inpainting. Given an image \(x_{orig}\) and a binary mask \(\textbf{m}\), we can create a new image. This new image will have the same content as the original where \(\textbf{m}\) is 0 and new content where \(\textbf{m}\) is 1.

At every step of the diffusion denoising loop we apply the following formula: $$x_t \leftarrow \mathbf{m}x_t + (1-\mathbf{m})\text{forward}(x_{orig}, t)$$
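A sketch of this per-step mask update, reusing the forward process sketch from earlier (names are illustrative):

```python
def inpaint_step(x_t, x_orig, mask, t, forward, alphas_cumprod):
    """After each denoising update, restore the known region from the original image.

    mask is 1 where new content should be generated and 0 where the original
    image should be kept.
    """
    x_known, _ = forward(x_orig, t, alphas_cumprod)  # original image noised to level t
    return mask * x_t + (1 - mask) * x_known
```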

Here are the results:

Image 1
Campanile
Image 2
Mask
Image 3
Hole to Fill
Image 4
Campanile Inpainted
Image 1
Golden Gate Bridge
Image 2
Mask
Image 3
Hole to Fill
Image 4
Golden Gate Bridge Inpainted
Image 1
Taj Mahal
Image 2
Mask
Image 3
Hole to Fill
Image 4
Taj Mahal Inpainted

Text-Conditional Image-to-Image Translation

We can also use a different text prompt to guide the projection in image-to-image translation, giving us control via language.

Here are the results below using different prompts:

"A rocket ship"

Image 1
i_start=1
Image 2
i_start=3
Image 3
i_start=5
Image 4
i_start=7
Image 5
i_start=10
Image 6
i_start=20
Image 7
Campanile

"A photo of the amalfi coast"

Image 1
i_start=1
Image 2
i_start=3
Image 3
i_start=5
Image 4
i_start=7
Image 5
i_start=10
Image 6
i_start=20
Image 7
Golden Gate Bridge

"An oil painting of a snowy mountain village"

Image 1
i_start=1
Image 2
i_start=3
Image 3
i_start=5
Image 4
i_start=7
Image 5
i_start=10
Image 6
i_start=20
Image 7
Taj Mahal

Visual Anagrams

Using our diffusion model, we can create optical illusions like visual anagrams: images that look like one thing right-side up and like another when flipped upside down. To do this, we run the denoising step on the following noise estimate: $$\epsilon_1 = UNet(x_t,t,p_1)$$ $$\epsilon_2 = flip(UNet(flip(x_t),t,p_2))$$ $$\epsilon = (\epsilon_1 + \epsilon_2)/2$$

where:

- \(x_t\) is the noisy image at timestep \(t\)
- \(p_1\) is the text prompt embedding for the right-side-up image and \(p_2\) the embedding for the flipped image
- \(UNet\) is the diffusion model's noise estimator
- \(flip(\cdot)\) flips the image upside down
- \(\epsilon\) is the final noise estimate used in the denoising step
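A sketch of this noise estimate, with a simplified UNet call signature for illustration:

```python
import torch

def anagram_noise_estimate(unet, x_t, t, p1, p2):
    """Noise estimate for a visual anagram: denoise toward prompt p1 right-side up
    and toward prompt p2 upside down, then average the two estimates."""
    eps1 = unet(x_t, t, p1)
    eps2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, p2), dims=[-2])  # flip vertically
    return (eps1 + eps2) / 2
```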

Here are the results:

After Image
An Oil Painting of People around a Campfire
Before Image
An Oil Painting of an Old Man
After Image
A Lithograph of Waterfalls
Before Image
A Photo of a Hipster Barista
After Image
An Oil Painting of a Snowy Mountain Village
Before Image
A Photo of the Amalfi Coast

Hybrid Images

Using our diffusion model, we can also create hybrid images: images that look like one thing from up close and like another from far away. To do this, we run the denoising step on the following noise estimate: $$\epsilon_1 = UNet(x_t,t,p_1)$$ $$\epsilon_2 = UNet(x_t,t,p_2)$$ $$\epsilon = f_{lowpass}(\epsilon_1) + f_{highpass}(\epsilon_2)$$

where:

- \(p_1\) is the text prompt embedding for the low-frequency (far away) component and \(p_2\) the embedding for the high-frequency (up close) component
- \(f_{lowpass}\) is a low-pass filter (a Gaussian blur)
- \(f_{highpass}\) is a high-pass filter (the image minus its Gaussian blur)
- \(\epsilon\) is the final noise estimate used in the denoising step
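A sketch of this noise estimate using a Gaussian blur as the low-pass filter (the kernel size and sigma are illustrative; the UNet call signature is simplified):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, p1, p2, kernel_size=33, sigma=2.0):
    """Noise estimate for a hybrid image: keep the low frequencies of the estimate
    for prompt p1 and the high frequencies of the estimate for prompt p2."""
    eps1 = unet(x_t, t, p1)
    eps2 = unet(x_t, t, p2)
    low = TF.gaussian_blur(eps1, kernel_size=kernel_size, sigma=sigma)          # f_lowpass
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=kernel_size, sigma=sigma)  # f_highpass
    return low + high
```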

Here are the results:

Image 1
Hybrid image of a skull and waterfall
Image 2
Hybrid image of the moon and a forest canopy
Image 3
Hybrid image of a rocket and a snowy mountain village

Part B: Diffusion Models from Scratch!

Training a Single-Step Denoising UNet

Let's start by building a simple one-step denoiser. Given a noisy image \(z\), we aim to train a denoiser \(D_{\theta}\) such that it maps \(z\) to a clean image \(x\). To do so, we can optimize over an L2 loss: $$ L = \mathbb{E}_{z,x}||D_{\theta}(z) - x||^2 $$

Implementing the UNet

We implement the UNet according to the following diagrams:

Unconditional UNet
Standard UNet Operations

Using the UNet to Train a Denoiser

To train our denoiser, we need to generate training data pairs \((z, x)\), where \(x\) is a clean MNIST digit. For each training batch, we generate \(z\) from \(x\) using the following formula: $$z = x + \sigma\epsilon, \quad \text{where } \epsilon \sim N(0, I)$$
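A minimal sketch of this noising step:

```python
import torch

def add_noise(x, sigma=0.5):
    """Generate a noisy training input z = x + sigma * eps from a clean digit x."""
    return x + sigma * torch.randn_like(x)
```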

Here's a visualization of the noising process:

Varying levels of noise

Training

Now let's train our model to perform denoising. Below is the training losses curve:

Unconditional Training Losses

Here are the results following the 1st and 5th epochs:

After Image
1st Epoch Results
Before Image
5th Epoch Results

Out of Distribution Testing

The model was trained on MNIST digits noised with \(\sigma = 0.5\). Let's see how the model performs at other sigma values.

Here are the results:

Out-of-Distribution Testing

Training a Diffusion Model

Now we can use diffusion to train a UNet that iteratively denoises images, following the DDPM approach.

Instead of predicting the clean image directly, our UNet now predicts the noise that was added. For a noisy image \(z\), our objective function becomes:

$$ L = \mathbb{E}_{\epsilon}||e_{\theta}(z) - \epsilon||^2 $$

For diffusion, we follow a gradual process: we start with pure noise sampled from \(N(0, I)\) and iteratively denoise over timesteps \(t \in \{0, 1, \cdots, T\}\). During training, we generate a noisy image at a given timestep using the following equation:

$$ x_t = \sqrt{\bar{\alpha_t}}x_0 + \sqrt{1-\bar{\alpha_t}}\epsilon, \quad \text{where } \epsilon \sim N(0, I). $$

We construct \(\bar{\alpha_t}\) following the DDPM schedule:

- \(\beta_t\) is a list of length \(T\) with values evenly spaced from \(\beta_1 = 0.0001\) to \(\beta_T = 0.02\)
- \(\alpha_t = 1 - \beta_t\)
- \(\bar{\alpha_t} = \prod_{s=1}^{t} \alpha_s\), the cumulative product of the \(\alpha\) values
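A minimal sketch of this schedule in PyTorch (the choice of \(T\) here is illustrative):

```python
import torch

T = 300                                          # number of diffusion timesteps (illustrative)
betas = torch.linspace(1e-4, 0.02, T)            # beta_t, evenly spaced
alphas = 1.0 - betas                             # alpha_t = 1 - beta_t
alphas_cumprod = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{s<=t} alpha_s
```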

To denoise \(x_t\), we apply the UNet to it to estimate the noise \(\epsilon\). We condition on \(t\), since the variance of \(x_t\) varies with \(t\), giving us our final objective function: $$L = \mathbb{E}_{x_t,t}||e_{\theta}(x_t, t) - \epsilon||^2$$

Adding Time Conditioning to UNet

We add time conditioning to our UNet model using the following diagrams:

Conditioned UNet
FC Block for Conditioning

Training the UNet

We train the time-conditioned UNet using the following algorithm:

Training time-conditioned UNet

Here are the training losses:

Time-conditioned training losses

Sampling from the UNet

We sample from the time-conditioned UNet using the following algorithm:

Sampling from time-conditioned UNet

Here are the results following the 5th and 20th epochs:

After Image
5th Epoch
Before Image
20th Epoch

Adding Class-Conditioning to UNet

To improve the results and give us more control over image generation, we can also condition our UNet on the class of the digit (0-9). This requires adding 2 more FCBlocks to our UNet that take in a class-conditioning vector \(c\), a one-hot encoding of the digit. To ensure our UNet still works without class conditioning, we drop out the conditioning 10% of the time by setting \(c\) to zero.
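A sketch of building the conditioning vector with dropout (names and the exact masking scheme are illustrative):

```python
import torch
import torch.nn.functional as F

def class_conditioning(labels, num_classes=10, p_uncond=0.1):
    """One-hot class vectors with classifier-free-guidance dropout.

    With probability p_uncond, a sample's conditioning vector is zeroed out so the
    UNet also learns to denoise unconditionally.
    """
    c = F.one_hot(labels, num_classes).float()                       # (B, 10) one-hot vectors
    keep = (torch.rand(labels.shape[0], 1, device=labels.device) > p_uncond).float()
    return c * keep                                                  # zero out ~10% of rows
```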

To train the class-conditioned UNet we apply the following algorithm:

Training class-conditioned UNet

Here are the training losses:

Class-conditioned training losses

Sampling from the Class-Conditioned UNet

We sample from our class-conditioned UNet as follows:

Sampling from class-conditioned UNet

Here are the results following the 5th and 20th epochs:

After Image
5th Epoch
Before Image
20th Epoch