Building a Diffusion Model

Here I will cover how a basic diffusion model can be built.

Introduction

On this page I'm going to showcase the results from a very simple diffusion model for generating landscape images, along with the steps that were necessary to implement it. This is meant to demonstrate the idea behind how diffusion models work.

Papers on diffusion models that were used for this are the following:

  • Denoising Diffusion Probabilistic Models (Ho et al., 2020)
  • Improved Denoising Diffusion Probabilistic Models (Nichol & Dhariwal, 2021)

The model has been trained on the Google Colab free tier, and due to its usage limits I was not able to train the model long enough to produce very high-quality results. However, the results produced are sufficient for demonstrating a custom diffusion model.

For the custom model I have used a landscape classification dataset. The dataset consists of 5 different classes, each representing a kind of landscape: Coast, Desert, Forest, Glacier, Mountains.
Source: Kaggle

Overview:

Diffusion models work as a Markov chain of diffusion steps (timesteps) in which random noise is slowly added to an image; this is known as the forward process. A neural network then learns to reverse the diffusion process and recover the image from the noise; this is called the backward (reverse) process.

The task of the model can be described as predicting the noise that was added to each image, which is why we use a neural network for it. The papers recommend using a U-Net for this. To generate new, unique images we can simply perform the backward process starting from random noise, and new image samples are constructed.

How to implement:

For a simple implementation, mainly three things are needed:

  • Noise Scheduler: determines how much noise is added at each timestep.
  • Neural Network: predicts the noise in an image.
  • Timestep Encoding: a way to encode the current timestep.

Part 1: Noise scheduler for the forward process

First we need to create the inputs for the model, which are increasingly noisy images. Instead of doing this sequentially, we can calculate the noisy image for any timestep directly in closed form, as derived in the papers, and use it for sampling. No model is needed for this part.
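
Concretely, with a variance schedule \(\beta_t\), \(\alpha_t = 1 - \beta_t\), and \(\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s\), the papers show that a noisy image at an arbitrary timestep \(t\) can be sampled directly from the original image \(x_0\):

\[
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
\]

This closed form is what makes the pre-computation mentioned below possible.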

Key points:
  • The noise levels (the variance schedule) can be pre-computed for all timesteps

Note: The original paper proposed the use of a linear variance schedule for adding noise to the images; however, later papers found that this destroyed the images too quickly and instead introduced a cosine-based variance schedule, which prevents the image from being noised/destroyed too early. This is what I have used in my implementation.

import torch

def cosine_variance_scheduler(timesteps, s=0.008):
    # Cosine schedule: compute the cumulative product of the alphas first,
    # then derive the per-step betas from consecutive ratios.
    steps = timesteps + 1
    x = torch.linspace(0, timesteps, steps)
    alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * torch.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    # Clip to avoid degenerate values at the ends of the schedule.
    return torch.clip(betas, 0.0001, 0.9999)
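
The loss function in Part 3 uses a forward_diffusion_sample helper that is not shown in the original snippets. Below is a minimal sketch of how it might look, assuming the betas from the scheduler above are pre-computed into cumulative-product terms; the tensor names here are my own assumptions, not necessarily those of the original implementation.

T = 300  # total number of timesteps (matches the training section below)
betas = cosine_variance_scheduler(T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)

def forward_diffusion_sample(x_0, t, device):
    # Sample x_t in closed form: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    noise = torch.randn_like(x_0)
    sqrt_a = sqrt_alphas_cumprod[t.cpu()].view(-1, 1, 1, 1).to(device)
    sqrt_one_minus_a = sqrt_one_minus_alphas_cumprod[t.cpu()].view(-1, 1, 1, 1).to(device)
    return sqrt_a * x_0.to(device) + sqrt_one_minus_a * noise.to(device), noise.to(device)
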
A simulation of the forward diffusion process:
[figure: an image becoming progressively noisier over the timesteps]

Part 2: The parameterized backward process with a neural network

The authors of the papers propose using a U-Net model for this process. I have used a very simple version of a U-Net in my implementation, with only 6 million parameters. It lacks many common improvements such as batch normalization, group normalization, attention layers, etc. I felt these were unnecessary for this demonstration, so the model only uses the main components of the architecture, such as downsampling and upsampling as well as some residual connections.
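
To make this concrete, here is a minimal sketch of what one stage of such a simple U-Net could look like: a pair of convolutions conditioned on the timestep embedding, followed by down- or up-sampling. The layer sizes and names are illustrative assumptions, not the original code.

from torch import nn

class Block(nn.Module):
    # One U-Net stage: two convolutions with the timestep embedding added
    # in between, followed by down-sampling (encoder) or up-sampling (decoder).
    def __init__(self, in_ch, out_ch, time_emb_dim, up=False):
        super().__init__()
        self.time_mlp = nn.Linear(time_emb_dim, out_ch)
        if up:
            # On the up path the skip connection is concatenated,
            # which doubles the number of input channels.
            self.conv1 = nn.Conv2d(2 * in_ch, out_ch, 3, padding=1)
            self.transform = nn.ConvTranspose2d(out_ch, out_ch, 4, 2, 1)
        else:
            self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
            self.transform = nn.Conv2d(out_ch, out_ch, 4, 2, 1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x, t_emb):
        h = self.relu(self.conv1(x))
        # Broadcast the timestep embedding over the spatial dimensions.
        h = h + self.relu(self.time_mlp(t_emb))[..., None, None]
        h = self.relu(self.conv2(h))
        return self.transform(h)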

Key points:
  • The input of the model is a noisy image; the output is the noise present in that image.
  • The timestep is encoded with sinusoidal position embeddings, as in the Transformer architecture (see the sketch below).
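
A minimal sketch of such a sinusoidal timestep embedding follows; the class name and dimension handling are my assumptions (dim is assumed to be even), following the standard Transformer formulation.

import math
import torch
from torch import nn

class SinusoidalPositionEmbeddings(nn.Module):
    # Encodes an integer timestep t as a vector of sines and cosines.
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        half_dim = self.dim // 2
        # Frequencies spaced geometrically from 1 down to 1/10000,
        # as in the Transformer's positional encoding.
        freqs = torch.exp(
            torch.arange(half_dim, device=t.device)
            * -(math.log(10000.0) / (half_dim - 1))
        )
        args = t[:, None].float() * freqs[None, :]
        # Concatenate sin and cos to obtain a (batch, dim) embedding.
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)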

Part 3: The loss

The last part is the loss function. Diffusion models are optimized with variational inference; put simply, the simplified objective reduces to the L1 or L2 distance between the predicted noise and the actual noise in the image.

The following loss function takes the model, an image (x_0), and a timestep (t), and returns the loss between the predicted noise and the sampled noise.

def get_loss(model, x_0, t):
    # Noise the clean image up to timestep t, keeping the noise that was added.
    x_noisy, noise = forward_diffusion_sample(x_0, t, DEVICE)
    # The model predicts the noise contained in the noisy image.
    noise_pred = model(x_noisy, t)
    return torch.nn.functional.l1_loss(noise, noise_pred)

Sampling and Training

Finally, we define the sampling procedure before training, since the training loop uses it to visualize progress.

The typical number of timesteps used in the papers is T = 1000; however, the larger T is, the slower sampling becomes. Due to hardware limitations, and since this model is just meant to demonstrate the concept, I chose a smaller T = 300 and trained for 500 epochs.
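
The sample_plot_image call in the training loop below repeatedly denoises pure noise into an image. Here is a minimal sketch of the core reverse step, following the DDPM sampling equation with the simple choice of posterior variance sigma_t^2 = beta_t; it reuses the schedule tensors from the forward-diffusion sketch in Part 1, and IMG_SIZE is a placeholder for the training resolution, not a name from the original code.

@torch.no_grad()
def sample_timestep(model, x, t):
    # Denoise x_t one step to x_{t-1} using the model's noise prediction.
    # Assumes all elements of t hold the same timestep, as in the loop below.
    betas_t = betas[t.cpu()].view(-1, 1, 1, 1).to(x.device)
    sqrt_one_minus_a_t = sqrt_one_minus_alphas_cumprod[t.cpu()].view(-1, 1, 1, 1).to(x.device)
    sqrt_recip_alphas_t = torch.rsqrt(alphas[t.cpu()]).view(-1, 1, 1, 1).to(x.device)

    # Posterior mean: remove the predicted noise from the current image.
    model_mean = sqrt_recip_alphas_t * (x - betas_t * model(x, t) / sqrt_one_minus_a_t)
    if t[0].item() == 0:
        return model_mean  # no noise is added at the final step
    return model_mean + torch.sqrt(betas_t) * torch.randn_like(x)

# Generating an image: start from pure noise and walk backwards through time.
x = torch.randn(1, 3, IMG_SIZE, IMG_SIZE, device=DEVICE)
for i in reversed(range(T)):
    t = torch.full((1,), i, device=DEVICE, dtype=torch.long)
    x = sample_timestep(model, x, t)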

model.to(DEVICE)
optimizer = Adam(model.parameters(), lr=0.001)
epochs = 500

for epoch in range(epochs):
    pbar = tqdm(dataloader)
    for step, batch in enumerate(pbar):
        optimizer.zero_grad()

        # Sample a random timestep for every image in the batch.
        t = torch.randint(0, T, (BATCH_SIZE,), device=DEVICE).long()
        loss = get_loss(model, batch[0], t)
        loss.backward()
        optimizer.step()

        # Periodically log the loss and plot samples from the current model.
        if epoch % 10 == 0 and step == 0:
            print(f"Epoch {epoch} | step {step:03d} Loss: {loss.item()} ")
            sample_plot_image()
The following are some of the final results:
[figures: three sample landscape images generated by the trained model]

Conclusion

The resolution of the generated images is very small; however, after 500 epochs they do start to look like landscapes. With longer training time and improvements to the U-Net, the results can be improved further.

For the purpose of demonstrating a custom model, I feel the results are sufficient.