Probabilistic Deep Learning Fundamentals: Variational Autoencoders

This blog introduces the concept of variational autoencoders, a probabilistic deep learning technique for generating realistic images and data samples.

1. Introduction

In this blog, you will learn about one of the most popular and powerful techniques in probabilistic deep learning: variational autoencoders (VAEs). VAEs are a type of generative model that can learn to produce realistic and diverse samples of data, such as images, text, or audio.

But what makes VAEs so special? How do they work? And how can you use them to generate your own images and data samples? These are some of the questions that we will answer in this blog.

By the end of this blog, you will be able to:

Explain the basic idea and components of VAEs
Understand the role of latent variables and the reparameterization trick in VAEs
Implement and train VAEs using PyTorch
Use VAEs to generate realistic and diverse images and data samples

Before we dive into the details of VAEs, let’s first review some background concepts and terminology that will help us understand VAEs better.

2. What are Variational Autoencoders?

A variational autoencoder (VAE) is a type of neural network that can learn to generate realistic and diverse samples of data, such as images, text, or audio. VAEs belong to a broader class of models called generative models, which aim to capture the underlying distribution of the data and produce new samples from it.

But how do VAEs achieve this? What makes them different from other generative models? And what are the key components and concepts of VAEs? To answer these questions, we need to understand the basic idea and architecture of VAEs.

The basic idea of VAEs is to use a latent variable model to represent the data. A latent variable model is a probabilistic model that assumes that there is some hidden or unobserved variable that influences the observed data. For example, imagine that you have a dataset of handwritten digits, like the MNIST dataset. Each digit image can be seen as a result of some latent variable, such as the style, orientation, or thickness of the handwriting. If we could somehow infer or manipulate these latent variables, we could generate new digit images that look realistic and diverse.

However, the problem is that we don’t know what these latent variables are, or how they relate to the observed data. We can’t directly observe or measure them, and we can’t easily infer them from the data. This is where VAEs come in. VAEs are designed to learn both the latent variables and their relationship to the data in an unsupervised way, using a neural network architecture that consists of two main components: an encoder and a decoder.

2.1. The Encoder-Decoder Architecture

The encoder-decoder architecture is a common design pattern for neural networks that deal with sequential or structured data, used in areas such as natural language processing, speech recognition, and image captioning. The idea is to use two neural networks: one to encode the input data into a compact representation, and another to decode that representation into the desired output.

In the case of VAEs, the encoder and decoder networks have a specific role and structure. The encoder network takes an input data sample, such as an image, and outputs two vectors: a mean vector and a standard deviation vector. These vectors define a probability distribution over the latent variables, which are assumed to follow a multivariate normal distribution. The decoder network takes a sample from this distribution and outputs a reconstruction of the input data, which is also assumed to follow a probability distribution, such as a Bernoulli distribution for binary images.

The encoder and decoder networks can be implemented using any type of neural network, such as convolutional neural networks (CNNs) for images, recurrent neural networks (RNNs) for text, or transformers for both. The choice of the network architecture depends on the type and complexity of the data. For example, CNNs are good at capturing spatial features and patterns in images, while RNNs are good at capturing temporal dependencies and sequences in text.
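As a rough illustration of these two roles, here is a minimal PyTorch sketch with fully connected layers. The class names, the layer sizes (784 inputs for a flattened 28x28 image, 256 hidden units, a 2-dimensional latent space), and the choice to output the log-variance instead of the standard deviation are all assumptions made for this example. Predicting the log-variance is a common convention because exponentiating it always gives a positive standard deviation.

```python
import torch
from torch import nn


class Encoder(nn.Module):
    """Maps an input x to the mean and log-variance of q(z|x)."""

    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=2):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.log_var = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.mu(h), self.log_var(h)


class Decoder(nn.Module):
    """Maps a latent vector z to per-pixel Bernoulli probabilities."""

    def __init__(self, latent_dim=2, hidden_dim=256, output_dim=784):
        super().__init__()
        self.hidden = nn.Linear(latent_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, output_dim)

    def forward(self, z):
        h = torch.relu(self.hidden(z))
        return torch.sigmoid(self.out(h))


encoder, decoder = Encoder(), Decoder()

x = torch.rand(8, 784)                # a batch of 8 flattened 28x28 "images"
mu, log_var = encoder(x)              # each has shape (8, 2)
std = torch.exp(0.5 * log_var)        # standard deviation, always positive
z = mu + std * torch.randn_like(std)  # one latent sample per input
x_hat = decoder(z)                    # shape (8, 784), values between 0 and 1
```

For image data, the linear layers would typically be replaced by convolutional ones, but the interface stays the same: the encoder returns distribution parameters, and the decoder maps a latent sample back to data space.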

Figure: the encoder-decoder architecture of a VAE. Source: https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf

In the next section, we will explain how the latent variable model works and why it is important for VAEs.

2.2. The Latent Variable Model

The latent variable model is a probabilistic model that assumes that there is some hidden or unobserved variable that influences the observed data. In the context of VAEs, the latent variable is denoted by z, and the observed data is denoted by x. The goal of the latent variable model is to learn the joint distribution of x and z, or equivalently, the conditional distribution of x given z and the prior distribution of z.

The conditional distribution of x given z is also called the likelihood, and it represents how likely the data is to be generated from the latent variable. The prior distribution of z is also called the prior, and it represents the prior knowledge or assumption about the latent variable. For example, we can assume that the latent variable follows a standard normal distribution, which means that it has zero mean and unit variance, and that it is independent of the data.
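In symbols, the likelihood and the prior combine into the joint distribution, and integrating out the latent variable gives the probability that the model assigns to a data point:

$$p(x, z) = p(x|z)\,p(z), \qquad p(x) = \int p(x|z)\,p(z)\,dz$$

Generating a new data point then amounts to two steps: draw z from the prior p(z), then draw x from the likelihood p(x|z).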

The latent variable model is useful for VAEs because it allows us to capture the complex and high-dimensional structure of the data in a lower-dimensional and simpler latent space. By sampling from the latent space, we can generate new data points that resemble the original data. However, the latent variable model also poses a challenge for VAEs, because it is difficult to infer the latent variable from the data. VAEs address this by training the encoder network to approximate that inference. To train the encoder with gradients, they rely on the reparameterization trick, which we will explain in the next section.

2.3. The Reparameterization Trick

The reparameterization trick is a clever technique that allows us to sample from the latent variable distribution in a differentiable way. This is important for VAEs because it enables us to use gradient-based optimization methods to train the encoder and decoder networks.

But why do we need the reparameterization trick in the first place? The reason is that the latent variable distribution is not fixed, but rather defined by the output of the encoder network. Recall that the encoder network outputs two vectors: a mean vector and a standard deviation vector. These vectors define a multivariate normal distribution over the latent variable, and the decoder needs a sample z from this distribution.

The problem is that drawing z directly from this distribution is a random operation, and a random draw is not a differentiable function of the mean and standard deviation. This means that we cannot use the chain rule to propagate the gradient of the loss function through the sampling step to the encoder network parameters. This is a problem because we want to train the encoder network to produce a good latent variable distribution that matches the prior distribution and encodes the relevant information from the data.

The reparameterization trick solves this problem by moving the randomness outside of the encoder network. We sample a noise vector ε from a standard normal distribution, which is fixed and independent of the encoder network parameters, and then compute z with the following formula:

$$z = \mu + \sigma \odot \epsilon$$

where z is the latent variable, μ is the mean vector, σ is the standard deviation vector, ⊙ is the element-wise product, and ε is the random vector sampled from a standard normal distribution.

Computed this way, z has the same distribution as a direct sample, but it is now a deterministic function of the encoder network output and the random vector ε. This means that we can use the chain rule to compute the gradient of the loss function with respect to the encoder network parameters, by treating ε as a constant.
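The following PyTorch sketch shows the formula z = μ + σ ⊙ ε in action and checks that gradients reach the encoder outputs. The numbers for `mu` and `log_var` are hypothetical stand-ins for what an encoder might produce, and the squared-norm "loss" is only a placeholder for any differentiable function of z.

```python
import torch

torch.manual_seed(0)

# Hypothetical encoder outputs for one data point with a 2-D latent space.
mu = torch.tensor([0.5, -0.3], requires_grad=True)
log_var = torch.tensor([-1.0, 0.2], requires_grad=True)

# Reparameterized sample: all randomness lives in eps, which does not
# depend on mu or log_var.
eps = torch.randn(2)
sigma = torch.exp(0.5 * log_var)
z = mu + sigma * eps

# Placeholder loss: any differentiable function of z would do here.
loss = (z ** 2).sum()
loss.backward()

print(mu.grad)       # 2 * z, because dz/dmu = 1
print(log_var.grad)  # 2 * z * 0.5 * sigma * eps, because dz/dlog_var = 0.5 * sigma * eps
```

Had z been drawn directly from a sampler that takes `mu` and `sigma` as inputs and returns a plain random draw, there would be no such path for the gradient to follow.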

In the next section, we will explain how to train VAEs using the reparameterization trick and a special loss function called the evidence lower bound.

3. How to Train Variational Autoencoders?

To train VAEs, we need to define a loss function that measures how well the encoder and decoder networks perform. The loss function consists of two terms: the reconstruction loss and the Kullback-Leibler divergence. The reconstruction loss measures how well the decoder network reconstructs the input data from the latent variable. The Kullback-Leibler divergence measures how much the latent variable distribution deviates from the prior distribution. By minimizing the loss function, we can train the encoder and decoder networks to produce a good latent variable distribution and a good data reconstruction.

Where does this loss function come from? The objective that we would ideally maximize is the log-likelihood of the data, which is the logarithm of the probability that the model assigns to the data given its parameters. It measures how well the model fits the data: the higher the log-likelihood, the better the fit.

The problem is that the log-likelihood is intractable, meaning that we cannot compute it exactly. This is because the log-likelihood involves an integral over the latent variable, which we cannot evaluate analytically. To overcome this problem, we use a lower bound on the log-likelihood, which we can optimize instead. This lower bound is called the evidence lower bound (ELBO), and it is derived using a technique called variational inference.

Variational inference is a method for approximating intractable integrals using a simpler distribution. In the case of VAEs, we use the encoder network output as the simpler distribution, which we call the variational distribution. The variational distribution is also denoted by q(z|x), which means the probability of the latent variable given the data. The variational distribution is also called the inference model, because it infers the latent variable from the data.

The ELBO is defined as follows:

$$\text{ELBO} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \text{KL}(q(z|x)||p(z))$$

The ELBO consists of two terms: the expected log-likelihood and the Kullback-Leibler divergence. The expected log-likelihood is the average of the log-likelihood over the variational distribution. It measures how well the decoder network reconstructs the data from the latent variable. The Kullback-Leibler divergence is the same as the one in the loss function. It measures how much the variational distribution deviates from the prior distribution. The ELBO is a lower bound on the log-likelihood, meaning that the log-likelihood is always greater than or equal to the ELBO. Therefore, by maximizing the ELBO, we can indirectly maximize the log-likelihood.
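To see why the ELBO is a lower bound, the log-likelihood can be decomposed as follows. Here $p(z|x)$ denotes the true posterior of the latent variable, which is intractable and is introduced only for this derivation. The first line uses Bayes' rule, $p(x) = p(x|z)\,p(z) / p(z|x)$, together with the fact that $\log p(x)$ does not depend on z.

$$\begin{aligned}
\log p(x) &= \mathbb{E}_{q(z|x)}\left[\log \frac{p(x|z)\,p(z)}{p(z|x)}\right] \\
&= \mathbb{E}_{q(z|x)}[\log p(x|z)] - \text{KL}(q(z|x)||p(z)) + \text{KL}(q(z|x)||p(z|x)) \\
&= \text{ELBO} + \text{KL}(q(z|x)||p(z|x))
\end{aligned}$$

Since a KL divergence is never negative, the ELBO can never exceed $\log p(x)$. The gap between the two is exactly the divergence between the variational distribution and the true posterior, so the bound is tight when the encoder matches the true posterior.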

The loss function described at the start of this section is simply the negative of the ELBO, meaning that minimizing the loss function is the same as maximizing the ELBO. Because the ELBO is only a lower bound, this pushes the log-likelihood up indirectly rather than optimizing it exactly. This is how we train VAEs using the reparameterization trick and the ELBO.

In the following subsections, we will look more closely at the ELBO and its two terms: the Kullback-Leibler divergence and the reconstruction loss.

3.1. The Evidence Lower Bound

Now that we have seen the basic architecture and components of VAEs, let’s see how we can train them using a special objective function called the evidence lower bound (ELBO).

The ELBO is a measure of how well the VAE can reconstruct the data from the latent variables, while also keeping the latent variables close to a prior distribution, such as a standard normal distribution. The ELBO is defined as follows:

$$\text{ELBO} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x)||p(z))$$

Here, $q_\phi(z|x)$ is the encoder network that outputs the parameters of the approximate posterior distribution of the latent variables given the data, $p_\theta(x|z)$ is the decoder network that outputs the parameters of the likelihood distribution of the data given the latent variables, and $p(z)$ is the prior distribution of the latent variables, which we assume to be a standard normal distribution. $\phi$ and $\theta$ are the parameters of the encoder and decoder networks, respectively. $\text{KL}$ stands for the Kullback-Leibler divergence, which measures the difference between two probability distributions.

The ELBO has two terms: the first term is the expected log-likelihood of the data given the latent variables, which measures how well the VAE can reconstruct the data. The second term is the KL divergence between the approximate posterior and the prior, which measures how close the latent variables are to the prior distribution. The goal of training the VAE is to maximize the ELBO, which means minimizing the reconstruction error and the KL divergence.

But how can we optimize the ELBO? How can we compute the gradients of the ELBO with respect to the network parameters when the latent variables are sampled at random? This is exactly the problem that the reparameterization trick from Section 2.3 solves: because z is written as a deterministic function of the encoder output and an independent noise vector, gradients can flow through the sampling step. In the next two sections, we will look at the two terms of the ELBO in turn, starting with the KL divergence.

3.2. The Kullback-Leibler Divergence

In the previous section, we saw that the ELBO has two terms: the expected log-likelihood of the data given the latent variables, and the KL divergence between the approximate posterior and the prior. In this section, we will focus on the second term, the Kullback-Leibler divergence (KL divergence), and see why it is important and how to compute it.

The KL divergence is a measure of how different two probability distributions are. It is defined as follows:

$$\text{KL}(q||p) = \mathbb{E}_{q(z)}[\log q(z) - \log p(z)]$$

Here, $q(z)$ and $p(z)$ are two probability distributions over the same variable $z$. The KL divergence is always non-negative, and it is zero if and only if $q(z) = p(z)$ for all $z$. The KL divergence is not symmetric, meaning that $\text{KL}(q||p) \neq \text{KL}(p||q)$ in general.
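These properties are easy to check numerically. The sketch below uses two small discrete distributions, which is an assumption made for simplicity since the latent variables in a VAE are continuous. For discrete distributions, the expectation in the definition becomes a sum. The helper name `kl_divergence` and the example probabilities are made up for this illustration.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions given as lists of probabilities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


q = [0.9, 0.1]
p = [0.5, 0.5]

print(kl_divergence(q, p))  # positive
print(kl_divergence(p, q))  # positive, but a different value: KL is not symmetric
print(kl_divergence(q, q))  # 0.0, because the two distributions are identical
```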

But why do we need the KL divergence in the ELBO? What does it do? The KL divergence acts as a regularization term that prevents the VAE from overfitting the data. It does so by enforcing that the latent variables follow a simple and smooth distribution, such as a standard normal distribution. This way, the VAE can learn a more general and robust representation of the data, rather than memorizing the specific details of each data point.

However, computing the KL divergence can be challenging, especially when the distributions $q(z|x)$ and $p(z)$ are complex and non-standard. Fortunately, there is a trick that can simplify the computation of the KL divergence for VAEs. The trick is to assume that the distributions $q(z|x)$ and $p(z)$ are both multivariate Gaussian distributions, with diagonal covariance matrices. This means that the latent variables are independent and normally distributed, with some mean and variance parameters. Under this assumption, the KL divergence has a closed-form solution, which is given by:

$$\text{KL}(q_\phi(z|x)||p(z)) = -\frac{1}{2}\sum_{j=1}^d(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2)$$

Here, $d$ is the dimension of the latent space, and $\mu_j$ and $\sigma_j^2$ are the mean and variance parameters of the $j$-th latent variable, as output by the encoder network. The prior distribution $p(z)$ is assumed to have zero mean and unit variance for each latent variable.
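As a sanity check, the closed-form expression can be compared against the KL divergence that PyTorch computes for the same pair of Gaussians. The values of `mu` and `log_var` below are hypothetical encoder outputs for a single data point with d = 3, and the encoder is assumed to output the log-variance $\log \sigma_j^2$ directly.

```python
import torch
from torch.distributions import Normal, kl_divergence

# Hypothetical encoder outputs for one data point with d = 3 latent dimensions.
mu = torch.tensor([0.5, -0.3, 1.2])
log_var = torch.tensor([-1.0, 0.2, 0.0])
sigma = torch.exp(0.5 * log_var)

# Closed-form expression from the text, written in terms of the log-variance.
kl_closed_form = -0.5 * torch.sum(1 + log_var - mu ** 2 - log_var.exp())

# Reference value: per-dimension KL terms from torch.distributions, summed.
q = Normal(mu, sigma)
p = Normal(torch.zeros(3), torch.ones(3))
kl_reference = kl_divergence(q, p).sum()

print(kl_closed_form.item(), kl_reference.item())  # the two values agree
```

Summing the per-dimension terms is valid here because both distributions have diagonal covariance, so the latent dimensions are independent.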

By using this formula, we can easily compute the KL divergence term of the ELBO, and use it to train the VAE. The remaining challenge, the stochasticity of the latent variables when computing the gradients of the ELBO with respect to the network parameters, is handled by the reparameterization trick introduced in Section 2.3. In the next section, we turn to the other term of the ELBO.

3.3. The Reconstruction Loss

The reconstruction term is the other term of the ELBO, which measures how well the VAE can reconstruct the data from the latent variables. It is defined as follows:

$$\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$$

Here, $q_\phi(z|x)$ is the encoder network that outputs the parameters of the approximate posterior distribution of the latent variables given the data, and $p_\theta(x|z)$ is the decoder network that outputs the parameters of the likelihood distribution of the data given the latent variables. $\phi$ and $\theta$ are the parameters of the encoder and decoder networks, respectively.

This term is the expected log-likelihood of the data given the latent variables, which means that it measures how likely the data is under the decoder network. The higher it is, the better the VAE can reproduce the data. The reconstruction loss that we minimize during training is the negative of this term, so a lower reconstruction loss means a better reconstruction. The reconstruction loss depends on the choice of the likelihood distribution, which should match the type and range of the data. For example, if the data is binary, such as black-and-white images, we can use a Bernoulli distribution as the likelihood, which leads to the binary cross-entropy loss. If the data is continuous, such as color images, we can use a Gaussian distribution as the likelihood, which leads to a squared-error loss when the variance is fixed.
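For the Bernoulli case, the link between the log-likelihood and the loss used in practice can be verified directly. In the sketch below, `x` and `x_hat` are made-up values for four pixels: `x` holds the true binary pixels, and `x_hat` holds the probabilities a decoder might output for one sampled z. The check shows that the binary cross-entropy is the log-likelihood with the sign flipped.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1.0, 0.0, 1.0, 0.0])      # binary pixels
x_hat = torch.tensor([0.9, 0.2, 0.6, 0.1])  # decoder's Bernoulli probabilities

# Bernoulli log-likelihood: sum over pixels of x*log(x_hat) + (1-x)*log(1-x_hat).
log_likelihood = torch.sum(x * torch.log(x_hat) + (1 - x) * torch.log(1 - x_hat))

# Binary cross-entropy, summed over pixels, is the same quantity with the sign flipped.
bce = F.binary_cross_entropy(x_hat, x, reduction="sum")

print(log_likelihood.item(), bce.item())
```

With a single sampled z per data point, this value serves as a one-sample estimate of the expectation over the encoder distribution.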

To compute the reconstruction loss, we need to sample the latent variables from the encoder network, and pass them to the decoder network. Then, we need to compare the output of the decoder network with the original data, and compute the log-likelihood of the data under the decoder network. However, sampling the latent variables introduces stochasticity, which makes the gradient computation difficult. To solve this problem, we use the reparameterization trick from Section 2.3.
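Putting the pieces of this section together, a single training step might look like the sketch below. It is a minimal illustration, not a tuned implementation: the `VAE` class, the `negative_elbo` helper, the layer sizes, the learning rate, and the random binary batch standing in for real data such as MNIST are all assumptions. The likelihood is taken to be Bernoulli and the prior a standard normal.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)


class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=128, latent_dim=2):
        super().__init__()
        self.enc = nn.Linear(input_dim, hidden_dim)
        self.enc_mu = nn.Linear(hidden_dim, latent_dim)
        self.enc_log_var = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, log_var = self.enc_mu(h), self.enc_log_var(h)
        # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.dec(z), mu, log_var


def negative_elbo(x, x_hat, mu, log_var):
    # Reconstruction loss: negative Bernoulli log-likelihood, summed over pixels.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # KL term: closed form for a diagonal Gaussian against a standard normal prior.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return (recon + kl) / x.shape[0]  # average over the batch


model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(16, 784).round()  # random binary "images" standing in for real data
x_hat, mu, log_var = model(x)
loss = negative_elbo(x, x_hat, mu, log_var)

optimizer.zero_grad()
loss.backward()   # gradients reach the encoder thanks to the reparameterization
optimizer.step()
print(f"negative ELBO per example: {loss.item():.2f}")
```

In a real training run, the last block would sit inside a loop over mini-batches from a data loader and be repeated for several epochs.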

4. How to Generate Images with Variational Autoencoders?

One of the main applications and advantages of VAEs is that they can generate realistic and diverse images and data samples. In this section, we will see how we can use VAEs to generate images, and which methods we can use to control and improve the quality and diversity of the generated images.

To generate images with VAEs, we need to use the decoder network, which takes a latent variable as input and outputs the parameters of the likelihood distribution of the image. The decoder network acts as a generator that can produce images from latent variables. However, the question is: how do we choose the latent variables? Where do we get them from?

There are three main ways to obtain the latent variables for image generation: sampling from the prior distribution, interpolating in the latent space, and conditional image generation. We will explain each of these methods in the following subsections.

4.1. Sampling from the Prior Distribution

The simplest way to generate images with VAEs is to sample the latent variables from the prior distribution, which we assume to be a standard normal distribution. This means that we can randomly draw a vector of numbers from a normal distribution with zero mean and unit variance, and use it as the input for the decoder network. The decoder network will then output the parameters of the likelihood distribution of the image, which we can use to sample or reconstruct the image.

For example, suppose we have a VAE that is trained on the MNIST dataset of handwritten digits. The latent space of the VAE is two-dimensional, meaning that each latent variable is a vector of two numbers. We can sample a random vector from a normal distribution, such as [0.5, -0.3], and pass it to the decoder network. The decoder network will output the parameters of a Bernoulli distribution for each pixel of the image, which we can use to sample a binary image of a digit. If the VAE is well trained, the image should resemble a handwritten digit, and repeating the process with different random vectors should give a variety of digits and styles.
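In PyTorch, this procedure could be sketched as follows. The decoder here is an untrained stand-in with made-up layer sizes, so its outputs will look like noise. With a trained model, the same three steps (sample z, decode, sample or display the pixels) apply unchanged.

```python
import torch
from torch import nn

torch.manual_seed(0)

latent_dim = 2

# Untrained stand-in for a trained decoder (an assumption for this example).
decoder = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),
)
decoder.eval()

with torch.no_grad():
    z = torch.randn(16, latent_dim)  # 16 draws from the standard normal prior
    probs = decoder(z)               # per-pixel Bernoulli parameters, shape (16, 784)

    # Option 1: display the probabilities themselves as grayscale images.
    mean_images = probs.view(16, 28, 28)
    # Option 2: sample each pixel to obtain strictly binary images.
    binary_images = torch.bernoulli(probs).view(16, 28, 28)
```

Displaying the probabilities usually gives smoother pictures than sampling every pixel independently, which is why many tutorials show the former.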

Sampling from the prior distribution is a simple and effective way to generate images with VAEs, but it has some limitations. First, it does not allow us to control or manipulate the characteristics of the generated images, such as the style, orientation, or thickness of the handwriting. Second, it does not guarantee that the generated images will be meaningful or coherent, as some regions of the latent space may not correspond to any valid image. Third, it does not take advantage of the existing data, as it does not use the encoder network or the latent variables of the data points.

To overcome these limitations, we can use other methods to obtain the latent variables for image generation, such as interpolating in the latent space, or conditional image generation. We will discuss these methods in the next subsections.

4.2. Interpolating in the Latent Space

A more interesting and creative way to generate images with VAEs is to interpolate in the latent space. This means that we can take two latent variables that correspond to two different images, and create a smooth transition between them by changing the latent variables gradually. This way, we can generate new images that combine the features of the original images, and explore the diversity and continuity of the latent space.

For example, suppose we have a VAE that is trained on the MNIST dataset of handwritten digits. The latent space of the VAE is two-dimensional, meaning that each latent variable is a vector of two numbers. We can take two latent variables that correspond to two different digits, such as [0.5, -0.3] for digit 3 and [-0.2, 0.4] for digit 7. We can then interpolate between these two latent variables by using a linear combination of them whose weights sum to one, which gives points on the line between them such as [0.3, -0.1] or [0.1, 0.1]. We can pass these interpolated latent variables to the decoder network, and generate new images of digits that look like a mixture of 3 and 7.
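The linear combination can be written as z(t) = (1 - t) · z_a + t · z_b, where t runs from 0 to 1. Below is a small sketch of this using plain Python lists and the two example vectors from the text. The helper name `interpolate` and the choice of five steps are assumptions for illustration.

```python
def interpolate(z_a, z_b, steps):
    """Return `steps` evenly spaced points on the line segment from z_a to z_b."""
    points = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0 (at z_a) to 1 (at z_b)
        points.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return points


z_three = [0.5, -0.3]  # latent vector of a "3" (values from the text)
z_seven = [-0.2, 0.4]  # latent vector of a "7" (values from the text)

path = interpolate(z_three, z_seven, steps=5)
for z in path:
    print([round(v, 3) for v in z])
```

Each point on the path would then be passed to the decoder network to produce one frame of the transition from a 3 to a 7.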

Interpolating in the latent space is a fun and useful way to generate images with VAEs, and it comes with both advantages and disadvantages. The advantages are that it gives us some control over the characteristics of the generated images, such as the style, orientation, or thickness of the handwriting. It also takes advantage of the existing data, as it uses the latent variables of real data points. The disadvantages are that it requires access to the encoder network to obtain those latent variables, which may not be available in some scenarios. It also requires us to choose the endpoints carefully, as some combinations may not result in meaningful or coherent images.

4.3. Conditional Image Generation

A more advanced and flexible way to generate images with VAEs is to use conditional image generation. This means that we can generate images that satisfy some given conditions, such as the class, label, or attribute of the image. For example, we can generate images of digits that have a specific value, such as 0, 1, or 9. Or we can generate images of faces that have a specific feature, such as smiling, wearing glasses, or having blond hair.

How can we achieve this? How can we incorporate the conditions into the VAE model? The answer is to use a conditional variational autoencoder (CVAE), which is a variant of the VAE that can take additional inputs as conditions. The CVAE model has a similar architecture to the VAE, but with some modifications. The encoder network takes both the data and the condition as inputs, and outputs the parameters of the approximate posterior distribution of the latent variables given the data and the condition. The decoder network takes both the latent variable and the condition as inputs, and outputs the parameters of the likelihood distribution of the data given the latent variable and the condition. The prior distribution of the latent variables is also conditioned on the condition.

The CVAE model can be trained using the same ELBO objective function as the VAE, but with the additional condition inputs. The ELBO for the CVAE is defined as follows:

$$\text{ELBO} = \mathbb{E}_{q_\phi(z|x,c)}[\log p_\theta(x|z,c)] - \text{KL}(q_\phi(z|x,c)||p(z|c))$$

Here, $q_\phi(z|x,c)$ is the encoder network that outputs the parameters of the approximate posterior distribution of the latent variables given the data and the condition, $p_\theta(x|z,c)$ is the decoder network that outputs the parameters of the likelihood distribution of the data given the latent variable and the condition, and $p(z|c)$ is the prior distribution of the latent variables given the condition. $\phi$ and $\theta$ are the parameters of the encoder and decoder networks, respectively. $c$ is the condition input, which can be a vector, a label, or an image.

To generate images with the CVAE, we need to specify the condition that we want the image to satisfy, and sample the latent variable from the prior distribution given the condition. Then, we pass both the latent variable and the condition to the decoder network, and sample or reconstruct the image from the output of the decoder network. With a well-trained model, the generated images should match the given condition while still varying in other respects, such as the handwriting style.
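One simple way to feed the condition to the decoder is to concatenate it with the latent vector, as in the PyTorch sketch below. Several things here are assumptions: the condition is a digit label encoded as a one-hot vector, the decoder is an untrained stand-in with made-up layer sizes, and the conditional prior p(z|c) is simplified to a standard normal that ignores c.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

latent_dim, num_classes = 2, 10

# Untrained stand-in for a CVAE decoder: its input is the latent vector
# concatenated with a one-hot encoding of the condition.
decoder = nn.Sequential(
    nn.Linear(latent_dim + num_classes, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),
)
decoder.eval()

labels = torch.full((8,), 7, dtype=torch.long)           # ask for eight images of the digit 7
c = F.one_hot(labels, num_classes=num_classes).float()   # shape (8, 10)
z = torch.randn(8, latent_dim)                           # assumes p(z|c) = N(0, I)

with torch.no_grad():
    probs = decoder(torch.cat([z, c], dim=1))            # shape (8, 784)
```

The encoder of a CVAE can be conditioned the same way during training, by concatenating the one-hot vector with the flattened input image.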

5. Conclusion

In this blog, we have learned about the fundamentals of probabilistic deep learning and variational autoencoders. We have seen how VAEs can learn to generate realistic and diverse images and data samples, using a latent variable model and a neural network architecture. We have also seen how we can train VAEs using the evidence lower bound objective function, and how we can use the reparameterization trick to deal with the stochasticity of the latent variables. Finally, we have seen how we can generate images with VAEs, using different methods such as sampling from the prior distribution, interpolating in the latent space, and conditional image generation.

We hope that this blog has given you a clear and intuitive understanding of VAEs, and that you are inspired to explore more applications and extensions of this powerful technique. VAEs are one of the most popular and widely used methods in probabilistic deep learning, and they have many potential uses in fields such as computer vision, natural language processing, audio synthesis, and more. If you want to learn more about VAEs, we recommend the following resources:

Auto-Encoding Variational Bayes, the original paper that introduced VAEs by Kingma and Welling.
Tutorial on Variational Autoencoders, a comprehensive and accessible tutorial on VAEs by Doersch.
PyTorch VAE example, a simple and easy-to-follow implementation of VAEs using PyTorch.

Thank you for reading this blog, and we hope you enjoyed it. If you have any questions, comments, or feedback, please feel free to leave them below. We would love to hear from you and learn from your experience. Happy learning!