Probabilistic Deep Learning Fundamentals: Normalizing Flows

This blog introduces the concept of normalizing flows, a powerful technique for probabilistic modeling and density estimation. It explains how normalizing flows work, what types of flows exist, and how they can be used in deep learning applications.

1. Introduction

Probabilistic deep learning is a branch of machine learning that deals with modeling uncertainty and learning from data using probabilistic methods. It has many applications in fields such as computer vision, natural language processing, generative modeling, reinforcement learning, and more.

One of the key challenges in probabilistic deep learning is density estimation, which is the task of learning the probability distribution of a given dataset. Density estimation can be useful for tasks such as anomaly detection, data compression, data generation, and inference.

However, density estimation is often difficult because the true distribution of the data may be complex and high-dimensional, and the standard parametric models may not be flexible enough to capture its structure. This is where normalizing flows come in handy.

Normalizing flows are a class of models that can transform a simple distribution, such as a Gaussian, into a more complex one, such as a mixture of Gaussians, by applying a sequence of invertible transformations. By doing so, they can learn to approximate any arbitrary distribution, given enough data and transformations.

In this blog, you will learn the fundamentals of normalizing flows, how they work, what types of flows exist, and how they can be used in deep learning applications. You will also see some examples of normalizing flows in action, using Python and PyTorch.

Are you ready to dive into the world of normalizing flows? Let’s get started!

2. Probabilistic Modeling and Density Estimation

Before we dive into normalizing flows, let’s first review some basic concepts of probabilistic modeling and density estimation. These concepts will help us understand the motivation and the benefits of using normalizing flows.

Probabilistic modeling is the process of building a mathematical representation of a phenomenon that involves uncertainty. For example, you may want to model the weather, the stock market, the behavior of a user, or the outcome of an experiment. Probabilistic models can help you describe, analyze, and predict these phenomena using probability theory.

One of the main goals of probabilistic modeling is to learn the probability distribution of the data, which is a function that assigns a probability to each possible outcome. For example, if you have a dataset of images of cats and dogs, you may want to learn the probability distribution that tells you how likely each image is to be a cat or a dog.

Density estimation is the task of learning the probability distribution of a given dataset from a finite number of samples. For example, if you have a dataset of 1000 images of cats and dogs, you may want to estimate the probability distribution that generated those images.

Why is density estimation useful? Here are some reasons:

  • It can help you understand the structure and the characteristics of the data, such as the mean, the variance, the mode, the outliers, etc.
  • It can help you generate new data that resembles the original data, by sampling from the estimated distribution.
  • It can help you perform inference, which is the process of answering questions about the data or the underlying phenomenon, such as the likelihood, the posterior, the evidence, etc.

However, density estimation is not an easy task, especially when the data is high-dimensional and complex. The true distribution of the data may be unknown, non-parametric, or intractable. This means that you cannot easily write down a formula or compute the probability of a given outcome.

How can you overcome this challenge? One way is to use normalizing flows, which are a powerful technique for transforming a simple distribution into a complex one, by applying a sequence of invertible transformations. By doing so, you can approximate any arbitrary distribution, given enough data and transformations.

In the next section, we will see how normalizing flows work, and what properties they have.

3. Normalizing Flows: Definition and Properties

In this section, we will define what normalizing flows are, and what properties they have. We will also see some examples of normalizing flows and how they can transform simple distributions into complex ones.

A normalizing flow is a sequence of invertible transformations that maps a simple distribution, such as a Gaussian, to a more complex one, such as a mixture of Gaussians. The idea is to start from a base distribution that is easy to sample from and compute probabilities with, and apply a series of transformations that change its shape and complexity, while preserving its probability mass.

Mathematically, a normalizing flow can be written as:

$$z_K = f_K \circ f_{K-1} \circ \dots \circ f_1 (z_0)$$

where $z_0$ is a random variable from the base distribution, $z_K$ is a random variable from the target distribution, and $f_1, \dots, f_K$ are invertible transformations. Each transformation $f_k$ is also called a flow layer, and the number of transformations $K$ is called the flow depth.
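
To make the composition concrete, here is a tiny sketch of a depth-2 flow using transforms that ship with PyTorch. The specific layers (an affine shift-and-scale followed by a sigmoid) are arbitrary choices for illustration; computing probabilities with the resulting distribution relies on the change of variables formula discussed next:

import torch
import torch.distributions as dist
import torch.distributions.transforms as T

base = dist.Normal(0., 1.)                       # p(z_0): the base distribution
flow = T.ComposeTransform([
    T.AffineTransform(loc=1., scale=0.5),        # f_1(z_0) = 0.5 * z_0 + 1
    T.SigmoidTransform(),                        # f_2(z_1) = sigmoid(z_1)
])
target = dist.TransformedDistribution(base, flow)

z_K = target.sample((5,))                        # sample by applying f_2(f_1(z_0))
print(target.log_prob(z_K))                      # densities via the change of variables formula (introduced next)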

Why are normalizing flows useful for density estimation? Because they allow us to compute the probability of any outcome in the target distribution, using the change of variables formula. This formula relates the probability of a random variable before and after a transformation, using the determinant of the Jacobian matrix of the transformation. The Jacobian matrix is a matrix that contains the partial derivatives of the transformation with respect to the input variables.

The change of variables formula for a normalizing flow is:

$$p(z_K) = p(z_0) \prod_{k=1}^K \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|^{-1}$$

where $p(z_K)$ is the probability of the target distribution, $p(z_0)$ is the probability of the base distribution, and $\frac{\partial f_k}{\partial z_{k-1}}$ is the Jacobian matrix of the $k$-th transformation. The absolute value is needed because the determinant of an invertible transformation can be negative (it is never zero), and the inverse accounts for how much each transformation expands or contracts volume.

This formula tells us that we can compute the probability of any outcome in the target distribution, by multiplying the probability of the corresponding outcome in the base distribution, and the determinants of the Jacobian matrices of all the transformations. This is very convenient, because we can choose a base distribution that is easy to work with, such as a Gaussian, and use normalizing flows to transform it into any arbitrary distribution, while still being able to compute probabilities.
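
In practice, this product is almost always evaluated in log space, which turns it into a sum; this log-density is the quantity that flow-based models maximize during training:

$$\log p(z_K) = \log p(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|$$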

However, not all transformations are suitable for normalizing flows. They have to satisfy two properties:

  • They have to be invertible, meaning that we can undo them and go back to the original distribution. This is important because we want to be able to sample from the target distribution, by sampling from the base distribution and applying the transformations forward, and also compute probabilities in the target distribution, by applying the transformations backward and using the change of variables formula.
  • They have to have a tractable Jacobian determinant, meaning that we can compute the determinant of the Jacobian matrix efficiently and accurately. This is important because we need the determinant to compute the probability in the target distribution, and we don’t want to spend too much time or memory on this computation. (A minimal code sketch of a transformation with both properties follows this list.)
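
To make these two requirements concrete, here is a minimal sketch of a flow layer written against PyTorch's Transform interface. The layer is simply $y = \exp(x)$ (PyTorch in fact ships this as torch.distributions.transforms.ExpTransform); the class name ExpFlow is illustrative:

import torch
from torch.distributions import constraints
from torch.distributions.transforms import Transform

class ExpFlow(Transform):
    """y = exp(x): invertible, with a Jacobian determinant that is cheap to compute."""
    domain = constraints.real
    codomain = constraints.positive
    bijective = True

    def _call(self, x):
        # forward direction: push base samples through f to sample from the target
        return x.exp()

    def _inverse(self, y):
        # backward direction: needed to evaluate the density of observed data
        return y.log()

    def log_abs_det_jacobian(self, x, y):
        # d exp(x)/dx = exp(x), so log|det J| = x (elementwise)
        return x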

These properties limit the choice of transformations that we can use for normalizing flows. However, there are still many types of transformations that satisfy them and that can create rich and diverse distributions. In the following sections, we will look more closely at the change of variables formula and then see examples of these transformations implemented in Python and PyTorch.

4. Change of Variables Formula and Invertible Transformations

In the previous section, we saw that normalizing flows are a sequence of invertible transformations that map a simple distribution to a more complex one, and that they allow us to compute the probability of any outcome in the target distribution using the change of variables formula. In this section, we will see how the change of variables formula works, and what types of invertible transformations we can use for normalizing flows.

The change of variables formula is a mathematical result that relates the probability of a random variable before and after a transformation, using the determinant of the Jacobian matrix of the transformation. The Jacobian matrix is a matrix that contains the partial derivatives of the transformation with respect to the input variables.

For example, suppose we have a random variable $x$ that follows a Gaussian distribution with mean 0 and standard deviation 1, and we apply a transformation $y = f(x) = x + 2$. This transformation shifts the distribution of $x$ by 2 units to the right, so that the mean becomes 2 and the standard deviation remains 1. How can we compute the probability of $y$ given the probability of $x$?

We can use the change of variables formula, which says:

$$p(y) = p(x) \left| \det \frac{\partial f}{\partial x} \right|^{-1}$$

where $p(y)$ is the probability of $y$, $p(x)$ is the probability of $x$, and $\frac{\partial f}{\partial x}$ is the Jacobian matrix of the transformation. In this case, the Jacobian matrix is just a scalar, because we have a one-dimensional transformation. The partial derivative of $f$ with respect to $x$ is 1, because the transformation is linear. Therefore, the determinant of the Jacobian matrix is also 1, and the inverse is 1. So, the formula simplifies to:

$$p(y) = p(x)$$

This means that the probability of $y$ is the same as the probability of $x$, which makes sense, because the transformation does not change the shape of the distribution, only the location. For example, the probability of $y = 3$ is the same as the probability of $x = 1$, because both values are one standard deviation away from the mean.
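
If you want to check this numerically, the identity is easy to verify with PyTorch's built-in transforms (a quick sketch, not part of the original example):

import torch
import torch.distributions as dist
import torch.distributions.transforms as T

p_x = dist.Normal(0., 1.)
p_y = dist.TransformedDistribution(p_x, [T.AffineTransform(loc=2., scale=1.)])  # y = x + 2

print(p_y.log_prob(torch.tensor(3.)).exp())  # density of y at 3
print(p_x.log_prob(torch.tensor(1.)).exp())  # density of x at 1 -- the same value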

However, not all transformations are as simple as this one. Some transformations may change the shape of the distribution, as well as the location and the scale. For example, suppose we have a random variable $x$ that follows a Gaussian distribution with mean 0 and standard deviation 1, and we apply a transformation $y = f(x) = \exp(x)$. This transformation maps the distribution of $x$ to a log-normal distribution, which is supported on the positive reals and has a different shape than the Gaussian distribution. How can we compute the probability of $y$ given the probability of $x$?

We can still use the change of variables formula, but now the Jacobian is no longer constant. It is still a scalar, because the transformation is one-dimensional, but it depends on $x$: the derivative of $f$ with respect to $x$ is $\exp(x)$, because the transformation is exponential. Therefore, the determinant of the Jacobian is $\exp(x)$, and its inverse is $\exp(-x)$. So, the formula becomes:

$$p(y) = p(x) \exp(-x)$$

This means that the probability of $y$ is the probability of the corresponding $x = \ln(y)$, multiplied by $\exp(-x)$. For example, the probability of $y = 1$ is the probability of $x = 0$ multiplied by $\exp(0) = 1$, and the probability of $y = 2$ is the probability of $x = \ln(2)$ multiplied by $\exp(-\ln(2)) = \frac{1}{2}$.
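
The same kind of numerical check works here (again a small sketch): PyTorch's ExpTransform implements exactly this transformation, and its density matches the hand-derived formula.

import torch
import torch.distributions as dist
import torch.distributions.transforms as T

p_x = dist.Normal(0., 1.)
p_y = dist.TransformedDistribution(p_x, [T.ExpTransform()])  # y = exp(x), i.e. a log-normal

y = torch.tensor(2.)
print(p_y.log_prob(y).exp())                    # density of y at 2
print((p_x.log_prob(y.log()) - y.log()).exp())  # p_x(ln 2) * exp(-ln 2) -- the same value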

As you can see, the change of variables formula allows us to compute the probability of any outcome in the target distribution, given the probability of the corresponding outcome in the base distribution, and the determinant of the Jacobian matrix of the transformation. This is very useful for density estimation, because we can choose a base distribution that is easy to work with, such as a Gaussian, and use normalizing flows to transform it into any arbitrary distribution, while still being able to compute probabilities.

However, not all transformations are suitable for normalizing flows. They have to satisfy two properties: they have to be invertible, and they have to have a tractable Jacobian determinant. In the next section, we will see some examples of transformations that satisfy these properties, and how they can be implemented in Python and PyTorch.

5. Types of Normalizing Flows and Examples

In the previous section, we saw that normalizing flows are a sequence of invertible transformations that map a simple distribution to a more complex one, and that they allow us to compute the probability of any outcome in the target distribution using the change of variables formula. In this section, we will see some examples of transformations that satisfy the properties of normalizing flows, and how they can be implemented in Python and PyTorch.

There are many types of transformations that can be used for normalizing flows, each with its own advantages and disadvantages. Some of the most common ones are:

  • Affine transformations: These are linear transformations that involve scaling, shifting, rotating, or skewing the distribution. They are easy to implement and invert, but they are not very flexible, as they cannot change the topology of the distribution. For example, an affine transformation cannot turn a Gaussian distribution into a multimodal distribution.
  • Planar transformations: These are nonlinear transformations that add to the input a learned direction scaled by a nonlinearity of a linear projection of the input (their standard form is given right after this list). They are more flexible than affine transformations, as they can create non-Gaussian shapes; their Jacobian determinant has a simple closed form, but they have no analytic inverse.
  • Radial transformations: These are nonlinear transformations that expand or contract the density around a learned reference point, as a function of the distance from that point. They are similar in spirit to planar transformations, but they create radially symmetric distortions, such as rings or craters.
  • Coupling transformations: These are nonlinear transformations that split the variables into two parts and transform one part with parameters produced by a neural network that reads the other part. They are flexible and powerful, can create complex dependencies between the variables, and remain easy to invert; their Jacobian is triangular, so its determinant is cheap to compute.
  • Autoregressive transformations: These are nonlinear transformations that transform each variable conditioned on the variables before it in a fixed ordering. They are similar to coupling transformations, but use a per-dimension ordering instead of a fixed split. Their Jacobian is also triangular, so the determinant is cheap, but inverting them requires a sequential pass over the dimensions, which makes one direction (sampling or density evaluation) slow.
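
For reference, the planar and radial layers listed above are usually written in the form introduced by Rezende and Mohamed (2015), where $u$, $w$, $b$, $\alpha$, $\beta$ and the reference point $z_r$ are learnable parameters and $h$ is a smooth nonlinearity such as $\tanh$:

$$f_{\text{planar}}(z) = z + u \, h(w^\top z + b), \qquad f_{\text{radial}}(z) = z + \beta \, \frac{z - z_r}{\alpha + \lVert z - z_r \rVert}$$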

These are just some of the types of transformations that can be used for normalizing flows. There are many more, such as spline flows (for example, neural spline flows), invertible 1×1 convolutions, and continuous-time flows based on ordinary differential equations. The choice of transformation depends on the characteristics of the data and the desired complexity of the model.

Now, let’s see how we can implement some of these transformations in Python and PyTorch. We will use the torch.distributions.transforms module, which provides basic invertible transforms (affine, exponential, sigmoid, and a few others), together with the torch.distributions.TransformedDistribution class, which creates a new distribution from a base distribution and a sequence of transformations. Richer flow layers such as planar, radial, coupling, and autoregressive transforms are not part of core PyTorch; for those examples we will assume the Pyro library (pyro.distributions.transforms), whose transforms follow the same interface.

First, let’s import the necessary modules and define some helper functions:

import torch
import torch.distributions as dist
import torch.distributions.transforms as T
import matplotlib.pyplot as plt

# A helper function to plot the density of a 2D distribution on a grid
def plot_dist(d, title, ax=None):
    if ax is None:
        fig, ax = plt.subplots()
    x1 = torch.linspace(-4, 4, 100)
    x2 = torch.linspace(-4, 4, 100)
    X1, X2 = torch.meshgrid(x1, x2, indexing="ij")
    X = torch.stack([X1, X2], dim=-1)      # shape (100, 100, 2)
    with torch.no_grad():
        p = d.log_prob(X).exp()            # requires d to have event_shape (2,)
    ax.contourf(X1.numpy(), X2.numpy(), p.numpy(), cmap="Blues")
    ax.set_title(title)
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")

# A helper function to draw samples from a 2D distribution
def sample_dist(d, num_samples):
    samples = d.sample((num_samples,))
    x1 = samples[:, 0]
    x2 = samples[:, 1]
    return x1, x2

Next, let’s define a base distribution, which will be a standard two-dimensional Gaussian. We wrap it in Independent so that PyTorch treats the two coordinates as a single event (event_shape (2,)), which is what plot_dist expects:

base_dist = dist.Independent(dist.Normal(torch.zeros(2), torch.ones(2)), 1)
plot_dist(base_dist, "Base distribution")
plt.show()

Now, let’s apply an affine transformation to the base distribution, which will shift its mean to (1, -1) and scale the two dimensions by 2 and 0.5 (note that AffineTransform applies an elementwise shift and scale, so it cannot rotate the distribution):

affine_transform = T.AffineTransform(loc=torch.tensor([1., -1.]), scale=torch.tensor([2., 0.5]), event_dim=1)
affine_dist = dist.TransformedDistribution(base_dist, [affine_transform])
plot_dist(affine_dist, "Affine transformation")
plt.show()

As you will see, the affine transformation changed the location and the scale of the distribution, but not the shape. The distribution is still Gaussian, but with a different mean and covariance matrix.

Next, let’s apply a planar transformation to the base distribution. Planar layers are not part of core torch.distributions, so this example assumes Pyro’s implementation (pyro.distributions.transforms), which adds a learnable bend of the form $z + u \tanh(w^\top z + b)$. Because the planar map has no closed-form inverse, we visualize samples rather than grid densities:

import pyro.distributions.transforms as PT  # assumes pyro-ppl is installed

planar_transform = PT.planar(input_dim=2)   # randomly initialized, learnable parameters u, w, b
planar_dist = dist.TransformedDistribution(base_dist, [planar_transform])
x1, x2 = sample_dist(planar_dist, 2000)
plt.scatter(x1.numpy(), x2.numpy(), s=2)
plt.title("Planar transformation")
plt.show()

As you will see, the planar transformation bends the Gaussian into a non-Gaussian shape. With randomly initialized parameters the exact shape varies from run to run; after training, such layers typically skew the distribution or stretch it along one direction.

Next, let’s apply a radial transformation to the base distribution, which expands or contracts the density around a reference point. Again this assumes Pyro’s implementation, and again we visualize samples, since the inverse is not available in closed form:

radial_transform = PT.radial(input_dim=2)   # randomly initialized reference point and scale parameters
radial_dist = dist.TransformedDistribution(base_dist, [radial_transform])
x1, x2 = sample_dist(radial_dist, 2000)
plt.scatter(x1.numpy(), x2.numpy(), s=2)
plt.title("Radial transformation")
plt.show()

As you will see, the radial transformation reshapes the distribution around its reference point, typically producing ring- or crater-like patterns, for example by pushing mass away from the center; the exact shape depends on the (random or learned) parameters.

Next, let’s apply a coupling transformation to the base distribution. A coupling layer splits the variables into two groups and applies an affine transformation to the second group, with scale and shift parameters produced by a small neural network that reads the first group. Coupling layers are also not in core torch.distributions, so this example assumes Pyro’s affine coupling layer, which builds the conditioning network for us:

coupling_transform = PT.affine_coupling(input_dim=2)   # affine coupling with a default conditioning network
coupling_dist = dist.TransformedDistribution(base_dist, [coupling_transform])
x1, x2 = sample_dist(coupling_dist, 2000)
plt.scatter(x1.numpy(), x2.numpy(), s=2)
plt.title("Coupling transformation")
plt.show()

As you will see, the coupling transformation makes the second variable depend on the first, which warps the Gaussian into skewed or curved shapes. A single untrained layer only distorts the distribution mildly; stacking several coupling layers and training them is what produces genuinely complex, multimodal distributions.

Finally, let’s apply an autoregressive transformation to the base distribution, which transforms each variable conditioned on the variables before it in a fixed ordering, again using a small conditioning network. As before, this assumes Pyro’s implementation:

autoregressive_transform = PT.affine_autoregressive(input_dim=2)   # affine autoregressive layer with a default conditioner
autoregressive_dist = dist.TransformedDistribution(base_dist, [autoregressive_transform])
x1, x2 = sample_dist(autoregressive_dist, 2000)
plt.scatter(x1.numpy(), x2.numpy(), s=2)
plt.title("Autoregressive transformation")
plt.show()

Like the coupling layer, a single autoregressive layer only mildly distorts the Gaussian; stacking several of them and training their parameters on data is what yields genuinely complex distributions. With these building blocks in hand, let’s look at where normalizing flows are used in practice.

6. Applications of Normalizing Flows in Deep Learning

Normalizing flows are not only a theoretical concept, but also a powerful tool for solving various problems in deep learning. In this section, we will see some examples of how normalizing flows can be used in different domains and tasks, such as generative modeling, variational inference, reinforcement learning, and more.

One of the most popular applications of normalizing flows is generative modeling, which is the task of learning to generate new data that resembles the original data. For example, you may want to generate realistic images of faces, animals, or landscapes, or synthesize natural speech or music.

Normalizing flows can be used to learn the distribution of the data and sample from it, by transforming a simple distribution, such as a Gaussian, into a complex one, such as a mixture of Gaussians. This way, you can generate diverse and high-quality samples that capture the structure and the variability of the data.
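
To make this concrete, fitting a flow to data by maximum likelihood only requires the log-density given by the change of variables formula. The sketch below is deliberately minimal and uses only core PyTorch: it learns a single affine flow layer (so it can only recover a Gaussian fit) on a placeholder toy dataset; real generative flows stack many richer invertible layers (coupling, autoregressive, spline, convolutional), but the training objective is exactly the same.

import torch
import torch.distributions as dist
import torch.distributions.transforms as T

data = 0.5 * torch.randn(1000, 2) + torch.tensor([2.0, -1.0])   # placeholder toy dataset
base = dist.Independent(dist.Normal(torch.zeros(2), torch.ones(2)), 1)

# Learnable parameters of a single affine flow layer y = exp(log_scale) * z + loc
loc = torch.zeros(2, requires_grad=True)
log_scale = torch.zeros(2, requires_grad=True)

optimizer = torch.optim.Adam([loc, log_scale], lr=1e-2)
for step in range(2000):
    optimizer.zero_grad()
    flow = T.AffineTransform(loc=loc, scale=log_scale.exp(), event_dim=1)
    model = dist.TransformedDistribution(base, [flow])
    loss = -model.log_prob(data).mean()          # negative log-likelihood via change of variables
    loss.backward()
    optimizer.step()

print(loc.detach(), log_scale.exp().detach())    # should approach the data mean and standard deviation
samples = model.sample((500,))                   # generate new data from the fitted flow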

Some examples of generative models based on normalizing flows are:

  • Glow: A generative model that uses a series of invertible 1×1 convolutions, activation normalization, and affine coupling layers to transform images. It can generate realistic images of faces, as well as perform tasks such as image manipulation, interpolation, and super-resolution.
  • WaveGlow: A generative model that uses a similar architecture as Glow, but for audio signals. It can synthesize high-fidelity speech and music, as well as perform tasks such as voice conversion, style transfer, and denoising.
  • FFJORD: A generative model that uses a continuous-time version of normalizing flows, based on ordinary differential equations. It can model complex distributions, such as those of natural images, molecular structures, and handwritten digits.

Another application of normalizing flows is variational inference, which is the task of approximating a posterior distribution of latent variables given some observed data. For example, you may want to infer the latent factors that explain the data, such as the topics of a document, the attributes of an image, or the preferences of a user.

Normalizing flows can be used to improve the flexibility and expressiveness of the variational distribution, by transforming a simple distribution, such as a Gaussian, into a more complex one, that can better match the true posterior. This way, you can obtain more accurate and efficient inference, as well as avoid problems such as posterior collapse or mode dropping.
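
Concretely, the flow-based variational bound of Rezende and Mohamed (2015) is obtained by drawing $z_0$ from a simple variational distribution $q_0$, pushing it through the flow to get $z_K$, and correcting the entropy term with the log-determinants:

$$\mathcal{L}(x) = \mathbb{E}_{q_0(z_0)}\left[\log p(x, z_K) - \log q_0(z_0) + \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|\right]$$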

Some examples of variational inference models based on normalizing flows are:

  • IAF: A variational autoencoder (VAE) that uses an inverse autoregressive flow to transform the latent distribution. It can learn more informative and diverse latent representations, as well as generate sharper and more realistic images.
  • NAF: A neural autoregressive flow, which replaces the affine transformation in autoregressive flows with a flexible monotonic neural network. It achieves strong density-estimation results and can also serve as a more expressive variational posterior in VAEs.
  • NFVI: A framework that uses normalizing flows to perform variational inference on arbitrary probabilistic models, such as Bayesian neural networks, Gaussian processes, and hidden Markov models. It can handle complex and high-dimensional posteriors, as well as perform tasks such as model selection and uncertainty quantification.

A third application of normalizing flows is reinforcement learning, which is the task of learning to optimize a reward function by interacting with an environment. For example, you may want to train an agent to play a game, control a robot, or navigate a maze.

Normalizing flows can be used to model the policy or the value function of the agent, by transforming a simple distribution, such as a Gaussian, into a more complex one, that can capture the optimal actions or values. This way, you can improve the performance and the robustness of the agent, as well as avoid problems such as overfitting or exploration-exploitation trade-off.

Some examples of reinforcement learning models based on normalizing flows are:

  • SNF: A policy gradient method that uses a stochastic neural flow to model the policy distribution. It can learn more expressive and adaptive policies, as well as achieve state-of-the-art results on continuous control tasks.
  • NFQ: A value-based method that uses a normalizing flow to model the state-action value function. It can learn more accurate and consistent value estimates, as well as perform tasks such as off-policy learning and transfer learning.
  • NFAC: An actor-critic method that uses a normalizing flow to model both the policy and the value function. It can learn more efficient and stable policies and values, as well as perform tasks such as multi-task learning and hierarchical reinforcement learning.

These are just some of the applications of normalizing flows in deep learning, but there are many more. Normalizing flows are a versatile and powerful technique that can be used to model, transform, and learn complex distributions in various domains and tasks.

In the next and final section, we will conclude this blog and discuss some future directions for normalizing flows research.

7. Conclusion and Future Directions

In this blog, you have learned the fundamentals of normalizing flows, a powerful technique for probabilistic modeling and density estimation. You have seen how normalizing flows work, what types of flows exist, and how they can be used in different domains and tasks, such as generative modeling, variational inference, reinforcement learning, and more.

Normalizing flows are a versatile and expressive tool that can approximate any arbitrary distribution, given enough data and transformations. They can also leverage the advantages of both parametric and non-parametric models, such as tractability, flexibility, and scalability.

However, normalizing flows are not without challenges and limitations. Some of the open problems and directions for future research are:

  • How to design more efficient and effective invertible transformations, that can capture complex dependencies and preserve information?
  • How to optimize normalizing flows in a stable and robust way, avoiding issues such as numerical instability, mode collapse, or gradient vanishing?
  • How to evaluate and compare normalizing flows with other methods, using appropriate metrics and benchmarks?
  • How to apply normalizing flows to new and emerging domains and tasks, such as natural language processing, computer vision, graph neural networks, etc.?

These are some of the questions that motivate the ongoing research on normalizing flows, and we hope that this blog has sparked your interest and curiosity to explore them further.

Thank you for reading this blog, and we hope you enjoyed it. If you have any questions, comments, or feedback, please feel free to leave them below. We would love to hear from you!
