1. Introduction
Probabilistic deep learning is a branch of machine learning that combines probabilistic models and deep neural networks. Probabilistic models allow us to capture uncertainty and variability in data, while deep neural networks enable us to learn complex and high-dimensional functions from data. Probabilistic deep learning can be applied to various domains, such as computer vision, natural language processing, recommender systems, and more.
However, probabilistic deep learning also poses some challenges, such as how to perform inference in complex and intractable models. Inference is the process of computing the posterior distribution of the latent variables given the observed data and the model. The posterior distribution represents our updated beliefs about the latent variables after seeing the data. Inference is essential for learning the model parameters, making predictions, and evaluating the model performance.
Unfortunately, exact inference is often impossible or impractical in probabilistic deep learning models, due to the high dimensionality, nonlinearity, and non-conjugacy of the models. Therefore, we need to resort to approximate inference methods, which try to find a simpler and tractable distribution that is close to the true posterior distribution. One of the most popular and widely used approximate inference methods is variational inference.
Variational inference is a technique that transforms the inference problem into an optimization problem. It works by defining a family of distributions, called the variational family, and finding the member of this family that minimizes the divergence from the true posterior distribution. Variational inference has many advantages, such as being scalable, flexible, and compatible with deep neural networks.
In this blog, we will introduce the fundamentals of variational inference and explain how it works in probabilistic deep learning models. We will cover the following topics:
- Bayesian inference and variational inference
- Variational lower bound and Kullback-Leibler divergence
- Mean-field approximation and coordinate ascent
- Applications of variational inference in deep learning
By the end of this blog, you will have a solid understanding of the basic concepts and principles of variational inference, and you will be able to apply them to your own probabilistic deep learning projects. Let’s get started!
2. Bayesian Inference and Variational Inference
In this section, we will review the basics of Bayesian inference and variational inference, and how they are related. Bayesian inference is a framework for updating our beliefs about unknown variables based on observed data and prior knowledge. Variational inference is a technique for performing Bayesian inference in complex and intractable models.
Let’s start with a simple example of Bayesian inference. Suppose we have a coin that we want to test for fairness. We can model the coin as a Bernoulli random variable with parameter $\theta$, which represents the probability of getting heads. We want to infer the value of $\theta$ from a series of coin tosses. We can use the following steps to perform Bayesian inference:
- Define a prior distribution for $\theta$. This is our initial belief about the possible values of $\theta$, before seeing any data. For example, we can use a Beta distribution, which is a common choice for modeling probabilities. A Beta distribution has two parameters, $\alpha$ and $\beta$, which control the shape and location of the distribution. A Beta(1,1) distribution is equivalent to a uniform distribution, which means we have no preference for any value of $\theta$. A Beta(2,2) distribution is symmetric and peaked around 0.5, which means we have a moderate belief that the coin is fair. A Beta(10,1) distribution is skewed and concentrated around 1, which means we have a strong belief that the coin is biased towards heads.
- Collect some data from the coin tosses. This is our evidence that we use to update our beliefs about $\theta$. For example, we can toss the coin 10 times and observe the number of heads and tails. Let $x$ be the number of heads and $n$ be the total number of tosses. In our case, $n=10$ and $x$ can vary from 0 to 10.
- Compute the likelihood of the data given $\theta$. This is the probability of observing the data under different values of $\theta$. For example, we can use the binomial distribution, which models the number of successes in a fixed number of trials. The binomial distribution has two parameters, $n$ and $\theta$, where $n$ is the number of trials and $\theta$ is the probability of success. The likelihood function is given by:
$$
p(x|\theta) = \binom{n}{x} \theta^x (1-\theta)^{n-x}
$$
This function tells us how likely it is to observe $x$ heads out of $n$ tosses for a given value of $\theta$. For example, if $\theta=0.5$, the likelihood of observing 5 heads and 5 tails is 0.246, while the likelihood of observing 10 heads and 0 tails is 0.001.
- Compute the posterior distribution of $\theta$ given the data. This is our updated belief about the possible values of $\theta$, after seeing the data. We can use Bayes’ theorem, which relates the prior, the likelihood, and the posterior, as follows:
$$
p(\theta|x) = \frac{p(x|\theta) p(\theta)}{p(x)}
$$
This equation tells us how to update our prior distribution with the likelihood of the data to obtain the posterior distribution. The denominator, $p(x)$, is the marginal probability of the data, which can be obtained by integrating over all possible values of $\theta$. For example, if we use a Beta prior and a binomial likelihood, the posterior distribution is also a Beta distribution, with parameters $\alpha+x$ and $\beta+n-x$. This is an example of a conjugate prior, which means that the prior and the posterior belong to the same family of distributions. Conjugate priors make the computation of the posterior easier and more tractable.
- Summarize the posterior distribution of $\theta$. This is our final inference about the value of $\theta$, based on the posterior distribution. We can use different statistics to summarize the posterior distribution, such as the mean, the median, the mode, the variance, or a credible interval. For example, if we use a Beta(2,2) prior and observe 6 heads and 4 tails, the posterior distribution is a Beta(8,6) distribution, with a mean of 0.571, a median of about 0.575, a mode of 0.583, a variance of 0.016, and a 95% equal-tailed credible interval of approximately [0.32, 0.81]. These statistics tell us that the most plausible values of $\theta$ lie slightly above 0.5, but with considerable uncertainty: after only 10 tosses, a fair coin is still well within the interval.
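The steps above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library; the function names (`binomial_likelihood`, `beta_posterior`, `beta_summary`) are made up for this example, and the mode formula assumes that both posterior parameters are greater than 1.

```python
import math


def binomial_likelihood(x, n, theta):
    """Probability of x heads in n tosses when P(heads) = theta."""
    return math.comb(n, x) * theta**x * (1 - theta) ** (n - x)


def beta_posterior(alpha, beta, x, n):
    """Conjugate update: Beta(alpha, beta) prior plus x heads in n tosses."""
    return alpha + x, beta + (n - x)


def beta_summary(a, b):
    """Mean, mode and variance of Beta(a, b); the mode formula needs a, b > 1."""
    mean = a / (a + b)
    mode = (a - 1) / (a + b - 2)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, mode, variance


# Likelihood of 5 heads in 10 tosses of a fair coin.
lik_fair = binomial_likelihood(5, 10, 0.5)

# Beta(2, 2) prior, 6 heads and 4 tails observed.
a_post, b_post = beta_posterior(2, 2, x=6, n=10)
post_mean, post_mode, post_var = beta_summary(a_post, b_post)

print(f"likelihood = {lik_fair:.3f}")
print(f"posterior  = Beta({a_post}, {b_post})")
print(f"mean={post_mean:.3f} mode={post_mode:.3f} variance={post_var:.4f}")
```

Because the prior is conjugate, the whole update reduces to adding counts to the prior parameters, which is why no integration appears in the code.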
This is a simple example of how Bayesian inference works. However, in many cases, Bayesian inference is not so simple and straightforward. For example, what if we have a more complex model, with many latent variables and nonlinear relationships? What if we have a large amount of data, which makes the computation of the likelihood and the marginal probability intractable? What if we have a non-conjugate prior, which makes the computation of the posterior distribution intractable? These are some of the challenges that we face when we apply Bayesian inference to probabilistic deep learning models.
This is where variational inference comes in. Variational inference is a technique that allows us to perform approximate Bayesian inference in complex and intractable models. Variational inference works by defining a family of distributions, called the variational family, and finding the member of this family that minimizes the divergence from the true posterior distribution. Variational inference has many advantages, such as being scalable, flexible, and compatible with deep neural networks.
In the next section, we will explain how variational inference works in more detail, and introduce some key concepts and principles, such as the variational lower bound and the Kullback-Leibler divergence.
2.1. Bayesian Inference
Bayesian inference is a framework for updating our beliefs about unknown variables based on observed data and prior knowledge. Bayesian inference allows us to quantify uncertainty and variability in data, and to incorporate domain knowledge and assumptions into our models. Bayesian inference can be applied to various types of models, such as linear regression, logistic regression, Gaussian mixture models, latent Dirichlet allocation, and more.
The basic idea of Bayesian inference is to use Bayes’ theorem, which relates the prior, the likelihood, and the posterior, as follows:
$$
p(\theta|x) = \frac{p(x|\theta) p(\theta)}{p(x)}
$$
where $\theta$ are the latent variables, $x$ are the observed data, $p(\theta)$ is the prior distribution, $p(x|\theta)$ is the likelihood function, $p(\theta|x)$ is the posterior distribution, and $p(x)$ is the marginal probability of the data.
The prior distribution represents our initial belief about the possible values of the latent variables, before seeing any data. The prior distribution can be chosen based on domain knowledge, intuition, or convenience. The prior distribution can also express different degrees of certainty or uncertainty about the latent variables, by using different shapes and parameters.
The likelihood function represents the probability of observing the data under different values of the latent variables. The likelihood function depends on the model and the data. The likelihood function can also capture different types of data, such as continuous, discrete, or categorical, by using different distributions and parameters.
The posterior distribution represents our updated belief about the possible values of the latent variables, after seeing the data. The posterior distribution combines the information from the prior distribution and the likelihood function, and reflects how the data affects our beliefs about the latent variables. The posterior distribution can also be used for various purposes, such as learning the model parameters, making predictions, and evaluating the model performance.
The marginal probability of the data represents the probability of observing the data averaged over all possible values of the latent variables, weighted by the prior. It is obtained by integrating the product of the likelihood and the prior over the latent variables, $p(x) = \int p(x|\theta) p(\theta) d\theta$. It also serves as the normalization constant that ensures that the posterior distribution is a valid probability distribution that sums or integrates to one.
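To make the role of each term concrete, here is a rough sketch that evaluates Bayes’ theorem numerically on a grid for a coin-toss model. The numbers (6 heads in 10 tosses, a Beta(2,2) prior) and the helper names are assumptions chosen for illustration, and the midpoint rule is just one simple way to approximate the integral.

```python
import math


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, using the gamma function."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * theta ** (a - 1) * (1 - theta) ** (b - 1)


def likelihood(theta, x, n):
    """Binomial probability of x heads in n tosses."""
    return math.comb(n, x) * theta**x * (1 - theta) ** (n - x)


x, n = 6, 10        # observed heads and tosses (illustrative)
a, b = 2.0, 2.0     # Beta prior parameters (illustrative)

# Midpoint rule on [0, 1]: p(x) is approximated by sum(likelihood * prior) * width.
num_cells = 10_000
width = 1.0 / num_cells
grid = [(i + 0.5) * width for i in range(num_cells)]
unnormalized = [likelihood(t, x, n) * beta_pdf(t, a, b) for t in grid]
evidence = sum(unnormalized) * width

# Dividing by p(x) turns likelihood * prior into a proper density.
posterior = [u / evidence for u in unnormalized]
posterior_mass = sum(posterior) * width
posterior_mean = sum(t * p for t, p in zip(grid, posterior)) * width

print(f"p(x) = {evidence:.5f}, posterior mean = {posterior_mean:.4f}")
```

A grid like this works for one latent variable, but its cost grows exponentially with the number of latent variables, which is the practical reason exact integration breaks down in larger models.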
This is the general procedure of Bayesian inference. However, in practice, Bayesian inference can be challenging and computationally expensive, especially for complex and high-dimensional models. Therefore, we need to use approximate methods, such as variational inference, to perform Bayesian inference efficiently and effectively.
In the next section, we will introduce variational inference and explain how it works in probabilistic deep learning models.
2.2. Variational Inference
Variational inference is a technique that allows us to perform approximate Bayesian inference in complex and intractable models. Variational inference works by defining a family of distributions, called the variational family, and finding the member of this family that minimizes the divergence from the true posterior distribution. Variational inference has many advantages, such as being scalable, flexible, and compatible with deep neural networks.
The basic idea of variational inference is to use optimization instead of integration to compute the posterior distribution. Instead of trying to calculate the exact posterior distribution, which may be impossible or impractical, we try to find a simpler and tractable distribution that is close to the true posterior distribution. We can measure the closeness of two distributions by using a divergence function, such as the Kullback-Leibler divergence, which we will discuss in the next section.
The variational family is a set of distributions that we can choose from to approximate the true posterior distribution. The variational family can be parametric or nonparametric, depending on the complexity and flexibility of the distributions. The variational family can also be chosen based on convenience, computational efficiency, or prior knowledge. For example, we can use a Gaussian family, which is a common choice for modeling continuous variables. A Gaussian family has two parameters, $\mu$ and $\sigma$, which control the mean and the standard deviation of the distribution. A Gaussian family is simple and tractable, but it may not be able to capture complex and multimodal distributions.
The variational inference problem can be formulated as follows:
$$
q^*(\theta) = \arg\min_{q(\theta) \in \mathcal{Q}} \mathrm{KL}(q(\theta) || p(\theta|x))
$$
where $q(\theta)$ is the variational distribution, $\mathcal{Q}$ is the variational family, and $\mathrm{KL}(q(\theta) || p(\theta|x))$ is the Kullback-Leibler divergence between the variational distribution and the true posterior distribution. The variational distribution $q^*(\theta)$ is the optimal solution that minimizes the divergence from the true posterior distribution.
The variational inference problem can be solved by using various optimization methods, such as gradient descent, coordinate ascent, stochastic optimization, etc. The optimization methods can also be combined with deep neural networks, which can learn complex and flexible variational distributions from data. This is the idea behind variational autoencoders, which are a popular and powerful application of variational inference in deep learning.
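The idea of turning inference into optimization can be illustrated with a deliberately tiny example. In the sketch below, the "true posterior" is a hypothetical distribution over three states, the variational family is the one-parameter Binomial(2, t) family, and the optimizer is a plain grid search; all of these are assumptions made for illustration, not choices prescribed by the method.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) for discrete distributions given as lists of probabilities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def binomial_family(t):
    """One-parameter family over the states {0, 1, 2}: Binomial(2, t)."""
    return [(1 - t) ** 2, 2 * t * (1 - t), t**2]


# Hypothetical target distribution; it is not a member of the family above.
target = [0.1, 0.6, 0.3]

# Crude optimizer: evaluate the objective on a grid of parameter values.
candidates = [i / 1000 for i in range(1, 1000)]
best_t = min(candidates, key=lambda t: kl_divergence(binomial_family(t), target))
best_q = binomial_family(best_t)
best_kl = kl_divergence(best_q, target)

print(f"best t = {best_t:.3f}, KL = {best_kl:.4f}")
print("best q =", [round(v, 3) for v in best_q])
```

The remaining divergence is strictly positive because the target lies outside the family. This is the price of restricting the search to a tractable set of distributions.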
In the next section, we will explain the variational lower bound and the Kullback-Leibler divergence, which are two key concepts and principles of variational inference.
3. Variational Lower Bound and Kullback-Leibler Divergence
In this section, we will explain the variational lower bound and the Kullback-Leibler divergence, which are two key concepts and principles of variational inference. The variational lower bound is a function that measures the quality of the variational distribution, and the Kullback-Leibler divergence is a function that measures the distance between two distributions.
The variational lower bound, also known as the evidence lower bound (ELBO), is a function that lower bounds the log marginal probability of the data, $\log p(x)$. The variational lower bound can be derived by using the following identity:
$$
\log p(x) = \mathrm{KL}(q(\theta) || p(\theta|x)) + \mathbb{E}_{q(\theta)}[\log p(x|\theta)] - \mathrm{KL}(q(\theta) || p(\theta))
$$
where $\mathrm{KL}(q(\theta) || p(\theta|x))$ is the Kullback-Leibler divergence between the variational distribution and the true posterior distribution, $\mathbb{E}_{q(\theta)}[\log p(x|\theta)]$ is the expected log-likelihood of the data under the variational distribution, and $\mathrm{KL}(q(\theta) || p(\theta))$ is the Kullback-Leibler divergence between the variational distribution and the prior distribution.
The variational lower bound can be obtained by rearranging the terms and dropping the first term, which is always non-negative, as follows:
$$
\log p(x) \geq \mathbb{E}_{q(\theta)}[\log p(x|\theta)] - \mathrm{KL}(q(\theta) || p(\theta)) = \mathrm{ELBO}(q(\theta))
$$
The variational lower bound is a function of the variational distribution, $q(\theta)$. The variational lower bound has two components: the first component is the expected log-likelihood of the data, which measures how well the variational distribution fits the data; the second component is the Kullback-Leibler divergence from the prior, which measures how much the variational distribution deviates from the prior. The variational lower bound can be interpreted as a trade-off between data fit and complexity regularization.
The variational lower bound is also related to the variational inference problem, which is to minimize the Kullback-Leibler divergence between the variational distribution and the true posterior distribution. The variational inference problem can be reformulated as maximizing the variational lower bound, as follows:
$$
q^*(\theta) = \arg\min_{q(\theta) \in \mathcal{Q}} \mathrm{KL}(q(\theta) || p(\theta|x)) = \arg\max_{q(\theta) \in \mathcal{Q}} \mathrm{ELBO}(q(\theta))
$$
This reformulation shows that maximizing the variational lower bound is equivalent to minimizing the divergence from the true posterior distribution. This also shows that the variational lower bound is a useful objective function for variational inference, as it can be optimized with respect to the variational distribution.
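The identity and the bound can be checked numerically when the latent variable takes only a few values, because every quantity is then a finite sum. The sketch below assumes a coin whose bias $\theta$ is restricted to three hypothetical values with a uniform prior, and an arbitrary variational distribution `q`; none of these numbers come from a real dataset.

```python
import math

thetas = [0.2, 0.5, 0.8]        # hypothetical discrete values of theta
prior = [1 / 3, 1 / 3, 1 / 3]   # uniform prior over those values
x, n = 6, 10                    # 6 heads in 10 tosses (illustrative)

lik = [math.comb(n, x) * t**x * (1 - t) ** (n - x) for t in thetas]
evidence = sum(l * p for l, p in zip(lik, prior))
posterior = [l * p / evidence for l, p in zip(lik, prior)]


def kl(q, p):
    """KL(q || p) for discrete distributions on the same support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def elbo(q):
    """Expected log-likelihood under q minus KL(q || prior)."""
    expected_loglik = sum(qi * math.log(li) for qi, li in zip(q, lik))
    return expected_loglik - kl(q, prior)


q = [0.2, 0.5, 0.3]             # an arbitrary variational distribution
gap = math.log(evidence) - elbo(q)

print(f"log p(x) = {math.log(evidence):.4f}")
print(f"ELBO(q)  = {elbo(q):.4f}")
print(f"gap      = {gap:.4f}, KL(q || posterior) = {kl(q, posterior):.4f}")
```

The gap between $\log p(x)$ and the ELBO matches the divergence from the posterior, and it closes completely when `q` is set to the exact posterior.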
The Kullback-Leibler divergence, also known as the relative entropy, is a function that measures the distance or dissimilarity between two distributions, $p(\theta)$ and $q(\theta)$. The Kullback-Leibler divergence can be defined as follows:
$$
\mathrm{KL}(q(\theta) || p(\theta)) = \mathbb{E}_{q(\theta)}[\log q(\theta) - \log p(\theta)]
$$
The Kullback-Leibler divergence is a function of both distributions, $p(\theta)$ and $q(\theta)$. It has the following properties: it is always non-negative, it is zero if and only if the two distributions are equal (almost everywhere), and it is asymmetric, so it is not a true metric. Because the expectation is taken under $q(\theta)$, $\mathrm{KL}(q(\theta) || p(\theta))$ can be interpreted as the expected amount of information lost when $p(\theta)$ is used to approximate $q(\theta)$.
The Kullback-Leibler divergence is also related to the variational inference problem, which is to minimize the Kullback-Leibler divergence between the variational distribution and the true posterior distribution. It quantifies how close the variational distribution is to the true posterior distribution, so it is a natural measure of the quality of the approximation. However, it cannot be evaluated directly, because it depends on the true posterior distribution, which is exactly the quantity we cannot compute. This is why variational inference maximizes the variational lower bound instead.
These are the variational lower bound and the Kullback-Leibler divergence, which are two key concepts and principles of variational inference. In the next section, we will introduce the mean-field approximation and the coordinate ascent, which are two common techniques for variational inference.
3.1. Variational Lower Bound
In the previous section, we saw how variational inference works by defining a variational family of distributions and finding the member that minimizes the divergence from the true posterior distribution. But how do we measure the divergence between two distributions? And how do we optimize the variational distribution? These are the questions that we will answer in this section, by introducing the concept of the variational lower bound.
The variational lower bound, also known as the evidence lower bound (ELBO), is a function that measures how well the variational distribution approximates the true posterior distribution. The variational lower bound is derived from the marginal probability of the data, which is the denominator of Bayes’ theorem. Recall that the marginal probability of the data is given by:
$$
p(x) = \int p(x|\theta) p(\theta) d\theta
$$
This integral is often intractable, as it involves summing or integrating over all possible values of the latent variables. However, we can use a trick to rewrite this integral in a different way, by introducing the variational distribution $q(\theta)$ and assuming that $q(\theta) > 0$ wherever $p(x|\theta) p(\theta) > 0$. We can multiply and divide the integrand by $q(\theta)$, and then take the logarithm of both sides, as follows:
$$
\begin{align*}
p(x) &= \int p(x|\theta) p(\theta) d\theta \\
&= \int p(x|\theta) p(\theta) \frac{q(\theta)}{q(\theta)} d\theta \\
&= \int q(\theta) \frac{p(x|\theta) p(\theta)}{q(\theta)} d\theta \\
\log p(x) &= \log \int q(\theta) \frac{p(x|\theta) p(\theta)}{q(\theta)} d\theta
\end{align*}
$$
Now, we can use Jensen’s inequality, which follows from the fact that the logarithm is a concave function: the logarithm of an expectation is at least the expectation of the logarithm. In integral form, $\log \int q(\theta) f(\theta) d\theta \geq \int q(\theta) \log f(\theta) d\theta$ for any positive function $f(\theta)$ and any probability density $q(\theta)$. Applying Jensen’s inequality to the previous equation, with $f(\theta) = p(x|\theta) p(\theta) / q(\theta)$, we get:
$$
\begin{align*}
\log p(x) &\geq \int q(\theta) \log \frac{p(x|\theta) p(\theta)}{q(\theta)} d\theta \\
&= \int q(\theta) \log \frac{p(x,\theta)}{q(\theta)} d\theta \\
&= \mathbb{E}_{q(\theta)}[\log p(x,\theta)] - \mathbb{E}_{q(\theta)}[\log q(\theta)]
\end{align*}
$$
The last equation defines the variational lower bound, which is a lower bound on the logarithm of the marginal probability of the data. The variational lower bound depends on the choice of the variational distribution $q(\theta)$, and it has two terms: the first term is the expected log joint probability of the data and the latent variables under the variational distribution, and the second term is the entropy of the variational distribution. The first term measures how well the variational distribution fits the data and the model, and the second term measures how spread out the variational distribution is. The variational lower bound can be interpreted as a trade-off between fitting the data and the model, and being flexible and uncertain.
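Since both terms of the bound are expectations under $q(\theta)$, they can be estimated by sampling from $q(\theta)$. The sketch below does this for a toy model that is assumed purely for illustration: $\theta \sim \mathcal{N}(0,1)$ and $x \sim \mathcal{N}(\theta,1)$, for which $p(x) = \mathcal{N}(x; 0, 2)$ and the exact posterior is $\mathcal{N}(x/2, 1/2)$. The function names, the sample size, and the seed are arbitrary choices.

```python
import math
import random


def log_normal_pdf(value, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (value - mean) ** 2 / var


def elbo_estimate(x, q_mean, q_var, num_samples=20_000, seed=0):
    """Monte Carlo estimate of E_q[log p(x, theta)] - E_q[log q(theta)]."""
    rng = random.Random(seed)
    q_std = math.sqrt(q_var)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.gauss(q_mean, q_std)
        # log p(x, theta) = log p(theta) + log p(x | theta) for the toy model.
        log_joint = log_normal_pdf(theta, 0.0, 1.0) + log_normal_pdf(x, theta, 1.0)
        total += log_joint - log_normal_pdf(theta, q_mean, q_var)
    return total / num_samples


x_obs = 1.0
log_evidence = log_normal_pdf(x_obs, 0.0, 2.0)  # exact log p(x) for this model

elbo_prior = elbo_estimate(x_obs, q_mean=0.0, q_var=1.0)        # q = prior
elbo_exact = elbo_estimate(x_obs, q_mean=x_obs / 2, q_var=0.5)  # q = posterior

print(f"log p(x)           = {log_evidence:.4f}")
print(f"ELBO, q = prior     = {elbo_prior:.4f}")
print(f"ELBO, q = posterior = {elbo_exact:.4f}")
```

With `q` equal to the prior the estimate stays clearly below $\log p(x)$, while with `q` equal to the exact posterior every sample contributes exactly $\log p(x)$, so the bound is tight.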
The variational lower bound is useful for two reasons: first, it tells us how far the variational distribution is from the true posterior distribution, and second, it provides an objective that we can optimize. To see how, let’s rewrite the joint probability using the definition of conditional probability, $p(x,\theta) = p(\theta|x) p(x)$, and substitute it into the variational lower bound:
$$
\begin{align*}
\mathrm{ELBO}(q(\theta)) &= \mathbb{E}_{q(\theta)}[\log p(x,\theta)] - \mathbb{E}_{q(\theta)}[\log q(\theta)] \\
&= \mathbb{E}_{q(\theta)}[\log p(\theta|x)] + \log p(x) - \mathbb{E}_{q(\theta)}[\log q(\theta)] \\
&= \log p(x) + \mathbb{E}_{q(\theta)}[\log \frac{p(\theta|x)}{q(\theta)}] \\
&= \log p(x) - \mathbb{E}_{q(\theta)}[\log \frac{q(\theta)}{p(\theta|x)}]
\end{align*}
$$
The last equation shows that the variational lower bound is equal to the log marginal probability of the data minus a non-negative gap. The log marginal probability of the data is a constant that does not depend on the variational distribution, so maximizing the variational lower bound is equivalent to minimizing this gap. The gap is exactly the Kullback-Leibler divergence between the variational distribution and the true posterior distribution, which is a non-negative function that quantifies how much information is lost when using one distribution to approximate another distribution. The Kullback-Leibler divergence is defined as follows:
$$
\mathrm{KL}(q(\theta)||p(\theta|x)) = \mathbb{E}_{q(\theta)}[\log \frac{q(\theta)}{p(\theta|x)}]
$$
The Kullback-Leibler divergence is zero if and only if the variational distribution and the true posterior distribution are identical, and it is positive otherwise. Therefore, minimizing the Kullback-Leibler divergence is equivalent to making the variational distribution as close as possible to the true posterior distribution.
In summary, the variational lower bound is a function that measures how well the variational distribution approximates the true posterior distribution, and it provides a way to optimize the variational distribution by minimizing the Kullback-Leibler divergence. In the next section, we will see how to compute and optimize the variational lower bound in practice, and introduce some methods and techniques, such as mean-field approximation and coordinate ascent.
3.2. Kullback-Leibler Divergence
In the previous section, we saw how the variational lower bound measures how well the variational distribution approximates the true posterior distribution, and how it provides a way to optimize the variational distribution by minimizing the Kullback-Leibler divergence. In this section, we will explore the Kullback-Leibler divergence in more detail, and see how it relates to the concepts of information theory and entropy.
The Kullback-Leibler divergence, also known as the relative entropy, is a function that quantifies how much information is lost when using one distribution to approximate another distribution. The Kullback-Leibler divergence is defined as follows:
$$
\mathrm{KL}(q(\theta)||p(\theta|x)) = \mathbb{E}_{q(\theta)}[\log \frac{q(\theta)}{p(\theta|x)}]
$$
The Kullback-Leibler divergence is a non-negative function that is zero if and only if the two distributions are identical, and positive otherwise. The Kullback-Leibler divergence is not symmetric, which means that in general $\mathrm{KL}(q(\theta)||p(\theta|x)) \neq \mathrm{KL}(p(\theta|x)||q(\theta))$. The Kullback-Leibler divergence is also not a true distance metric: besides being asymmetric, it does not satisfy the triangle inequality, which states that the distance between two points is always less than or equal to the sum of the distances between those points and a third point.
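A small numerical check can make the asymmetry concrete. The two discrete distributions below are arbitrary examples, and the helper `kl` is a hypothetical function written for this sketch.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

kl_pq = kl(p, q)   # expectation taken under p
kl_qp = kl(q, p)   # expectation taken under q
kl_pp = kl(p, p)   # a distribution compared with itself

print(f"KL(p || q) = {kl_pq:.4f}")
print(f"KL(q || p) = {kl_qp:.4f}")
print(f"KL(p || p) = {kl_pp:.4f}")
```

The two directions weight the mismatch differently because each one averages the log ratio under a different distribution. This is why the direction used in variational inference, with the expectation under the variational distribution, is a modeling choice rather than a formality.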
The Kullback-Leibler divergence $\mathrm{KL}(q(\theta)||p(\theta|x))$ can be interpreted as the expected number of extra bits required to encode samples from the variational distribution using a code optimized for the true posterior distribution, instead of a code optimized for the variational distribution itself. The unit is bits when the logarithm is taken in base 2, and nats when the natural logarithm is used. This interpretation is based on the concepts of information theory and entropy, which we will briefly review here. Information theory is a branch of mathematics that studies the quantification, storage, and communication of information. Entropy is a measure of the uncertainty or randomness of a random variable or a distribution. The entropy of a discrete random variable $X$ with probability mass function $p(x)$ is defined as follows:
$$
H(X) = -\sum_{x} p(x) \log p(x)
$$
The entropy of a continuous random variable $X$ with probability density function $f(x)$ is defined as follows:
$$
H(X) = -\int f(x) \log f(x) dx
$$
For a discrete distribution, the entropy measures the average number of bits required to encode the data using the optimal code (with base-2 logarithms). The optimal code is the one that assigns shorter codes to more probable outcomes and longer codes to less probable outcomes. The entropy of a discrete distribution over a finite set of outcomes is maximized when the distribution is uniform, which means that all outcomes are equally likely, and minimized when the distribution is deterministic, which means that only one outcome is certain. The discrete entropy is always non-negative, and it is zero if and only if the distribution is deterministic. The continuous (differential) entropy does not share all of these properties; in particular, it can be negative.
The Kullback-Leibler divergence can be expressed in terms of the entropy and the cross-entropy, which is a measure of the average number of bits required to encode the data using a code based on another distribution. The cross-entropy of two discrete distributions $p(x)$ and $q(x)$ is defined as follows:
$$
H(p,q) = -\sum_{x} p(x) \log q(x)
$$
The cross-entropy of two continuous distributions $f(x)$ and $g(x)$ is defined as follows:
$$
H(f,g) = -\int f(x) \log g(x) dx
$$
The cross-entropy is always greater than or equal to the entropy, and it is equal to the entropy if and only if the two distributions are identical. The difference between the cross-entropy and the entropy is the Kullback-Leibler divergence, as follows:
$$
\begin{align*}
H(p,q) &= -\sum_{x} p(x) \log q(x) \\
&= -\sum_{x} p(x) \log \frac{q(x)}{p(x)} - \sum_{x} p(x) \log p(x) \\
&= \mathrm{KL}(p||q) + H(p) \\
H(f,g) &= -\int f(x) \log g(x) dx \\
&= -\int f(x) \log \frac{g(x)}{f(x)} dx - \int f(x) \log f(x) dx \\
&= \mathrm{KL}(f||g) + H(f)
\end{align*}
$$
Therefore, the Kullback-Leibler divergence measures the extra number of bits required to encode the data using a code based on another distribution, compared to the optimal code based on the original distribution.
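The coding interpretation is easy to verify on a small example. The sketch below uses base-2 logarithms so that all quantities are in bits; the two distributions are arbitrary, chosen with probabilities that are powers of two so the code lengths come out as round numbers.

```python
import math


def entropy_bits(p):
    """H(p): average code length of the optimal code for p, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy_bits(p, q):
    """H(p, q): average code length when data from p uses a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_bits(p, q):
    """KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.25, 0.25]   # distribution that generates the data
q = [0.25, 0.25, 0.5]   # distribution the code was designed for

h_p = entropy_bits(p)
h_pq = cross_entropy_bits(p, q)
extra_bits = kl_bits(p, q)

print(f"H(p)       = {h_p} bits")
print(f"H(p, q)    = {h_pq} bits")
print(f"KL(p || q) = {extra_bits} bits")
```

Here the mismatched code spends 2 bits on the most frequent outcome instead of 1, which accounts for the extra quarter of a bit per symbol on average.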
In summary, the Kullback-Leibler divergence is a function that quantifies how much information is lost when using one distribution to approximate another distribution, and it can be interpreted in terms of the concepts of information theory and entropy. The Kullback-Leibler divergence is the objective function that we want to minimize in variational inference, as it measures the difference between the variational distribution and the true posterior distribution. In the next section, we will see how to compute and optimize the Kullback-Leibler divergence in practice, and introduce some methods and techniques, such as mean-field approximation and coordinate ascent.
4. Mean-Field Approximation and Coordinate Ascent
In the previous section, we learned how to perform variational inference by finding the variational distribution that minimizes the Kullback-Leibler divergence from the true posterior distribution. However, we did not specify how to choose the variational family and how to optimize the variational parameters. In this section, we will introduce one of the most common and simple choices for the variational family, called the mean-field approximation, and one of the most efficient and simple methods for optimizing the variational parameters, called coordinate ascent.
The mean-field approximation is a variational family that assumes that the latent variables are independent of each other. This means that the variational distribution can be factorized into the product of the marginal distributions of each latent variable. For example, if we have $K$ latent variables, $\mathbf{z} = (z_1, z_2, \dots, z_K)$, the mean-field approximation is given by:
$$
q(\mathbf{z}) = \prod_{k=1}^K q_k(z_k)
$$
where $q_k(z_k)$ is the variational distribution of the $k$-th latent variable. The mean-field approximation has two main advantages: it simplifies the computation of the variational lower bound and the Kullback-Leibler divergence, and it reduces the number of variational parameters to optimize. However, it also has a major drawback: it ignores the possible correlations and dependencies among the latent variables, which can lead to inaccurate and biased approximations.
Coordinate ascent is a method for optimizing the variational parameters that exploits the structure of the mean-field approximation. Coordinate ascent works by optimizing one variational parameter at a time, while keeping the others fixed. This way, it can find the optimal value of each variational parameter analytically, without using gradient-based methods. Coordinate ascent iterates over the variational parameters until convergence, or until a maximum number of iterations is reached. Coordinate ascent has the advantage of being fast, simple, and stable, but it also has some limitations: it can get stuck in local optima, and it requires the existence of closed-form solutions for each variational parameter.
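For reference, the closed-form update that coordinate ascent relies on is usually written per factor rather than per parameter. The expression below is the standard mean-field result, stated here without derivation; the notation $q_{-k}$, meaning the product of all factors except the $k$-th, is introduced only for this formula.

$$
q_k^*(z_k) \propto \exp\left( \mathbb{E}_{q_{-k}}[\log p(\mathbf{z}, x)] \right)
$$

Each update takes the expected log joint probability under the other factors, which are held fixed, and normalizes the result. A closed-form solution exists when this expression can be recognized as a known distribution, which is the requirement mentioned above.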
In the next section, we will see how to apply the mean-field approximation and coordinate ascent to a simple example of variational inference, and compare the results with the exact Bayesian inference.
4.1. Mean-Field Approximation
In this section, we will introduce one of the most common and simple choices for the variational family, called the mean-field approximation. The mean-field approximation is a variational family that assumes that the latent variables are independent of each other. This means that the variational distribution can be factorized into the product of the marginal distributions of each latent variable.
For example, if we have $K$ latent variables, $\mathbf{z} = (z_1, z_2, \dots, z_K)$, the mean-field approximation is given by:
$$
q(\mathbf{z}) = \prod_{k=1}^K q_k(z_k)
$$
where $q_k(z_k)$ is the variational distribution of the $k$-th latent variable. The mean-field approximation has two main advantages: it simplifies the computation of the variational lower bound and the Kullback-Leibler divergence, and it reduces the number of variational parameters to optimize. However, it also has a major drawback: it ignores the possible correlations and dependencies among the latent variables, which can lead to inaccurate and biased approximations.
To illustrate the mean-field approximation, let’s consider a simple example of a probabilistic model with two latent variables, $z_1$ and $z_2$, and one observed variable, $x$. The model is defined by the following generative process:
- Draw $z_1$ from a standard normal distribution: $z_1 \sim \mathcal{N}(0,1)$
- Draw $z_2$ from a normal distribution with mean $z_1$ and unit variance: $z_2 \sim \mathcal{N}(z_1,1)$
- Draw $x$ from a normal distribution with mean $z_2$ and unit variance: $x \sim \mathcal{N}(z_2,1)$
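The three steps above amount to ancestral sampling, which can be sketched in a few lines. This is only an illustration: the function name `sample_model`, the seed, and the sample size are assumptions made for the example, not part of the model.

```python
import random
import statistics


def sample_model(rng):
    """Draw one (z1, z2, x) triple by following the generative process."""
    z1 = rng.gauss(0.0, 1.0)  # z1 ~ N(0, 1)
    z2 = rng.gauss(z1, 1.0)   # z2 ~ N(z1, 1)
    x = rng.gauss(z2, 1.0)    # x  ~ N(z2, 1)
    return z1, z2, x


rng = random.Random(0)  # fixed seed, chosen only for reproducibility
samples = [sample_model(rng) for _ in range(20000)]
xs = [x for _, _, x in samples]

# x is a sum of three independent unit-variance normals, so its variance
# should be close to 3.
x_variance = statistics.pvariance(xs)
print(f"sample variance of x: {x_variance:.2f}")
```

The sample variance of $x$ lands near 3 because each step of the chain adds one unit of variance.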
The graphical model of this example is a chain, $z_1 \to z_2 \to x$.
The true posterior distribution of the latent variables given the observed data is:
$$
p(\mathbf{z}|x) = p(z_1, z_2|x) = \frac{p(x|z_1, z_2) p(z_1, z_2)}{p(x)}
$$
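In most models of interest, this distribution is intractable, because the denominator, $p(x)$, involves an integral over the latent variables that cannot be computed analytically, and we need variational inference to approximate it. In this particular model every conditional distribution is linear-Gaussian, so the posterior is itself Gaussian and can be computed exactly. This makes it a convenient test case: we can run variational inference and check the approximation against the exact answer.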
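For this chain model, the joint distribution of $(z_1, z_2, x)$ is a zero-mean multivariate normal, so the posterior follows from the standard Gaussian conditioning formulas. The sketch below writes that out by hand. The function name `exact_posterior` and the observation $x = 1$ are assumptions made for illustration.

```python
def exact_posterior(x):
    """Exact Gaussian posterior p(z1, z2 | x) for the chain z1 -> z2 -> x.

    The joint over (z1, z2, x) is zero-mean Gaussian with
    Var(z1) = 1, Var(z2) = 2, Var(x) = 3,
    Cov(z1, z2) = 1, Cov(z1, x) = 1, Cov(z2, x) = 2.
    """
    var_x = 3.0
    cov_zx = (1.0, 2.0)                # Cov(z1, x), Cov(z2, x)
    cov_zz = ((1.0, 1.0), (1.0, 2.0))  # prior covariance of (z1, z2)

    # Conditional mean: Cov(z, x) / Var(x) * x
    mean = tuple(c * x / var_x for c in cov_zx)
    # Conditional covariance: Cov(z, z) - Cov(z, x) Cov(x, z) / Var(x)
    cov = tuple(
        tuple(cov_zz[i][j] - cov_zx[i] * cov_zx[j] / var_x for j in range(2))
        for i in range(2)
    )
    return mean, cov


mean, cov = exact_posterior(1.0)
print(mean)  # approximately (1/3, 2/3)
print(cov)   # approximately ((2/3, 1/3), (1/3, 2/3))
```

The off-diagonal entry of the covariance is not zero, so $z_1$ and $z_2$ are correlated under the exact posterior. This is exactly the structure that a factorized approximation cannot capture.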
If we use the mean-field approximation, we assume that the variational distribution of the latent variables is:
$$
q(\mathbf{z}) = q(z_1) q(z_2)
$$
where $q(z_1)$ and $q(z_2)$ are the variational distributions of $z_1$ and $z_2$, respectively. We can choose any family of distributions for $q(z_1)$ and $q(z_2)$, but a common choice is to use normal distributions, because they are easy to work with. For this linear-Gaussian model the choice loses nothing: the log joint is quadratic in each latent variable, so the best possible factors are themselves normal. Therefore, we can write:
$$
q(z_1) = \mathcal{N}(\mu_1, \sigma_1^2)
$$
$$
q(z_2) = \mathcal{N}(\mu_2, \sigma_2^2)
$$
where $\mu_1$, $\sigma_1^2$, $\mu_2$, and $\sigma_2^2$ are the variational parameters that we need to optimize. In the graphical model of the mean-field approximation, $z_1$ and $z_2$ are separate nodes with no edge between them.
In the next section, we will see how to optimize the variational parameters using coordinate ascent, and compare the results with the exact Bayesian inference.
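For the factorized Gaussian chosen above, every expectation in the variational lower bound of this model can be taken in closed form. The sketch below evaluates the bound and the exact log evidence, using the fact that $x \sim \mathcal{N}(0, 3)$ marginally. The names `elbo`, `log_evidence`, and `x_obs` are assumptions made for illustration.

```python
import math


def elbo(mu1, var1, mu2, var2, x):
    """Closed-form variational lower bound for the chain z1 -> z2 -> x
    under q(z1) q(z2) = N(mu1, var1) N(mu2, var2)."""
    # E_q[log p(x, z1, z2)]: three unit-variance Gaussian log-densities.
    expected_log_joint = (
        -1.5 * math.log(2.0 * math.pi)
        - 0.5 * (mu1**2 + var1)                   # E[z1^2]
        - 0.5 * ((mu2 - mu1) ** 2 + var1 + var2)  # E[(z2 - z1)^2]
        - 0.5 * ((x - mu2) ** 2 + var2)           # E[(x - z2)^2]
    )
    # -E_q[log q(z)]: entropy of two independent Gaussians.
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * var1) + 0.5 * math.log(
        2.0 * math.pi * math.e * var2
    )
    return expected_log_joint + entropy


def log_evidence(x):
    """log p(x); marginally x ~ N(0, 3) in this model."""
    return -0.5 * math.log(2.0 * math.pi * 3.0) - x**2 / 6.0


x_obs = 1.0  # hypothetical observation
print(elbo(0.0, 1.0, 0.0, 1.0, x_obs))  # bound at an arbitrary starting point
print(log_evidence(x_obs))              # the bound never exceeds this value
```

The gap between the two printed numbers is the Kullback-Leibler divergence from $q$ to the exact posterior, so making the bound larger is the same as making the approximation better.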
4.2. Coordinate Ascent
In this section, we will look more closely at coordinate ascent, one of the simplest methods for optimizing the variational parameters. Coordinate ascent exploits the structure of the mean-field approximation: it optimizes one variational parameter at a time, while keeping the others fixed. When each of these one-dimensional problems has a closed-form solution, the optimal value can be found analytically, without using gradient-based methods. Coordinate ascent iterates over the variational parameters until convergence, or until a maximum number of iterations is reached. Because each exact update can only increase the variational lower bound or leave it unchanged, the method is simple and stable, but it also has some limitations: it can get stuck in local optima, and it requires the existence of closed-form solutions for each variational parameter.
To illustrate coordinate ascent, let’s continue with the example of the probabilistic model with two latent variables, $z_1$ and $z_2$, and one observed variable, $x$. We have already defined the mean-field approximation as:
$$
q(\mathbf{z}) = q(z_1) q(z_2)
$$
where $q(z_1) = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q(z_2) = \mathcal{N}(\mu_2, \sigma_2^2)$ are the variational distributions of $z_1$ and $z_2$, respectively. The variational parameters that we need to optimize are $\mu_1$, $\sigma_1^2$, $\mu_2$, and $\sigma_2^2$. We can use coordinate ascent to find the optimal values of these parameters by following these steps:
- Initialize the variational parameters with some starting values. For example, we can set $\mu_1 = 0$, $\sigma_1^2 = 1$, $\mu_2 = 0$, and $\sigma_2^2 = 1$.
- Optimize the variational parameter $\mu_1$ while keeping the others fixed. To do this, we need to find the value of $\mu_1$ that maximizes the variational lower bound, which is given by:
$$
\mathcal{L}(q) = \mathbb{E}_q[\log p(x, \mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})]
$$
We can use calculus to find the optimal value of $\mu_1$ by setting the derivative of $\mathcal{L}(q)$ with respect to $\mu_1$ to zero and solving for $\mu_1$. Only the terms coming from $p(z_1)$ and $p(z_2|z_1)$ depend on $\mu_1$, so the condition is $-\mu_1 + (\mu_2 - \mu_1) = 0$. This gives us:
$$
\mu_1^* = \frac{\mu_2}{2}
$$
where $\mu_2$ is the current value of the variational parameter for $q(z_2)$. The observed data $x$ does not appear in this update, because $z_1$ influences $x$ only through $z_2$. For example, if $\mu_2 = 0$, we get $\mu_1^* = 0$.
- Optimize the variational parameter $\sigma_1^2$ while keeping the others fixed. As before, we set the derivative of $\mathcal{L}(q)$ with respect to $\sigma_1^2$ to zero and solve for $\sigma_1^2$. The condition is $-1 + \frac{1}{2\sigma_1^2} = 0$, which gives us:
$$
\sigma_1^{2*} = \frac{1}{2}
$$
This value does not depend on the other variational parameters or the observed data, so it is constant for any choice of $\mu_2$ and $x$.
- Optimize the variational parameter $\mu_2$ while keeping the others fixed. As before, we set the derivative of $\mathcal{L}(q)$ with respect to $\mu_2$ to zero and solve for $\mu_2$. The condition is $-(\mu_2 - \mu_1) + (x - \mu_2) = 0$, which gives us:
$$
\mu_2^* = \frac{\mu_1 + x}{2}
$$
where $\mu_1$ is the current value of the variational parameter for $q(z_1)$, and $x$ is the observed data. For example, if $\mu_1 = 0$ and $x = 1$, we get $\mu_2^* = 0.5$.
- Optimize the variational parameter $\sigma_2^2$ while keeping the others fixed. As before, we set the derivative of $\mathcal{L}(q)$ with respect to $\sigma_2^2$ to zero and solve for $\sigma_2^2$. The condition is $-1 + \frac{1}{2\sigma_2^2} = 0$, which gives us:
$$
\sigma_2^{2*} = \frac{1}{2}
$$
This value does not depend on the other variational parameters or the observed data, so it is constant for any choice of $\mu_1$ and $x$.
- Repeat the four updates above until convergence, or until a maximum number of iterations is reached. Convergence can be checked by monitoring the change in either the variational lower bound or the variational parameters: for example, we can stop when the absolute change between consecutive iterations falls below a small threshold, such as $10^{-6}$. A maximum number of iterations, such as 100, guards against slow convergence.
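This is how coordinate ascent works for optimizing the variational parameters. For this model, the updates converge to $\mu_1 = x/3$ and $\mu_2 = 2x/3$, which coincide with the exact posterior means, while the variational variances of $1/2$ are smaller than the exact posterior marginal variances of $2/3$. This underestimate is the price of ignoring the correlation between $z_1$ and $z_2$.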
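The whole procedure fits in a short loop. The sketch below is a minimal illustration, assuming the initial values from the text, a stopping rule based on parameter change, and a hypothetical observation $x = 1$. The update used for $\mu_1$ is $\mu_2 / 2$, which is what setting the derivative of the lower bound with respect to $\mu_1$ to zero gives for this model. The function name `cavi` is an assumption.

```python
def cavi(x, max_iters=100, tol=1e-6):
    """Coordinate ascent for q(z1) q(z2) on the chain z1 -> z2 -> x."""
    mu1, var1, mu2, var2 = 0.0, 1.0, 0.0, 1.0  # initial values from the text
    iteration = 0
    for iteration in range(1, max_iters + 1):
        old = (mu1, var1, mu2, var2)
        mu1 = mu2 / 2.0        # z1 is pulled toward its prior mean 0 and toward z2
        var1 = 0.5             # constant: does not depend on x or on q(z2)
        mu2 = (mu1 + x) / 2.0  # z2 is pulled toward z1 and toward the observation
        var2 = 0.5             # constant: does not depend on x or on q(z1)
        new = (mu1, var1, mu2, var2)
        # Stop when no parameter moved by more than the threshold.
        if max(abs(a - b) for a, b in zip(new, old)) < tol:
            break
    return (mu1, var1, mu2, var2), iteration


params, iterations = cavi(x=1.0)  # x = 1 is a hypothetical observation
print(params, iterations)
```

For $x = 1$ the means settle near $1/3$ and $2/3$, matching the exact posterior means, while both variances stay at $1/2$ rather than the exact marginal value of $2/3$.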
5. Applications of Variational Inference in Deep Learning
Variational inference is a powerful and versatile technique that can be applied to many probabilistic deep learning models and problems. In this section, we will briefly review some of the most common and popular applications of variational inference in deep learning, and provide some references for further reading.
One of the most prominent applications of variational inference is variational autoencoders (VAEs). VAEs are generative models that can learn to produce realistic and diverse samples of complex data, such as images, text, and audio. A VAE consists of two components, an encoder and a decoder, both parametrized by deep neural networks and trained end-to-end using variational inference. The encoder takes an input data point and outputs the parameters of a variational distribution over a latent variable, which is a low-dimensional representation of the data. The decoder takes a latent variable and outputs the parameters of the likelihood function, which models the distribution of the data given the latent variable.
The key idea of variational inference in VAEs is that the variational distribution approximates the true posterior distribution of the latent variable given the data and the model. It is usually chosen to be a simple and tractable distribution, most commonly a Gaussian with diagonal covariance. Because the encoder network outputs its parameters, the variational distribution is conditioned on the input data, and a single network serves every data point.
The objective function of VAEs is the variational lower bound, which balances a reconstruction term and a regularization term. The reconstruction term measures how well the decoder network can reconstruct the input data from the latent variable. The regularization term measures the divergence between the variational distribution and the prior distribution of the latent variable, which is usually chosen to be a standard Gaussian. The objective can be optimized using stochastic gradient methods: latent variables are sampled from the variational distribution with the reparameterization trick, which writes each sample as a deterministic function of the encoder outputs and independent noise, so that gradients of the variational lower bound can flow back to the encoder.
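To make the two terms of the objective concrete, here is a minimal sketch of a single-sample estimate of the variational lower bound for one data point. It assumes a diagonal Gaussian variational distribution, a standard Gaussian prior, and a unit-variance Gaussian likelihood. The functions `toy_encode` and `toy_decode` are hypothetical stand-ins for the encoder and decoder networks, and no gradient computation is shown.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def elbo_estimate(x, encode, decode, rng):
    """Single-sample estimate of the lower bound for one data point.

    Assumes a unit-variance Gaussian likelihood, so the reconstruction term
    is a negative squared error; the additive constant is dropped.
    """
    mu, log_var = encode(x)
    z = reparameterize(mu, log_var, rng)
    x_hat = decode(z)
    reconstruction = -0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction - kl_to_standard_normal(mu, log_var)


# Hypothetical stand-ins for the encoder and decoder networks.
def toy_encode(x):
    return [0.5 * sum(x)], [-1.0]  # mean and log-variance of a 1-D latent


def toy_decode(z):
    return [z[0], z[0]]            # 2-D reconstruction from the 1-D latent


rng = random.Random(0)
print(elbo_estimate([1.0, 1.0], toy_encode, toy_decode, rng))
```

In a real VAE the encoder and decoder would be neural networks, and an automatic differentiation library would supply the gradients of this estimate with respect to their weights.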
VAEs can generate new data points by sampling latent variables from the prior distribution and passing them through the decoder network. VAEs can also perform various tasks, such as data compression, denoising, inpainting, interpolation, etc. VAEs are one of the most widely used and studied generative models in deep learning, and have many extensions and variations. For more details and examples of VAEs, you can refer to the following papers and tutorials:
- Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
- Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
- Kingma, D. P., Rezende, D. J., Mohamed, S., & Welling, M. (2014). Semi-supervised learning with deep generative models. In Advances in neural information processing systems (pp. 3581-3589).
- Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., & Welling, M. (2016). Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems (pp. 4743-4751).
Another important application of variational inference is Bayesian neural networks (BNNs). BNNs are neural networks that incorporate uncertainty and regularization into the model parameters. BNNs treat the weights and biases of the network as random variables, rather than fixed values. This allows the network to represent uncertainty about its own parameters and helps to reduce overfitting. BNNs can also provide probabilistic predictions, rather than point estimates, which can be useful for decision making and risk assessment.
BNNs can be trained using variational inference, by defining a variational distribution for the network parameters, which approximates the true posterior distribution given the data and the model. The variational distribution is usually chosen to be a factorized Gaussian, which means that each network parameter has an independent Gaussian distribution with its own mean and variance. These means and variances are the variational parameters, and in the standard formulation they are learned directly by gradient-based optimization.
The objective function of BNNs is the variational lower bound, which balances the data fit and the complexity penalty. The data fit measures how well the network can predict the output given the input and the network parameters. The complexity penalty measures the divergence between the variational distribution and the prior distribution of the network parameters, which is usually chosen to be a standard Gaussian. The objective function can be optimized using stochastic gradient descent, by sampling network parameters from the variational distribution and computing the gradients of the variational lower bound. BNNs can perform various tasks, such as classification, regression, and reinforcement learning. BNNs are one of the most promising and challenging models in deep learning, and have many extensions and variations.
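The two terms of the BNN objective can be sketched for the smallest possible case, a single linear layer with a factorized Gaussian over its weights. This is an illustration under stated assumptions, not a full training loop: the softplus parametrization of the standard deviation, the unit-variance Gaussian likelihood, and all function names and data values are choices made for the example.

```python
import math
import random


def softplus(rho):
    """Map an unconstrained parameter to a positive standard deviation."""
    return math.log1p(math.exp(rho))


def kl_gaussian_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a single weight."""
    return 0.5 * (sigma**2 + mu**2 - 1.0) - math.log(sigma)


def negative_elbo(mus, rhos, inputs, targets, rng):
    """One-sample estimate of the loss: complexity penalty minus data fit."""
    sigmas = [softplus(r) for r in rhos]
    # Reparameterized weight sample: w = mu + sigma * eps.
    weights = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mus, sigmas)]
    # Data fit: Gaussian log-likelihood with unit noise variance, constants dropped.
    log_lik = 0.0
    for features, y in zip(inputs, targets):
        prediction = sum(w * f for w, f in zip(weights, features))
        log_lik += -0.5 * (y - prediction) ** 2
    # Complexity penalty: KL from the factorized Gaussian to the N(0, 1) prior.
    complexity = sum(
        kl_gaussian_to_standard_normal(m, s) for m, s in zip(mus, sigmas)
    )
    return complexity - log_lik


# Hypothetical toy data: two inputs with two features each.
inputs = [[1.0, 0.0], [0.0, 1.0]]
targets = [1.0, -1.0]
rng = random.Random(0)
print(negative_elbo([0.0, 0.0], [0.0, 0.0], inputs, targets, rng))
```

Minimizing this quantity with respect to the means and the unconstrained `rhos` is equivalent to maximizing the variational lower bound. A practical implementation would average over minibatches and use automatic differentiation for the gradients.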
For more details and examples of BNNs, you can refer to the following papers and tutorials:
- Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015). Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
- Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (pp. 1050-1059).
- Graves, A. (2011). Practical variational inference for neural networks. In Advances in neural information processing systems (pp. 2348-2356).
- Louizos, C., & Welling, M. (2016). Structured and efficient variational deep learning with matrix Gaussian posteriors. In International Conference on Machine Learning (pp. 1708-1716).
- Tran, D., Ranganath, R., & Blei, D. M. (2016). Variational Gaussian process. In International Conference on Learning Representations.
These are just some of the many applications of variational inference in deep learning. Variational inference is a general and flexible technique that can be applied to any probabilistic model that involves latent variables and intractable posterior distributions. Variational inference can also be combined with other techniques, such as Monte Carlo methods, normalizing flows, and amortized inference, to improve the accuracy and efficiency of the approximation. Variational inference is an active and exciting area of research in machine learning, and we expect to see more developments and innovations in the future. For more information and resources on variational inference, you can refer to the following books, papers, and courses:
- Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859-877.
- Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2), 1-305.
- Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT Press.
- Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
- Blei, D. M. (2016). Variational inference. Columbia University.
- Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelligence. Nature, 521(7553), 452-459.
In the next and final section, we will summarize the main points of this blog.
6. Conclusion
In this blog, we have introduced the fundamentals of variational inference and probabilistic deep learning. We have explained how variational inference works and the key concepts involved, such as the variational lower bound, the Kullback-Leibler divergence, the mean-field approximation, and coordinate ascent. We have also reviewed some of the most common and popular applications of variational inference in deep learning, such as variational autoencoders and Bayesian neural networks, and provided references and resources for further reading.
Variational inference applies to probabilistic models with latent variables and intractable posterior distributions, and it combines well with other techniques, such as Monte Carlo methods, normalizing flows, and amortized inference, that improve the accuracy and efficiency of the approximation. It remains an active area of research in machine learning, and we expect to see more developments in the future.
We hope you enjoyed this blog and learned something new and useful about variational inference and probabilistic deep learning. If you have any questions, comments, or feedback, please feel free to leave them in the comment section below. Thank you for reading and happy learning!