This blog introduces the fundamentals of Gaussian processes, a powerful probabilistic framework for modeling complex functions and data distributions. You will learn how to choose kernel functions, perform Gaussian process regression and classification, and deal with noise and uncertainty.
1. Introduction
Probabilistic deep learning is a branch of machine learning that combines the power of deep neural networks with the uncertainty quantification of probabilistic models. It allows us to build models that can learn from complex and noisy data, and also provide reliable estimates of their confidence and error.
One of the most popular and versatile probabilistic models is the Gaussian process, a collection of random variables, any finite subset of which has a joint Gaussian distribution. Gaussian processes can be used to model complex functions and data distributions, and perform various tasks such as regression, classification, clustering, and dimensionality reduction.
In this blog, you will learn the fundamentals of Gaussian processes, and how to use them for probabilistic deep learning. You will learn:
- What are Gaussian processes, and what are their advantages and limitations?
- How to choose kernel functions, which determine the shape and behavior of Gaussian processes?
- How to perform Gaussian process regression, which is the task of predicting a continuous output given some inputs?
- How to perform Gaussian process classification, which is the task of predicting a discrete output given some inputs?
By the end of this blog, you will have a solid understanding of Gaussian processes, and how to apply them to your own problems. You will also be able to use some of the existing libraries and frameworks that implement Gaussian processes, such as GPflow, GPyTorch, and Pyro.
Ready to dive into the world of Gaussian processes? Let’s get started!
2. What are Gaussian processes?
A Gaussian process is a probabilistic model that defines a distribution over functions. It can be seen as a generalization of the multivariate normal distribution, where instead of having a finite number of random variables, we have infinitely many.
Why would we want to model functions as distributions? Well, imagine that you have some data points that represent the relationship between an input variable x and an output variable y. You want to find a function f that can describe this relationship, and also make predictions for new inputs. However, you don’t know the exact form of f, and you also have some uncertainty about the data points, such as measurement errors or noise. In this case, a Gaussian process can help you to capture both the uncertainty in the data and the uncertainty in the function.
How does a Gaussian process work? The main idea is that any finite subset of the function values f(x) has a joint Gaussian distribution, with a mean function m(x) and a covariance function k(x, x’). The mean function represents the expected value of the function at any point, and the covariance function represents the similarity between the function values at different points. The covariance function is also known as the kernel function, and it plays a crucial role in determining the shape and behavior of the Gaussian process.
Now, what happens when we observe some data points? The Gaussian process can be updated using the Bayes’ rule, to obtain a posterior distribution over functions that are consistent with the data. The posterior distribution can then be used to make predictions for new inputs, by computing the mean and the variance of the function values at those points. The mean represents the most likely prediction, and the variance represents the uncertainty of the prediction.
Gaussian processes are a powerful and flexible way to model functions and data distributions, and to perform probabilistic inference and prediction. However, they also have some limitations and challenges, such as computational complexity, choice of kernel functions, and scalability to large datasets. We will discuss these issues and how to overcome them in the following sections.
2.1. Definition and properties
In this section, we will formally define what a Gaussian process is and describe some of its important properties. We will also introduce some notation and terminology that will be used throughout the blog.
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. A Gaussian distribution, also known as a normal distribution, is a probability distribution that is symmetric and bell-shaped, and is characterized by two parameters: a mean and a variance. The mean represents the expected value of the random variable, and the variance represents the spread or uncertainty of the random variable. A joint Gaussian distribution is a probability distribution over multiple random variables, that specifies the mean and the covariance of the random variables. The covariance measures the degree of linear dependence or correlation between the random variables.
A function f is a rule that maps an input x to an output y. A function can be seen as a collection of random variables, where each random variable is the output of the function at a given input. A Gaussian process is a distribution over functions, that specifies the mean and the covariance of the function values at any finite set of inputs. A Gaussian process can be written as:
$$f \sim \mathcal{GP}(m, k)$$
where m is a mean function, and k is a covariance function or kernel function. The mean function represents the expected value of the function at any input, and the kernel function represents the similarity or correlation between the function values at different inputs. The kernel function is a function of two inputs, and it returns a scalar value that indicates how similar or correlated the function values are at those inputs. The kernel function must be symmetric and positive semidefinite; many common kernels are also stationary, meaning that they depend only on the difference between the two inputs. We will discuss more about kernel functions in the next section.
One way to think about a Gaussian process is as an infinite-dimensional generalization of a multivariate normal distribution. A multivariate normal distribution is a probability distribution over a finite number of random variables, that specifies the mean vector and the covariance matrix of the random variables. A Gaussian process is a probability distribution over an infinite number of random variables, that specifies the mean function and the kernel function of the random variables. A Gaussian process can be seen as a limit of a multivariate normal distribution, as the number of random variables goes to infinity.
Another way to think about a Gaussian process is as a prior distribution over functions, that encodes our beliefs or assumptions about the properties of the functions, such as smoothness, periodicity, linearity, etc. A prior distribution is a probability distribution that represents our initial knowledge or uncertainty about a quantity, before observing any data. A Gaussian process can be used as a prior distribution over functions, that reflects our prior beliefs or expectations about the functions, based on the choice of the mean function and the kernel function. A Gaussian process can then be updated with data, to obtain a posterior distribution over functions, that represents our updated knowledge or uncertainty about the functions, after observing the data. A posterior distribution is a probability distribution that is obtained by applying the Bayes’ rule, which is a formula that relates the prior distribution, the likelihood function, and the evidence. We will discuss more about the posterior distribution in the following section.
Some of the advantages of using Gaussian processes are:
- They are non-parametric, meaning that they do not assume a fixed or finite number of parameters to describe the functions. They can adapt to the complexity and variability of the data, and avoid overfitting or underfitting.
- They are probabilistic, meaning that they provide a measure of uncertainty or confidence for the predictions. They can account for the noise and variability in the data, and avoid overconfidence or underconfidence.
- They are flexible, meaning that they can model a wide range of functions and data distributions, by choosing different mean functions and kernel functions. They can capture various properties and patterns of the functions, such as smoothness, periodicity, linearity, etc.
Some of the challenges of using Gaussian processes are:
- They are computationally expensive, meaning that they require a lot of time and memory to perform inference and prediction. They have a computational complexity of O(n³) and a memory complexity of O(n²), where n is the number of data points. They can become intractable for large datasets.
- They are sensitive to the choice of the kernel function, meaning that the performance and accuracy of the Gaussian process depend heavily on the kernel function. Choosing a suitable kernel function for a given problem can be difficult and requires domain knowledge and experimentation.
- They are limited to Gaussian likelihoods, meaning that they assume that the data or the noise have a Gaussian distribution. This assumption may not hold for some problems, such as classification, where the outputs are discrete or categorical. In such cases, the Gaussian process needs to be modified or approximated to handle non-Gaussian likelihoods.
In the next sections, we will discuss how to overcome these challenges and how to use Gaussian processes for various tasks, such as regression and classification.
2.2. Prior and posterior distributions
In this section, we will discuss how to use a Gaussian process as a prior distribution over functions, and how to update it with data to obtain a posterior distribution over functions. We will also show how to compute the mean and the variance of the posterior distribution, and how to use them for prediction and uncertainty quantification.
A prior distribution is a probability distribution that represents our initial knowledge or uncertainty about a quantity, before observing any data. A Gaussian process can be used as a prior distribution over functions, that reflects our prior beliefs or expectations about the functions, based on the choice of the mean function and the kernel function. For example, if we choose a zero mean function and a squared exponential kernel function, we are expressing that we expect the functions to be smooth and continuous, and to vary more in regions where the kernel function is high, and less in regions where the kernel function is low.
A posterior distribution is a probability distribution that is obtained by applying the Bayes’ rule, which is a formula that relates the prior distribution, the likelihood function, and the evidence. The likelihood function is a probability distribution that represents how likely the data are given the quantity, and the evidence is a normalization constant that ensures that the posterior distribution is a valid probability distribution. The Bayes’ rule can be written as:
$$p(\text{quantity}|\text{data}) = \frac{p(\text{data}|\text{quantity})p(\text{quantity})}{p(\text{data})}$$
where p denotes the probability, and the vertical bar denotes conditioning. The Bayes’ rule can be interpreted as saying that the posterior distribution is proportional to the product of the likelihood function and the prior distribution, and that the evidence is the normalizing factor that makes the posterior distribution integrate (or sum) to one.
When we use a Gaussian process as a prior distribution over functions, and we observe some data points that represent the relationship between an input variable x and an output variable y, we can use the Bayes’ rule to update the Gaussian process and obtain a posterior distribution over functions that are consistent with the data. The posterior distribution can then be used to make predictions for new inputs, by computing the mean and the variance of the function values at those points. The mean represents the most likely prediction, and the variance represents the uncertainty of the prediction.
How do we compute the mean and the variance of the posterior distribution? The answer is that we can use some properties of the multivariate normal distribution, which are the following:
- If we have a joint multivariate normal distribution over two random vectors a and b, with a mean vector μ and a covariance matrix Σ, we can write it as:
$$\begin{bmatrix}a\\b\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\mu_a\\\mu_b\end{bmatrix}, \begin{bmatrix}\Sigma_{aa} & \Sigma_{ab}\\\Sigma_{ba} & \Sigma_{bb}\end{bmatrix}\right)$$
where μa and μb are the mean vectors of a and b, and Σaa, Σab, Σba, and Σbb are the submatrices of the covariance matrix that correspond to the covariances between a and a, a and b, b and a, and b and b, respectively.
- If we have a joint multivariate normal distribution over two random vectors a and b, and we condition on one of them, say b, we can obtain the conditional distribution of a given b, which is also a multivariate normal distribution, with a mean vector and a covariance matrix given by:
$$\begin{aligned}a|b &\sim \mathcal{N}(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(b - \mu_b),\ \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})\end{aligned}$$
where Σbb⁻¹ is the inverse of the matrix Σbb.
- If we have a joint multivariate normal distribution over two random vectors a and b, and we marginalize over one of them, say b, we can obtain the marginal distribution of a, which is also a multivariate normal distribution, with a mean vector and a covariance matrix given by:
$$\begin{aligned}a &\sim \mathcal{N}(\mu_a, \Sigma_{aa})\end{aligned}$$
Using these properties, we can compute the mean and the variance of the posterior distribution of a Gaussian process, as follows:
- Let f be a function that is drawn from a Gaussian process prior distribution with a mean function m and a kernel function k, and let y be a vector of observed outputs at some inputs X. We can write the joint distribution of f and y as:
$$\begin{bmatrix}f\\y\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}m(X)\\m(X)\end{bmatrix}, \begin{bmatrix}k(X, X) & k(X, X)\\k(X, X) & k(X, X) + \sigma^2I\end{bmatrix}\right)$$
where m(X) is the vector of the mean function values at X, k(X, X) is the matrix of the kernel function values between the inputs in X, σ² is the variance of the Gaussian noise, and I is the identity matrix.
- Let f* be a vector of function values at some new inputs X*, and let y* be a vector of outputs at X*. We can write the posterior predictive distributions of f* and y* given y as:
$$\begin{aligned}f_*|y &\sim \mathcal{N}\big(m(X_*) + k(X_*, X)[k(X, X) + \sigma^2I]^{-1}(y - m(X)),\\ &\quad k(X_*, X_*) - k(X_*, X)[k(X, X) + \sigma^2I]^{-1}k(X, X_*)\big)\\ y_*|y &\sim \mathcal{N}\big(m(X_*) + k(X_*, X)[k(X, X) + \sigma^2I]^{-1}(y - m(X)),\\ &\quad k(X_*, X_*) + \sigma^2I - k(X_*, X)[k(X, X) + \sigma^2I]^{-1}k(X, X_*)\big)\end{aligned}$$
where m(X*) is the vector of the mean function values at X*, k(X*, X*) is the matrix of the kernel function values at X* and X*, and k(X*, X) is the matrix of the kernel function values at X* and X.
- The mean and the variance of the posterior distribution of f* given y are given by the mean vector and the covariance matrix of the conditional distribution of f* given y, which are:
$$\begin{aligned}\text{mean}(f_*|y) &= m(X_*) + k(X_*, X)[k(X, X) + \sigma^2I]^{-1}(y - m(X))\\\text{var}(f_*|y) &= k(X_*, X_*) - k(X_*, X)[k(X, X) + \sigma^2I]^{-1}k(X, X_*)\end{aligned}$$
The mean and the variance of the posterior distribution of y* given y are the same, except that the noise term σ²I is added to the covariance, as shown in the expression for y*|y above. These two quantities are all we need to make predictions for new inputs and to quantify their uncertainty.
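To make these formulas concrete, here is a minimal NumPy sketch of posterior prediction with a zero mean function and a squared exponential kernel. The toy data, kernel choice, and hyperparameter values are assumptions made purely for illustration; they are not part of the derivation above.

import numpy as np

def rbf_kernel(A, B, variance=1.0, lengthscale=1.0):
    # Squared exponential kernel matrix between 1-D inputs A and B
    return variance * np.exp(-0.5 * (A - B.T) ** 2 / lengthscale ** 2)

# Toy training data: noisy observations of a sine function (illustrative assumption)
X = np.linspace(0, 6, 20)[:, None]
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(20)
X_star = np.linspace(-1, 7, 100)[:, None]
noise_var = 0.1 ** 2

# Kernel matrices appearing in the posterior formulas
K = rbf_kernel(X, X)                       # k(X, X)
K_star = rbf_kernel(X_star, X)             # k(X*, X)
K_star_star = rbf_kernel(X_star, X_star)   # k(X*, X*)

# Posterior mean and covariance of f* given y (zero prior mean)
Ky = K + noise_var * np.eye(len(X))        # k(X, X) + sigma^2 I
alpha = np.linalg.solve(Ky, y)
mean_post = K_star @ alpha
cov_post = K_star_star - K_star @ np.linalg.solve(Ky, K_star.T)

# Pointwise predictive standard deviation of the latent function f*
std_post = np.sqrt(np.diag(cov_post))

For the noisy outputs y*, the same code applies with noise_var added to the diagonal of cov_post.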
3. How to choose kernel functions?
As we have seen in the previous section, kernel functions are essential for defining Gaussian processes. They determine how similar the function values are at different points, and thus affect the shape and behavior of the Gaussian process. Choosing a suitable kernel function is therefore a crucial step in probabilistic deep learning.
But how do we choose a kernel function? There are many possible kernel functions, each with different characteristics and assumptions. Some of the most common kernel functions are:
- The squared exponential kernel, also known as the radial basis function (RBF) kernel, which is smooth and infinitely differentiable. It assumes that the function is stationary, meaning that its properties do not change over the input space.
- The Matérn kernel, which is a generalization of the squared exponential kernel, with an additional parameter that controls the smoothness of the function. It can model functions that are less smooth or more irregular than the squared exponential kernel.
- The periodic kernel, which assumes that the function has a periodic pattern, with a fixed period and amplitude. It can model functions that repeat themselves over time or space, such as seasonal or cyclical phenomena.
- The linear kernel, which assumes that the function is linear, meaning that it has a constant slope and intercept. It can model functions that have a linear relationship between the input and the output, such as linear regression.
- The polynomial kernel, which is a generalization of the linear kernel, with an additional parameter that controls the degree of the polynomial. It can model functions that have a nonlinear relationship between the input and the output, such as polynomial regression.
These are just some examples of kernel functions, and there are many more to choose from. You can also combine different kernel functions to create more complex and expressive kernels, such as adding, multiplying, or scaling them. For example, you can use a sum of a periodic kernel and a squared exponential kernel to model a function that has both a periodic and a smooth component.
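As a rough sketch of how such a combination might look in GPflow (the hyperparameter values below are placeholder assumptions, not recommendations):

import gpflow

# A smooth component plus a periodic component (illustrative hyperparameters)
smooth = gpflow.kernels.SquaredExponential(variance=1.0, lengthscales=2.0)
periodic = gpflow.kernels.Periodic(gpflow.kernels.SquaredExponential(), period=1.0)

# Kernels can be combined with + and *, producing sum and product kernels
combined = smooth + periodic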
So, how do you decide which kernel function to use for your problem? There is no definitive answer to this question, as different kernel functions may work better or worse depending on the data and the task. However, some general guidelines are:
- Choose a kernel function that reflects your prior knowledge or assumptions about the function you want to model. For example, if you know that the function is periodic, use a periodic kernel. If you know that the function is smooth, use a squared exponential kernel.
- Choose a kernel function that is flexible enough to capture the complexity and variability of the function, but not so flexible that it overfits the data or becomes computationally expensive. For example, if the function is simple and linear, use a linear kernel. If the function is complex and nonlinear, use a polynomial kernel.
- Choose a kernel function that is interpretable and explainable, meaning that you can understand how it affects the Gaussian process and the predictions. For example, avoid using too many or too complicated kernel functions, as they may make the Gaussian process difficult to analyze or visualize.
Choosing a kernel function is not an easy task, and it may require some trial and error and experimentation. Fortunately, there are some methods and tools that can help you to choose and optimize kernel functions, such as hyperparameter optimization and model selection. We will discuss these methods and tools in the next section.
3.1. Common kernels and their characteristics
In this section, we will explore some of the common kernel functions that are used for Gaussian processes, and their characteristics. We will also see how to implement them using Python and some of the popular libraries for Gaussian processes, such as GPflow, GPyTorch, and Pyro.
As a reminder, a kernel function is a function that takes two inputs, x and x’, and returns a measure of similarity between them, denoted by k(x, x’). A kernel function must be symmetric, meaning that k(x, x’) = k(x’, x), and positive definite, meaning that the matrix of kernel values for any finite set of inputs must be positive semidefinite. A kernel function can be seen as a dot product between two feature vectors, ϕ(x) and ϕ(x’), in some high-dimensional or infinite-dimensional feature space, such that k(x, x’) = ϕ(x) · ϕ(x’). However, we do not need to explicitly compute the feature vectors, as the kernel function can implicitly represent them.
Let’s look at some examples of kernel functions, and how they affect the Gaussian process.
Squared exponential kernel
The squared exponential kernel, also known as the radial basis function (RBF) kernel, is one of the most widely used kernel functions for Gaussian processes. It is defined as:
$$k(x, x') = \sigma^2 \exp \left( -\frac{1}{2} \frac{(x - x')^2}{\ell^2} \right)$$
where σ² is the variance parameter, which controls the amplitude of the function, and ℓ is the length scale parameter, which controls the smoothness of the function. The squared exponential kernel is infinitely differentiable, meaning that it can model very smooth functions. It also assumes that the function is stationary, meaning that its properties do not change over the input space.
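Before looking at the GPflow version, here is a small NumPy sketch of the formula above for one-dimensional inputs, together with a quick check that the resulting kernel matrix is symmetric and positive semidefinite (up to numerical error); the grid of inputs and the hyperparameter values are arbitrary assumptions:

import numpy as np

def squared_exponential(x, x_prime, variance=1.0, lengthscale=0.5):
    # k(x, x') = sigma^2 * exp(-0.5 * (x - x')^2 / l^2) for scalar inputs
    return variance * np.exp(-0.5 * (x - x_prime) ** 2 / lengthscale ** 2)

# Build the kernel matrix on a grid of inputs and check its basic properties
X = np.linspace(0, 10, 50)
K = squared_exponential(X[:, None], X[None, :])

print(np.allclose(K, K.T))                     # symmetric
print(np.all(np.linalg.eigvalsh(K) > -1e-8))   # positive semidefinite (numerically)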
Here is an example of a Gaussian process with a squared exponential kernel, and its implementation in Python using GPflow:
# Import GPflow and other libraries
import gpflow
import numpy as np
import matplotlib.pyplot as plt

# Define the kernel function
kernel = gpflow.kernels.SquaredExponential(variance=1.0, lengthscales=0.5)

# Define the input range
X = np.linspace(0, 10, 100)[:, None]

# Compute the prior covariance matrix and add a small jitter for numerical stability
K = kernel(X).numpy() + 1e-6 * np.eye(100)

# Sample a function from the prior distribution
f = np.random.multivariate_normal(mean=np.zeros(100), cov=K)

# Plot the sampled function
plt.plot(X, f)
plt.title("Gaussian process with squared exponential kernel")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()
3.2. Hyperparameter optimization and model selection
Once you have chosen a kernel function for your Gaussian process, you may wonder how to set its parameters, such as the variance and the length scale. These parameters are also known as hyperparameters, and they can have a significant impact on the performance and accuracy of your Gaussian process. Hyperparameter optimization is the process of finding the optimal values of the hyperparameters that maximize the likelihood of the data given the Gaussian process.
There are different methods and algorithms for hyperparameter optimization, such as grid search, random search, Bayesian optimization, gradient-based optimization, and evolutionary algorithms. The choice of the method depends on the complexity and dimensionality of the hyperparameter space, the computational resources available, and the desired level of accuracy and robustness. Some of the popular libraries for hyperparameter optimization are Scikit-Optimize, Optuna, and Hyperopt.
Here is an example of hyperparameter optimization for a Gaussian process with a squared exponential kernel, using GPflow and Scikit-Optimize:
# Import GPflow, Scikit-Optimize, and other libraries
import gpflow
import skopt
import numpy as np

# X and y are the training inputs and outputs (see the regression example in Section 4)
# Define the kernel function
kernel = gpflow.kernels.SquaredExponential()

# Define the Gaussian process model
model = gpflow.models.GPR(data=(X, y), kernel=kernel, noise_variance=1.0)

# Define the objective function to minimize
def objective(params):
    # Update the model hyperparameters
    model.kernel.variance.assign(params[0])
    model.kernel.lengthscales.assign(params[1])
    model.likelihood.variance.assign(params[2])
    # Return the negative log marginal likelihood
    return -model.log_marginal_likelihood().numpy()

# Define the hyperparameter search space
space = [(1e-2, 1e2),  # variance
         (1e-2, 1e2),  # lengthscales
         (1e-2, 1e2)]  # noise variance

# Run the Bayesian optimization
result = skopt.gp_minimize(objective, space, n_calls=20)

# Print the optimal values
print("Optimal values:")
print("variance = {:.4f}".format(result.x[0]))
print("lengthscales = {:.4f}".format(result.x[1]))
print("noise variance = {:.4f}".format(result.x[2]))
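Since GPflow models are built on TensorFlow, an alternative that is often used in practice is to maximize the log marginal likelihood directly with a gradient-based optimizer, as mentioned above. Here is a brief sketch, assuming the same X and y training data as in the previous example:

import gpflow

# Define the kernel and the model (X, y are the training data assumed above)
kernel = gpflow.kernels.SquaredExponential()
model = gpflow.models.GPR(data=(X, y), kernel=kernel, noise_variance=1.0)

# Maximize the log marginal likelihood with a gradient-based (L-BFGS) optimizer
opt = gpflow.optimizers.Scipy()
opt.minimize(model.training_loss, model.trainable_variables)

# Inspect the optimized hyperparameters
gpflow.utilities.print_summary(model)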
Another question that you may have is how to compare different kernel functions, or different combinations of kernel functions, and choose the best one for your problem. This is known as model selection, and it involves evaluating the performance and accuracy of different models using some criteria or metrics. Some of the common criteria or metrics for model selection are:
- The log marginal likelihood, which is the logarithm of the probability of the data given the model, with the function values integrated out. It measures how well the model fits the data, and also automatically penalizes model complexity. A higher log marginal likelihood indicates a better model.
- The Akaike information criterion (AIC), which is defined as AIC = 2k - 2 log L, where k is the number of parameters and L is the likelihood of the data given the model. It measures the trade-off between the model fit and the model complexity. A lower AIC indicates a better model.
- The Bayesian information criterion (BIC), which is defined as BIC = k log n - 2 log L, where k is the number of parameters, n is the number of data points, and L is the likelihood of the data given the model. It measures the trade-off between the model fit and the model complexity, with a stronger penalty for the number of parameters. A lower BIC indicates a better model.
- The cross-validation score, which is the average of the model performance on different subsets of the data that are used for training and testing. It measures how well the model generalizes to new data, and avoids overfitting. A higher cross-validation score indicates a better model.
Here is an example of model selection for a Gaussian process, using GPflow and Scikit-Learn:
# Import GPflow, Scikit-Learn, and other libraries
import gpflow
import numpy as np
from sklearn.model_selection import KFold

# X and y are the training inputs and outputs (see the regression example in Section 4)
n = X.shape[0]

# Define the kernel functions to compare
kernels = [gpflow.kernels.SquaredExponential(),
           gpflow.kernels.Matern52(),
           gpflow.kernels.Periodic(gpflow.kernels.SquaredExponential()),
           gpflow.kernels.Linear()]

# Define the cross-validation method
cv = KFold(n_splits=5)

# Evaluate each kernel using the different criteria
scores = []
for kernel in kernels:
    # Fit a GP regression model with this kernel on all the data
    model = gpflow.models.GPR(data=(X, y), kernel=kernel, noise_variance=1.0)
    gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

    # Log marginal likelihood of the fitted model
    log_lik = model.log_marginal_likelihood().numpy()

    # AIC and BIC computed from the log marginal likelihood and the number of hyperparameters
    k = len(model.trainable_parameters)
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik

    # Cross-validation score (negative mean squared error on held-out folds)
    mse = []
    for train_idx, test_idx in cv.split(X):
        cv_model = gpflow.models.GPR(data=(X[train_idx], y[train_idx]),
                                     kernel=kernel, noise_variance=1.0)
        gpflow.optimizers.Scipy().minimize(cv_model.training_loss, cv_model.trainable_variables)
        mean, _ = cv_model.predict_f(X[test_idx])
        mse.append(np.mean((mean.numpy() - y[test_idx]) ** 2))
    cv_score = -np.mean(mse)

    scores.append({"log_marginal_likelihood": log_lik, "AIC": aic, "BIC": bic,
                   "cross_val_score": cv_score})

# Print the scores for each kernel and criterion
for kernel, score in zip(kernels, scores):
    print("Kernel {}:".format(type(kernel).__name__))
    for criterion, value in score.items():
        print("  {} = {:.4f}".format(criterion, value))
    print()
4. How to perform Gaussian process regression?
Gaussian process regression is the task of predicting a continuous output variable y given some input variables x, using a Gaussian process as the prior distribution over functions. In this section, you will learn how to perform Gaussian process regression, and how to implement it using Python and GPflow.
In the following subsections, we will walk through the basic algorithm and its implementation step by step, and then see how to deal with noise and uncertainty in the observations.
4.1. The basic algorithm and its implementation
In this section, you will learn the basic algorithm for Gaussian process regression, and how to implement it using Python and GPflow. Gaussian process regression is the task of predicting a continuous output variable y given some input variables x, using a Gaussian process as the prior distribution over functions.
The basic algorithm for Gaussian process regression is as follows:
- Choose a mean function m(x) and a kernel function k(x, x’) for the Gaussian process prior.
- Obtain some training data D = {(xi, yi)} that represent the observed function values.
- Compute the posterior distribution over functions using the Bayes’ rule, which is also a Gaussian process with an updated mean function mpost(x) and covariance function kpost(x, x’).
- Make predictions for new inputs x* by computing the mean and the variance of the posterior distribution at those points.
Let’s see how to implement this algorithm using Python and GPflow, a library that provides a high-level interface for Gaussian processes. First, we need to import some modules and set up the random seed:
import numpy as np
import matplotlib.pyplot as plt
import gpflow
import tensorflow as tf

# Fix the random seeds for reproducibility (NumPy is used to generate the synthetic data)
np.random.seed(42)
tf.random.set_seed(42)
Next, we need to choose a mean function and a kernel function for the Gaussian process prior. For simplicity, we will use a zero mean function and a squared exponential kernel function, which is also known as the radial basis function (RBF) kernel. We can create these objects using GPflow:
mean_function = gpflow.mean_functions.Zero()
kernel_function = gpflow.kernels.RBF()
Then, we need to obtain some training data that represent the observed function values. For this example, we will use a synthetic dataset that is generated by adding some Gaussian noise to a sinusoidal function. We can create this dataset using NumPy:
# Generate some input points
X = np.linspace(0, 10, 100)[:, None]

# Generate some output points by adding noise to a sinusoidal function
Y = np.sin(X) + 0.1 * np.random.randn(100, 1)

# Plot the dataset
plt.scatter(X, Y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Training data')
plt.show()
Finally, we need to compute the posterior distribution over functions and make predictions for new inputs. We can do this by creating a GPflow model object that takes the mean function, the kernel function, and the training data as arguments. The model object has a method called predict_f that returns the mean and the variance of the posterior distribution at some test points. We can use this method to make predictions for some new inputs and plot the results:
# Create a GPflow model object
model = gpflow.models.GPR(data=(X, Y), mean_function=mean_function, kernel=kernel_function)

# Generate some test points
X_test = np.linspace(-2, 12, 200)[:, None]

# Make predictions using the model
mean, var = model.predict_f(X_test)

# Convert the TensorFlow tensors to NumPy arrays for plotting
mean = mean.numpy()
var = var.numpy()

# Plot the predictions and the 95% confidence interval
plt.plot(X_test, mean, 'C0', lw=2)
plt.fill_between(X_test[:, 0],
                 mean[:, 0] - 1.96 * np.sqrt(var[:, 0]),
                 mean[:, 0] + 1.96 * np.sqrt(var[:, 0]),
                 color='C0', alpha=0.2)
plt.scatter(X, Y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Gaussian process regression')
plt.show()
As you can see, the Gaussian process regression model can capture the underlying trend of the data, and also provide a measure of uncertainty for the predictions. The confidence interval is narrower near the observed points, and wider away from them. You can also try different mean functions and kernel functions, and see how they affect the shape and behavior of the Gaussian process.
In this section, you learned the basic algorithm for Gaussian process regression, and how to implement it using Python and GPflow. In the next section, you will learn how to deal with noise and uncertainty in Gaussian process regression.
4.2. Dealing with noise and uncertainty
In the previous section, you learned the basic algorithm for Gaussian process regression, and how to implement it using Python and GPflow. However, in real-world scenarios, the observed function values may not be exact, but rather corrupted by some noise or uncertainty. In this section, you will learn how to deal with noise and uncertainty in Gaussian process regression, and how to modify the model accordingly.
The simplest way to account for noise and uncertainty in Gaussian process regression is to assume that the observed output variable y is related to the latent function value f by some additive Gaussian noise:
$$y = f + \epsilon$$
where ε is a zero-mean Gaussian random variable with variance σ². This means that the likelihood function of the data given the function values is also Gaussian:
$$p(y|f) = \mathcal{N}(y|f, \sigma^2)$$
This assumption keeps the computation analytically tractable: under the prior, the noisy observations y are themselves jointly Gaussian, with the same mean function as the latent function and a covariance function that is increased by the noise term:
$$m_y(x) = m(x)$$
$$k_y(x, x') = k(x, x') + \sigma^2 \delta(x, x')$$
where δ(x, x') is the Kronecker delta function, which is 1 if x = x' and 0 otherwise. In the posterior formulas from Section 2.2, this is exactly the k(X, X) + σ²I term that appears inside the matrix inverse. The noise variance σ² is a hyperparameter that can be estimated from the data, along with the other hyperparameters of the kernel function.
How does this affect the predictions for new inputs? The posterior mean and covariance of f* are computed with the same formulas as in Section 2.2, and the predictive variance for the outputs y* carries an extra σ² term. This means that the predictions are more uncertain, and the confidence interval is wider. For example, consider a Gaussian process regression model with the same kernel function and dataset as before, but with a larger noise variance.
The confidence interval becomes wider, especially at the observed points, where a noise-free model would have had zero predictive variance. You can also try different values of the noise variance, and see how they affect the predictions.
To implement this model using Python and GPflow, we only need to modify one line of code. Instead of creating a GPflow model object with the default noise variance of the Gaussian likelihood, we specify the noise variance explicitly as an argument. For example, to set the noise variance to 0.1, we can do the following:
# Create a GPflow model object with a specified noise variance
model = gpflow.models.GPR(data=(X, Y), mean_function=mean_function,
                          kernel=kernel_function, noise_variance=0.1)
The rest of the code remains the same as before, and we can make predictions and plot the results as usual.
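As a quick illustration of the effect of the noise variance, the following sketch refits the model with a few different values and compares the average width of the resulting 95% confidence intervals; it simply reuses the X, Y, mean_function, and kernel_function objects defined earlier.

import numpy as np
import gpflow

X_test = np.linspace(-2, 12, 200)[:, None]

for noise_variance in [0.01, 0.1, 1.0]:
    model = gpflow.models.GPR(data=(X, Y), mean_function=mean_function,
                              kernel=kernel_function, noise_variance=noise_variance)
    # Predictive distribution of the noisy outputs y*
    mean, var = model.predict_y(X_test)
    avg_width = np.mean(2 * 1.96 * np.sqrt(var.numpy()))
    print("noise variance = {:.2f}, average 95% interval width = {:.3f}".format(
        noise_variance, avg_width))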
In this section, you learned how to deal with noise and uncertainty in Gaussian process regression, and how to modify the model accordingly. In the next section, you will learn how to perform Gaussian process classification, which is the task of predicting a discrete output variable given some input variables.
5. How to perform Gaussian process classification?
Gaussian process classification is the task of predicting a discrete output variable y given some input variable x, using a Gaussian process prior over the latent function f that relates x and y. For example, you may want to classify an email as spam or not spam, based on some features of the email.
Unlike Gaussian process regression, where the output variable is continuous and Gaussian distributed, Gaussian process classification involves a non-Gaussian likelihood function that maps the latent function f to the output variable y. For example, if y is binary, you may use a Bernoulli likelihood function, which models the probability of y as a logistic function of f:
$$p(y=1|x,f) = \frac{1}{1+\exp(-f(x))}$$
The problem with this approach is that the posterior distribution over f given the data is no longer Gaussian, and therefore, it is not analytically tractable. This means that we cannot compute the exact mean and variance of the predictions, as we did in Gaussian process regression.
So, how can we perform Gaussian process classification in practice? There are two main methods that can approximate the posterior distribution over f and the predictive distribution over y: the Laplace approximation and the expectation propagation method. We will briefly describe each of these methods in the following subsections.
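Before describing those approximations, here is a rough sketch of what a binary Gaussian process classifier can look like in GPflow. Note that GPflow approximates the non-Gaussian posterior variationally rather than with the Laplace or expectation propagation methods, and that its Bernoulli likelihood uses a probit link by default rather than the logistic function shown above; the toy data are an assumption for illustration.

import numpy as np
import gpflow

# Toy binary classification data with labels in {0, 1} (illustrative assumption)
X = np.linspace(0, 10, 50)[:, None]
Y = (np.sin(X) > 0).astype(float)

# Gaussian process prior over the latent function f, with a Bernoulli likelihood
kernel = gpflow.kernels.SquaredExponential()
likelihood = gpflow.likelihoods.Bernoulli()
model = gpflow.models.VGP(data=(X, Y), kernel=kernel, likelihood=likelihood)

# Fit the variational approximation by maximizing the evidence lower bound
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

# Predict class probabilities for new inputs
X_test = np.linspace(-2, 12, 100)[:, None]
prob, _ = model.predict_y(X_test)   # approximate p(y* = 1 | x*)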
5.1. The challenges and solutions
Gaussian process classification poses some challenges that make it more difficult than Gaussian process regression. The main challenge is that the posterior distribution over the latent function f is not Gaussian anymore, due to the non-Gaussian likelihood function. This means that we cannot use the same formulas that we used for Gaussian process regression, and we need to find some way to approximate the posterior distribution.
Another challenge is that the predictive distribution over the output variable y is not analytically tractable either, since it involves integrating over the posterior distribution of f. This means that we cannot compute the exact probability of y given a new input x, and we need to use some numerical methods to estimate it.
How can we overcome these challenges and perform Gaussian process classification effectively? There are two main methods that can help us to approximate the posterior and the predictive distributions: the Laplace approximation and the expectation propagation method. These methods are based on different techniques to approximate the true distributions with simpler ones, such as Gaussian or exponential family distributions. We will explain how these methods work and how to implement them in the next subsections.
5.2. The Laplace approximation and the expectation propagation methods
The Laplace approximation and the expectation propagation methods are two techniques that can approximate the posterior distribution over the latent function f and the predictive distribution over the output variable y in Gaussian process classification. They are based on different assumptions and approximations, and they have different advantages and disadvantages. We will briefly describe how each of these methods works and how to implement them using some of the existing libraries and frameworks.
The Laplace approximation
The Laplace approximation is a method that approximates the posterior distribution over f with a Gaussian distribution centered at the mode of the true distribution. The mode is the value of f that maximizes the posterior probability, and the covariance of the Gaussian approximation is the inverse of the Hessian matrix of the negative log-posterior evaluated at the mode, so that the approximation matches the local curvature. The mode can be found with an iterative algorithm such as Newton's method, and the curvature is evaluated once the iterations converge.
Once the Laplace approximation is obtained, the predictive distribution over y can be approximated by integrating over the Gaussian approximation of f, using some numerical methods such as quadrature or Monte Carlo sampling. The predictive mean and variance can then be computed from the predictive distribution.
The Laplace approximation is relatively simple and fast to compute, and it can handle any likelihood function that is twice differentiable. However, it also has some limitations, such as being sensitive to the choice of the initial value of f, and being inaccurate when the posterior distribution over f is skewed or multimodal.
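To illustrate the idea, here is a compact NumPy sketch of the Newton iteration behind the Laplace approximation, for binary labels y in {0, 1} and the logistic likelihood from Section 5. It is a didactic version without the numerical safeguards used in practice, and it assumes the kernel matrix K over the training inputs is given:

import numpy as np

def laplace_approximation(K, y, n_iter=20):
    # Gaussian approximation to p(f | y) for a GP prior N(0, K) and a logistic likelihood
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-f))   # sigmoid(f), the predicted probabilities
        grad = y - pi                   # gradient of log p(y | f)
        W = pi * (1.0 - pi)             # negative Hessian of log p(y | f) (diagonal)
        # Newton step: f_new = (K^{-1} + W)^{-1} (W f + grad), written without K^{-1}
        B = np.eye(n) + K * W           # I + K diag(W)
        f = np.linalg.solve(B, K @ (W * f + grad))
    # Covariance of the Gaussian approximation at the mode: (K^{-1} + W)^{-1}
    cov = np.linalg.inv(np.linalg.inv(K) + np.diag(W))
    return f, cov

The resulting Gaussian N(f, cov) can then be used in place of the exact posterior when approximating the predictive distribution for new inputs.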
To implement the Laplace approximation, you can use some of the existing libraries that support Gaussian process classification; for example, scikit-learn's GaussianProcessClassifier uses a Laplace approximation internally, and GPy provides a Laplace inference method. GPflow takes a different route and approximates the posterior variationally: you can use the gpflow.models.VGP class to create a variational Gaussian process classifier, optimize its parameters with the gpflow.optimizers.Scipy class, and make predictions for new inputs with the predict_y method.
The expectation propagation method
The expectation propagation method approximates the posterior distribution over f with a Gaussian distribution that matches the marginal moments of the true distribution. The marginal moments are the mean and variance of each element of f, which can be computed by integrating over the rest of the elements. Expectation propagation uses an iterative algorithm that updates one approximate factor (site) at a time, by minimizing a local Kullback-Leibler divergence between the true and the approximate distributions (moment matching), until convergence.
Once the expectation propagation approximation is obtained, the predictive distribution over y can be approximated by integrating over the Gaussian approximation of f, using the same numerical methods as before. The predictive mean and variance can then be computed from the predictive distribution.
The expectation propagation method is more accurate and robust than the Laplace approximation, and it can handle any likelihood function that belongs to the exponential family. However, it also has some limitations, such as being more computationally expensive and complex to implement, and being prone to numerical instability and local optima.
To implement the expectation propagation method, you can use libraries such as GPy, which provides an expectation propagation inference method for classification. GPflow again relies on variational approximations rather than expectation propagation: the gpflow.models.SVGP class creates a sparse variational Gaussian process model that also scales to larger datasets, its variational parameters can be optimized with the gpflow.optimizers.NaturalGradient class, and predictions for new inputs can be made with the predict_y method.
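As a rough sketch of the GPflow route just mentioned, the following trains a sparse variational classifier with natural gradients for the variational parameters and Adam for the remaining hyperparameters. The toy data, inducing-point locations, learning rates, and iteration count are all illustrative assumptions:

import numpy as np
import tensorflow as tf
import gpflow

# Toy binary classification data with labels in {0, 1} (illustrative assumption)
X = np.linspace(0, 10, 200)[:, None]
Y = (np.sin(X) > 0).astype(float)
Z = np.linspace(0, 10, 20)[:, None]   # inducing point locations

model = gpflow.models.SVGP(kernel=gpflow.kernels.SquaredExponential(),
                           likelihood=gpflow.likelihoods.Bernoulli(),
                           inducing_variable=Z,
                           num_data=len(X))

# Natural gradients update the variational parameters; Adam updates everything else
gpflow.set_trainable(model.q_mu, False)
gpflow.set_trainable(model.q_sqrt, False)
natgrad = gpflow.optimizers.NaturalGradient(gamma=0.1)
adam = tf.optimizers.Adam(learning_rate=0.01)
loss = model.training_loss_closure((X, Y))

for _ in range(200):
    natgrad.minimize(loss, var_list=[(model.q_mu, model.q_sqrt)])
    adam.minimize(loss, var_list=model.trainable_variables)

# Predict class probabilities for new inputs
prob, _ = model.predict_y(np.linspace(-2, 12, 100)[:, None])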
6. Conclusion
In this blog, you have learned the fundamentals of Gaussian processes, and how to use them for probabilistic deep learning. You have learned:
- What are Gaussian processes, and what are their advantages and limitations?
- How to choose kernel functions, which determine the shape and behavior of Gaussian processes?
- How to perform Gaussian process regression, which is the task of predicting a continuous output given some inputs?
- How to perform Gaussian process classification, which is the task of predicting a discrete output given some inputs?
- How to use the Laplace approximation and the expectation propagation methods, which are two techniques that can approximate the posterior and the predictive distributions in Gaussian process classification?
By the end of this blog, you have gained a solid understanding of Gaussian processes, and how to apply them to your own problems. You have also learned how to use some of the existing libraries and frameworks that implement Gaussian processes, such as GPflow, GPyTorch, and Pyro.
Gaussian processes are a powerful and flexible way to model functions and data distributions, and to perform probabilistic inference and prediction. They can handle complex and noisy data, and provide reliable estimates of their confidence and error. They can also be combined with other probabilistic models and deep neural networks, to create more expressive and robust models.
We hope that this blog has inspired you to explore the world of Gaussian processes, and to use them for your own projects. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading!