Machine Learning Evaluation Mastery: How to Use Bootstrap for Model Evaluation and Comparison

This blog teaches you how to use bootstrap to evaluate and compare the performance of your machine learning models with confidence intervals and hypothesis tests.

1. Introduction

Machine learning is a powerful tool for solving complex problems and making predictions based on data. However, how can you be sure that your machine learning model is reliable and accurate? How can you compare the performance of different models and choose the best one for your task?

One way to answer these questions is to use bootstrap, a statistical technique that allows you to estimate the uncertainty and variability of your model’s performance. Bootstrap can help you to evaluate and compare your models using confidence intervals and hypothesis tests, which are essential tools for statistical inference and decision making.

In this blog, you will learn how to use bootstrap for model evaluation and comparison in machine learning. You will learn:

What is bootstrap and why is it useful for machine learning?
How to perform bootstrap for model evaluation and comparison?
How to implement bootstrap in Python using scikit-learn and numpy?

By the end of this blog, you will have a solid understanding of bootstrap and how to apply it to your machine learning projects. You will also be able to use bootstrap to quantify how far you can trust a model’s measured performance and to select the best model for your task.

Are you ready to master bootstrap for machine learning? Let’s get started!

2. What is Bootstrap and Why is it Useful for Machine Learning?

Bootstrap is a statistical technique that allows you to estimate the uncertainty and variability of any statistic or parameter based on a sample of data. It is also known as bootstrapping or resampling with replacement.

The basic idea of bootstrap is to create many new samples from the original data by randomly drawing observations with replacement. Each new sample is called a bootstrap sample and has the same size as the original data. Then, you can calculate the statistic or parameter of interest for each bootstrap sample and use the results to estimate its distribution, mean, standard deviation, confidence interval, or any other measure of uncertainty.

Bootstrap is useful for machine learning because it can help you to evaluate and compare the performance of your models without assuming a particular distribution for the data or for the performance metric. It does assume that the observations you resample are independent of one another. For example, you can use bootstrap to:

Estimate the accuracy, precision, recall, F1-score, ROC curve, or any other metric of your model and its confidence interval.
Compare the performance of two or more models and test whether the difference is statistically significant or not.
Select the best model among a set of candidates based on the bootstrap ranking or the bootstrap aggregation methods.

Bootstrap is also a flexible and powerful technique that can be applied to any type of data, model, or problem. You can use bootstrap for classification, regression, clustering, dimensionality reduction, feature selection, or any other machine learning task.

How does bootstrap work in practice? How can you perform bootstrap for model evaluation and comparison in machine learning? Let’s find out in the next section!

2.1. The Bootstrap Method

The bootstrap method is a simple and powerful technique that allows you to estimate the uncertainty and variability of any statistic or parameter based on a sample of data. The bootstrap method consists of three main steps:

Resampling: Create many new samples from the original data by randomly drawing observations with replacement. Each new sample is called a bootstrap sample and has the same size as the original data.
Estimation: Calculate the statistic or parameter of interest for each bootstrap sample. For example, you can calculate the mean, median, standard deviation, correlation, accuracy, precision, recall, or any other metric for each bootstrap sample.
Inference: Use the results of the estimation step to estimate the distribution, mean, standard deviation, confidence interval, or any other measure of uncertainty of the statistic or parameter of interest. For example, you can use the mean and standard deviation of the bootstrap estimates to construct a confidence interval for the true value of the statistic or parameter.

The bootstrap method is based on the idea that the original sample stands in for the population it was drawn from, so that resampling from the sample mimics drawing new samples from the population. When the sample is reasonably large and its observations are independent, the spread of the bootstrap estimates is a good approximation of the uncertainty in the statistic or parameter, and no particular distribution (such as the normal distribution) has to be assumed. The method is less reliable for very small samples and for statistics that depend on extreme values, such as the maximum.
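The three steps can be sketched in a few lines of numpy. This is a minimal illustration on a synthetic sample; the data, the helper name bootstrap_statistic, and the choice of 1000 resamples are assumptions made for the example, not part of any library.

```python
import numpy as np

# Synthetic sample used only for illustration (an assumption, not real data)
data_rng = np.random.default_rng(0)
sample = data_rng.normal(loc=5.0, scale=2.0, size=200)


def bootstrap_statistic(data, statistic, n_bootstrap=1000, seed=0):
    """Return the statistic computed on n_bootstrap resamples of data."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # Step 1 (resampling): draw n indices with replacement
        idx = rng.integers(0, n, size=n)
        # Step 2 (estimation): compute the statistic on the bootstrap sample
        estimates[b] = statistic(data[idx])
    return estimates


estimates = bootstrap_statistic(sample, np.mean)

# Step 3 (inference): summarize the bootstrap distribution
std_error = estimates.std(ddof=1)
# Normal-approximation interval: point estimate +/- 1.96 standard errors
ci_normal = (sample.mean() - 1.96 * std_error, sample.mean() + 1.96 * std_error)
# Percentile interval: read the bounds off the bootstrap distribution
ci_percentile = tuple(np.percentile(estimates, [2.5, 97.5]))

print(f"Sample mean: {sample.mean():.3f}")
print(f"Bootstrap standard error: {std_error:.3f}")
print(f"95% normal interval: ({ci_normal[0]:.3f}, {ci_normal[1]:.3f})")
print(f"95% percentile interval: ({ci_percentile[0]:.3f}, {ci_percentile[1]:.3f})")
```

For the mean, the bootstrap standard error should land close to the textbook value (sample standard deviation divided by the square root of the sample size), which is a handy sanity check before applying the same loop to a metric that has no simple formula.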

How can you apply the bootstrap method to evaluate and compare your machine learning models? What are the advantages and disadvantages of the bootstrap method? How can you implement the bootstrap method in Python? We will answer these questions in the following sections.

2.2. Bootstrap Applications in Machine Learning

Bootstrap has many applications in machine learning, especially for model evaluation and comparison. In this section, we will discuss some of the most common and useful applications of bootstrap for machine learning, such as:

Estimating the confidence interval of a model’s performance metric: You can use bootstrap to estimate the range of values that contains the true value of a model’s performance metric (such as accuracy, precision, recall, F1-score, etc.) with a certain probability (such as 95%). This can help you to assess the reliability and variability of your model’s performance and to compare it with other models or benchmarks.
Comparing the performance of two or more models using hypothesis testing: You can use bootstrap to test whether the difference in performance between two or more models is statistically significant or not. This can help you to select the best model for your task and to avoid preferring a model because of a difference that is only due to chance.
Selecting the best model among a set of candidates using bootstrap ranking or bootstrap aggregation: You can use bootstrap to rank or aggregate the performance of a set of candidate models based on their bootstrap estimates. This can help you to choose the most robust and stable model that performs well on average and across different bootstrap samples.
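One possible reading of bootstrap ranking is to count how often each candidate comes out on top across bootstrap samples of the test set. The sketch below uses made-up per-example correctness arrays for three hypothetical models (model_a, model_b, model_c); in a real project these would come from comparing each model’s predictions with the test labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test = 300

# Hypothetical per-example correctness (True = correct prediction).
# The underlying accuracies 0.90, 0.88 and 0.80 are arbitrary choices.
correct = {
    "model_a": rng.random(n_test) < 0.90,
    "model_b": rng.random(n_test) < 0.88,
    "model_c": rng.random(n_test) < 0.80,
}
names = list(correct)
# One column per model, one row per test example
scores = np.column_stack([correct[name] for name in names]).astype(float)

n_bootstrap = 1000
wins = np.zeros(len(names), dtype=int)
for _ in range(n_bootstrap):
    # Resample test examples; the same rows are used for every model
    idx = rng.integers(0, n_test, size=n_test)
    accuracies = scores[idx].mean(axis=0)
    # Ties are credited to the first model in the list
    wins[np.argmax(accuracies)] += 1

win_rate = dict(zip(names, wins / n_bootstrap))
for name, rate in win_rate.items():
    print(f"{name}: best in {rate:.1%} of bootstrap samples")
```

A model that wins in almost every bootstrap sample is a stable choice, while win rates that are split between two models suggest that the test set is too small to separate them.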

These applications of bootstrap can be applied to any type of machine learning model, such as classification, regression, clustering, dimensionality reduction, feature selection, etc. They can also be applied to any type of data, such as numerical, categorical, text, image, audio, video, etc.

How can you perform these applications of bootstrap in practice? What are the steps and the code to implement them in Python? We will show you how in the next section!

3. How to Perform Bootstrap for Model Evaluation and Comparison

In this section, we will show you how to perform bootstrap for model evaluation and comparison in machine learning. We will use Python as the programming language and scikit-learn and numpy as the main libraries. We will also use scipy for the statistical tests and matplotlib for visualization.

We will follow these general steps for each application of bootstrap:

Prepare the data: Load and split the data into training and test sets. You can use any dataset of your choice, but we will use the breast cancer dataset from scikit-learn as an example.
Train the models: Train one or more models on the training set using any machine learning algorithm of your choice. You can use any model that suits your task, but we will use logistic regression and support vector machine as examples.
Evaluate the models: Evaluate the models on the test set using any performance metric of your choice. You can use any metric that suits your task, but we will use accuracy as an example.
Perform bootstrap: Perform bootstrap resampling, estimation, and inference on the test set using the models and the metric. You can use any bootstrap method of your choice, but we will use the percentile bootstrap as an example.
Analyze the results: Analyze the results of the bootstrap and draw conclusions. You can use any analysis method of your choice, but we will use descriptive statistics and visualization as examples.

Are you ready to learn how to perform bootstrap for model evaluation and comparison in machine learning? Let’s dive into the code and the examples!

3.1. Bootstrap Resampling

Bootstrap resampling is the first step of the bootstrap method. It involves creating many new samples from the original data by randomly drawing observations with replacement. Each new sample is called a bootstrap sample and has the same size as the original data.

Bootstrap resampling can be done in different ways, depending on the type of data and the type of model. For example, you can use:

Simple bootstrap: Resample whole observations, keeping each feature row together with its label. This is suitable when the observations are independent of one another, which is the usual setting for tabular data and models such as linear regression or logistic regression.
Stratified bootstrap: Resample the data set while preserving the proportion of each class in the labels. This is useful for classification problems with imbalanced classes, where a simple bootstrap sample may contain very few examples of a rare class.
Block bootstrap: Resample contiguous blocks of observations rather than individual observations, so that the dependence within each block is preserved. This is suitable for data with temporal or spatial correlation, such as time series.

Bootstrap resampling can be implemented in Python using the numpy.random.choice function, which allows you to draw random samples from an array with or without replacement. You can also use the sklearn.utils.resample function, which resamples several arrays with the same indices and supports stratification through its stratify parameter.

Here is an example of how to perform simple bootstrap resampling on the breast cancer dataset using numpy:

# Import numpy
import numpy as np

# Load the breast cancer dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = data.data # features
y = data.target # labels

# Define the number of bootstrap samples
n_bootstrap = 100

# Define an empty list to store the bootstrap samples
bootstrap_samples = []

# Loop over the number of bootstrap samples
for i in range(n_bootstrap):
    # Draw random indices with replacement
    indices = np.random.choice(len(X), size=len(X), replace=True)
    # Extract the bootstrap sample
    X_bootstrap = X[indices]
    y_bootstrap = y[indices]
    # Append the bootstrap sample to the list
    bootstrap_samples.append((X_bootstrap, y_bootstrap))
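The other two resampling schemes can be sketched in the same spirit. The stratified draw below relies on the stratify parameter of sklearn.utils.resample; the moving_block_indices helper and the block size of 10 are illustrative assumptions, one simple way to write a moving-block bootstrap rather than a standard library function.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)

# Stratified bootstrap: sample with replacement within each class,
# so the class proportions of y are kept in the bootstrap sample
X_strat, y_strat = resample(
    X, y, replace=True, n_samples=len(X), stratify=y, random_state=0
)
print("Original class share:", y.mean().round(3))
print("Stratified bootstrap class share:", y_strat.mean().round(3))


def moving_block_indices(n, block_size, rng):
    """Indices for one moving-block bootstrap sample of length n."""
    n_blocks = int(np.ceil(n / block_size))
    # Choose block start positions at random, with replacement
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    # Expand each start into a run of consecutive indices, then trim to n
    idx = (starts[:, None] + np.arange(block_size)[None, :]).ravel()
    return idx[:n]


rng = np.random.default_rng(0)
series = np.arange(100)  # stand-in for a time-ordered array (hypothetical)
block_idx = moving_block_indices(len(series), block_size=10, rng=rng)
series_bootstrap = series[block_idx]
print("First block of the resampled series:", series_bootstrap[:10])
```

The block size controls how much of the dependence structure survives the resampling: longer blocks preserve more of it but leave fewer distinct blocks to draw from.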

Now you have a list of bootstrap samples that you can use for model evaluation and comparison. How can you do that? Let’s see in the next section!

3.2. Bootstrap Estimation

Bootstrap estimation is the second step of the bootstrap method. It involves calculating the statistic or parameter of interest for each bootstrap sample. For example, you can calculate the accuracy, precision, recall, F1-score, or any other metric for each bootstrap sample.

Bootstrap estimation can be done in different ways, depending on the type of model and the type of metric. For example, you can use:

Point estimation: Calculate the metric for each bootstrap sample and use the mean or the median of the bootstrap estimates as the point estimate of the true value of the metric.
Interval estimation: Calculate the metric for each bootstrap sample and use the lower and upper percentiles of the bootstrap estimates as the lower and upper bounds of the confidence interval of the true value of the metric.
Distribution estimation: Calculate the metric for each bootstrap sample and use the histogram or the density plot of the bootstrap estimates as an estimate of the sampling distribution of the metric, that is, of how much the measured value would vary from one test set to another.

Bootstrap estimation can be implemented in Python using the sklearn.metrics module, which provides various functions to calculate different metrics for machine learning models. You can also use numpy functions such as np.mean, np.std, and np.percentile to calculate descriptive statistics and percentiles.

Here is an example of how to perform bootstrap estimation on the breast cancer dataset using sklearn and numpy:

# Import numpy and the scikit-learn pieces used below
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Fix the random seed so the resampling is reproducible
np.random.seed(42)

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data # features
y = data.target # labels

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the number of bootstrap samples
# (percentile intervals are noisy with only a few hundred samples)
n_bootstrap = 1000

# Define an empty list to store the bootstrap samples
bootstrap_samples = []

# Loop over the number of bootstrap samples
for i in range(n_bootstrap):
    # Draw random indices with replacement from the test set only,
    # so the model is never scored on data it was trained on
    indices = np.random.choice(len(X_test), size=len(X_test), replace=True)
    # Extract the bootstrap sample
    X_bootstrap = X_test[indices]
    y_bootstrap = y_test[indices]
    # Append the bootstrap sample to the list
    bootstrap_samples.append((X_bootstrap, y_bootstrap))

# Train a logistic regression model on the training set
# (max_iter is raised so the solver can converge on the unscaled features)
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Define an empty list to store the bootstrap estimates
bootstrap_estimates = []

# Loop over the bootstrap samples
for X_bootstrap, y_bootstrap in bootstrap_samples:
    # Evaluate the model on the bootstrap sample using accuracy as the metric
    y_pred = model.predict(X_bootstrap)
    accuracy = accuracy_score(y_bootstrap, y_pred)
    # Append the accuracy to the list
    bootstrap_estimates.append(accuracy)

# Calculate the point estimate of the accuracy using the mean of the bootstrap estimates
point_estimate = np.mean(bootstrap_estimates)
print(f"The point estimate of the accuracy is {point_estimate:.3f}")

# Calculate the 95% confidence interval of the accuracy using the 2.5th and 97.5th percentiles of the bootstrap estimates
lower_bound = np.percentile(bootstrap_estimates, 2.5)
upper_bound = np.percentile(bootstrap_estimates, 97.5)
print(f"The 95% confidence interval of the accuracy is ({lower_bound:.3f}, {upper_bound:.3f})")

# Plot the distribution of the accuracy using the histogram of the bootstrap estimates
import matplotlib.pyplot as plt
plt.hist(bootstrap_estimates, bins=10, edgecolor='black')
plt.xlabel('Accuracy')
plt.ylabel('Frequency')
plt.title('Bootstrap Distribution of the Accuracy')
plt.show()

Now you have the point estimate, the confidence interval, and the distribution of the accuracy of your model based on the bootstrap samples. How can you use them to evaluate and compare your model? Let’s see in the next section!
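Because the model is fixed once it has been trained, its predictions on the test set can be computed a single time and then resampled together with the labels, which gives the same kind of interval without calling predict inside the loop. The helper below is a sketch of that idea; the name bootstrap_ci and its parameters are assumptions made for this post, not a scikit-learn function.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def bootstrap_ci(y_true, y_pred, metric, n_bootstrap=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower bound, upper bound) for metric."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    scores = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # Resample label/prediction pairs with replacement
        idx = rng.integers(0, n, size=n)
        scores[b] = metric(y_true[idx], y_pred[idx])
    # Percentile interval at confidence level 1 - alpha
    lower, upper = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return metric(y_true, y_pred), lower, upper


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
y_pred = model.predict(X_test)  # predictions are computed once

acc, acc_low, acc_high = bootstrap_ci(y_test, y_pred, accuracy_score)
f1, f1_low, f1_high = bootstrap_ci(y_test, y_pred, f1_score)
print(f"Accuracy: {acc:.3f} (95% CI {acc_low:.3f} to {acc_high:.3f})")
print(f"F1-score: {f1:.3f} (95% CI {f1_low:.3f} to {f1_high:.3f})")
```

Passing the metric as a function means the same helper works for any scorer that takes true and predicted labels in that order, which covers most functions in sklearn.metrics.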

3.3. Bootstrap Hypothesis Testing

Bootstrap hypothesis testing is the third step of the bootstrap method. It involves testing whether the difference in performance between two or more models is statistically significant or not. For example, you can test whether the accuracy of model A is higher than the accuracy of model B or not.

Bootstrap hypothesis testing can be done in different ways, depending on the type of hypothesis and the type of test. For example, you can use:

One-sample test: Test whether the metric of a single model differs from a given value. For example, you can test whether the mean accuracy of model A is equal to 0.8 or not.
Two-sample test: Test whether the metrics of two models are equal or not, treating their bootstrap estimates as two separate groups. For example, you can test whether the mean accuracy of model A is equal to the mean accuracy of model B or not.
Paired test: Test whether the difference between the metrics of two models, computed on the same bootstrap samples, is equal to zero or not. Because both models are scored on identical samples, this is usually the most sensitive of the three. For example, you can test whether the mean difference in accuracy between model A and model B is equal to zero or not.

Bootstrap hypothesis testing can be implemented in Python using the scipy.stats module, which provides various functions to perform different statistical tests. The pingouin package offers similar tests, but its pingouin.ttest function returns a pandas DataFrame rather than a (statistic, p-value) pair, so the example below uses scipy.stats.ttest_rel for the paired test.

One caveat applies to all three tests. The bootstrap estimates are derived from the same test set, so they are not independent measurements, and the p-value of a t-test applied to them shrinks as n_bootstrap grows, whatever the data say. Treat the t-tests below as an illustration of the API, and base your conclusions on the bootstrap distribution itself, for example on whether a percentile interval for the difference excludes zero.

Here is an example of how to perform bootstrap hypothesis testing on the breast cancer dataset using scipy:

# Import numpy, scipy.stats, and the scikit-learn pieces used below
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Fix the random seed so the resampling is reproducible
np.random.seed(42)

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data # features
y = data.target # labels

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the number of bootstrap samples
n_bootstrap = 1000

# Define an empty list to store the bootstrap samples
bootstrap_samples = []

# Loop over the number of bootstrap samples
for i in range(n_bootstrap):
    # Draw random indices with replacement from the test set only
    indices = np.random.choice(len(X_test), size=len(X_test), replace=True)
    # Extract the bootstrap sample
    X_bootstrap = X_test[indices]
    y_bootstrap = y_test[indices]
    # Append the bootstrap sample to the list
    bootstrap_samples.append((X_bootstrap, y_bootstrap))

# Train a logistic regression model on the training set
# (max_iter is raised so the solver can converge on the unscaled features)
model_A = LogisticRegression(max_iter=10000)
model_A.fit(X_train, y_train)

# Train a support vector machine model on the training set
model_B = SVC()
model_B.fit(X_train, y_train)

# Define empty lists to store the bootstrap estimates
bootstrap_estimates_A = []
bootstrap_estimates_B = []

# Loop over the bootstrap samples
for X_bootstrap, y_bootstrap in bootstrap_samples:
    # Evaluate the models on the bootstrap sample using accuracy as the metric
    y_pred_A = model_A.predict(X_bootstrap)
    accuracy_A = accuracy_score(y_bootstrap, y_pred_A)
    y_pred_B = model_B.predict(X_bootstrap)
    accuracy_B = accuracy_score(y_bootstrap, y_pred_B)
    # Append the accuracies to the lists
    bootstrap_estimates_A.append(accuracy_A)
    bootstrap_estimates_B.append(accuracy_B)

# Perform a one-sample test to test whether the mean accuracy of model A is equal to 0.8 or not
# Use the scipy.stats.ttest_1samp function
t_stat, p_value = stats.ttest_1samp(bootstrap_estimates_A, 0.8)
print(f"One-sample test: the t-statistic is {t_stat:.3f} and the p-value is {p_value:.3f}")

# Perform a two-sample test to test whether the mean accuracy of model A is equal to the mean accuracy of model B or not
# Use the scipy.stats.ttest_ind function
t_stat, p_value = stats.ttest_ind(bootstrap_estimates_A, bootstrap_estimates_B)
print(f"Two-sample test: the t-statistic is {t_stat:.3f} and the p-value is {p_value:.3f}")

# Perform a paired test to test whether the mean difference in accuracy between model A and model B is equal to zero or not
# Use the scipy.stats.ttest_rel function (both models were scored on the same bootstrap samples)
t_stat, p_value = stats.ttest_rel(bootstrap_estimates_A, bootstrap_estimates_B)
print(f"Paired test: the t-statistic is {t_stat:.3f} and the p-value is {p_value:.3f}")

# Calculate a 95% percentile interval for the difference in accuracy (A minus B)
# Unlike the p-values above, this interval does not tighten just because n_bootstrap is large
differences = np.array(bootstrap_estimates_A) - np.array(bootstrap_estimates_B)
lower_bound, upper_bound = np.percentile(differences, [2.5, 97.5])
print(f"The 95% interval of the accuracy difference is ({lower_bound:.3f}, {upper_bound:.3f})")

Now you have the results of the bootstrap hypothesis testing and you can use them to compare your models and draw conclusions. How can you do that? Let’s see in the next section!
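A p-value can also be read directly from the bootstrap distribution of the difference, without passing the bootstrap estimates to a t-test. The sketch below is one common recipe (shift the bootstrap differences so that they are centred on zero, then count how often they are at least as extreme as the observed difference). The helper name paired_bootstrap_test, the 80/20 split, and the 2000 resamples are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def paired_bootstrap_test(score_a, score_b, n_bootstrap=2000, seed=0):
    """Two-sided paired bootstrap test for a difference in mean per-example score."""
    rng = np.random.default_rng(seed)
    score_a = np.asarray(score_a, dtype=float)
    score_b = np.asarray(score_b, dtype=float)
    n = len(score_a)
    observed = score_a.mean() - score_b.mean()
    diffs = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # The same indices are used for both models, which makes the test paired
        idx = rng.integers(0, n, size=n)
        diffs[b] = score_a[idx].mean() - score_b[idx].mean()
    # Under the null hypothesis the true difference is zero, so shift the
    # bootstrap differences to be (approximately) centred on zero
    null_diffs = diffs - observed
    # Add-one correction keeps the p-value strictly above zero
    p_value = (np.sum(np.abs(null_diffs) >= abs(observed)) + 1) / (n_bootstrap + 1)
    return observed, diffs, p_value


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model_A = LogisticRegression(max_iter=10000).fit(X_train, y_train)
model_B = SVC().fit(X_train, y_train)

# Per-example correctness (1.0 = correct, 0.0 = wrong), computed once
correct_A = (model_A.predict(X_test) == y_test).astype(float)
correct_B = (model_B.predict(X_test) == y_test).astype(float)

observed, diffs, p_value = paired_bootstrap_test(correct_A, correct_B)
print(f"Observed accuracy difference (A - B): {observed:.3f}")
print(f"Bootstrap p-value: {p_value:.3f}")
```

Here the sample size that drives the p-value is the number of test examples, so increasing n_bootstrap only makes the estimate of the p-value more precise.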

4. Examples of Bootstrap for Model Evaluation and Comparison in Python

In this section, we will show you some examples of how to use bootstrap for model evaluation and comparison in Python. We will use the breast cancer dataset from scikit-learn as an example, but you can use any dataset of your choice. We will also use logistic regression and support vector machine as the models, but you can use any machine learning algorithm of your choice.

We will demonstrate how to use bootstrap for the following applications:

Example 1: Comparing Two Classification Models: We will compare the accuracy of logistic regression and support vector machine using bootstrap hypothesis testing and visualization.
Example 2: Comparing Two Regression Models: We will compare the mean squared error of linear regression and ridge regression using bootstrap hypothesis testing and visualization.

For each example, we will follow the general steps of bootstrap that we learned in the previous sections:

Prepare the data: Load and split the data into training and test sets.
Train the models: Train one or more models on the training set using any machine learning algorithm of your choice.
Evaluate the models: Evaluate the models on the test set using any performance metric of your choice.
Perform bootstrap: Perform bootstrap resampling, estimation, and inference on the test set using the models and the metric.
Analyze the results: Analyze the results of the bootstrap and draw conclusions.

Are you ready to see some examples of bootstrap for model evaluation and comparison in Python? Let’s get started!

4.1. Example 1: Comparing Two Classification Models

In this example, we will compare the accuracy of logistic regression and support vector machine using bootstrap hypothesis testing and visualization. We will use the breast cancer dataset from scikit-learn as an example, but you can use any dataset of your choice. We will follow the general steps of bootstrap that we learned in the previous sections:

Prepare the data: Load and split the data into training and test sets.
Train the models: Train a logistic regression model and a support vector machine model on the training set.
Evaluate the models: Evaluate the models on the test set using accuracy as the metric.
Perform bootstrap: Perform bootstrap resampling, estimation, and hypothesis testing on the test set using the models and the metric.
Analyze the results: Analyze the results of the bootstrap and draw conclusions.

Let’s start with the first step: preparing the data.
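The five steps for this example can be sketched end to end as follows. The split ratio, random seeds, number of resamples, and the plot_difference helper are assumptions chosen for illustration; the models use the same settings as in the earlier sections.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Step 1: prepare the data (80/20 split; the seed is an arbitrary choice)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: train the two models on the training set
log_reg = LogisticRegression(max_iter=10000).fit(X_train, y_train)
svm = SVC().fit(X_train, y_train)

# Step 3: evaluate on the test set; predictions are computed once and reused
pred_lr = log_reg.predict(X_test)
pred_svm = svm.predict(X_test)
acc_lr = accuracy_score(y_test, pred_lr)
acc_svm = accuracy_score(y_test, pred_svm)

# Step 4: paired bootstrap over test-set indices
rng = np.random.default_rng(42)
n, n_bootstrap = len(y_test), 2000
boot = np.empty((n_bootstrap, 2))
for b in range(n_bootstrap):
    idx = rng.integers(0, n, size=n)
    boot[b, 0] = np.mean(y_test[idx] == pred_lr[idx])   # logistic regression
    boot[b, 1] = np.mean(y_test[idx] == pred_svm[idx])  # support vector machine
diff = boot[:, 0] - boot[:, 1]

# Step 5: analyze the results with 95% percentile intervals
ci_lr = np.percentile(boot[:, 0], [2.5, 97.5])
ci_svm = np.percentile(boot[:, 1], [2.5, 97.5])
ci_diff = np.percentile(diff, [2.5, 97.5])
# The difference is called significant when its interval excludes zero
significant = bool(ci_diff[0] > 0.0 or ci_diff[1] < 0.0)

print(f"Logistic regression: {acc_lr:.3f} (95% CI {ci_lr[0]:.3f} to {ci_lr[1]:.3f})")
print(f"SVM: {acc_svm:.3f} (95% CI {ci_svm[0]:.3f} to {ci_svm[1]:.3f})")
print(f"Difference: 95% CI {ci_diff[0]:.3f} to {ci_diff[1]:.3f}")
print(f"Interval excludes zero: {significant}")


def plot_difference(diff, ci_diff):
    """Histogram of the bootstrap accuracy differences with the interval marked."""
    import matplotlib.pyplot as plt  # imported here so the rest runs without it

    fig, ax = plt.subplots()
    ax.hist(diff, bins=30, edgecolor="black")
    ax.axvline(0.0, color="red", linestyle="--", label="No difference")
    ax.axvline(ci_diff[0], color="gray", linestyle=":", label="95% interval")
    ax.axvline(ci_diff[1], color="gray", linestyle=":")
    ax.set_xlabel("Accuracy difference (logistic regression - SVM)")
    ax.set_ylabel("Frequency")
    ax.legend()
    return fig
```

Calling plot_difference(diff, ci_diff) and then plt.show() displays the histogram. If the red line at zero falls inside the gray interval markers, the test set does not provide enough evidence to prefer one model over the other.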

4.2. Example 2: Comparing Two Regression Models

In this example, we will compare the mean squared error of linear regression and ridge regression using bootstrap hypothesis testing and visualization. We will use the diabetes dataset from scikit-learn as an example, because the Boston housing dataset (sklearn.datasets.load_boston) was removed in scikit-learn 1.2, but you can use any dataset of your choice. We will follow the general steps of bootstrap that we learned in the previous sections:

Prepare the data: Load and split the data into training and test sets.
Train the models: Train a linear regression model and a ridge regression model on the training set.
Evaluate the models: Evaluate the models on the test set using mean squared error as the metric.
Perform bootstrap: Perform bootstrap resampling, estimation, and hypothesis testing on the test set using the models and the metric.
Analyze the results: Analyze the results of the bootstrap and draw conclusions.

Let’s start with the first step: preparing the data.

To load and split the data, we will use the sklearn.datasets.load_diabetes function and the sklearn.model_selection.train_test_split function. The diabetes dataset contains 442 observations of 10 features and 1 target variable, which is a quantitative measure of disease progression one year after baseline. We will split the data into 80% training and 20% test sets, using a random state of 42 for reproducibility. Here is the code to do that:

# Load the diabetes dataset
from sklearn.datasets import load_diabetes
data = load_diabetes()
X = data.data # features
y = data.target # target

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Now we have the data ready for the next step: training the models.
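The remaining steps can be sketched as follows. This sketch assumes the diabetes dataset bundled with scikit-learn (load_diabetes), chosen here because load_boston is not available in recent scikit-learn releases; the ridge penalty alpha=1.0 and the 2000 resamples are likewise illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Prepare the data (repeated here so the sketch is self-contained)
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the models on the training set
linear = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

# Evaluate the models: per-example squared errors, computed once
sq_err_linear = (y_test - linear.predict(X_test)) ** 2
sq_err_ridge = (y_test - ridge.predict(X_test)) ** 2
mse_linear = sq_err_linear.mean()
mse_ridge = sq_err_ridge.mean()

# Perform bootstrap: resample test examples, same indices for both models
rng = np.random.default_rng(42)
n, n_bootstrap = len(y_test), 2000
diffs = np.empty(n_bootstrap)
for b in range(n_bootstrap):
    idx = rng.integers(0, n, size=n)
    # Difference in mean squared error (linear minus ridge) on this sample
    diffs[b] = sq_err_linear[idx].mean() - sq_err_ridge[idx].mean()

# Analyze the results: 95% percentile interval for the MSE difference
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])

print(f"Linear regression MSE: {mse_linear:.1f}")
print(f"Ridge regression MSE: {mse_ridge:.1f}")
print(f"MSE difference (linear - ridge): 95% CI {ci_low:.1f} to {ci_high:.1f}")
if ci_low > 0 or ci_high < 0:
    print("The interval excludes zero, so the difference looks systematic.")
else:
    print("The interval includes zero, so the models are not clearly different.")
```

A negative difference favors linear regression and a positive one favors ridge regression, because a lower mean squared error is better. A histogram of diffs, drawn as in the earlier accuracy example, gives the corresponding visualization.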

5. Conclusion

In this blog, you have learned how to use bootstrap for model evaluation and comparison in machine learning. You have learned:

What is bootstrap and why is it useful for machine learning?
How to perform bootstrap for model evaluation and comparison?
How to implement bootstrap in Python using scikit-learn and numpy?
How to use bootstrap hypothesis testing and visualization to compare the performance of two or more models?

By using bootstrap, you can estimate the uncertainty and variability of your model’s performance and compare the performance of different models with confidence intervals and hypothesis tests. Bootstrap is a flexible technique that applies to many kinds of data, models, and metrics, provided the resampling scheme matches the structure of the data. It helps you to judge how far a measured difference between models can be trusted and to select the best model for your task.

We hope you have enjoyed this blog and learned something new and useful. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading and happy bootstrapping!