## 1. Introduction

Machine learning models are often used to make predictions or classifications based on data. However, these predictions are not always certain or accurate, and they may have some degree of error or variability. How can we measure and communicate the uncertainty of our predictions? How can we ensure that our models are reliable and trustworthy?

In this blog, we will explore two concepts that can help us handle uncertainty in machine learning: **calibration** and **confidence intervals**. Calibration is the property of a model that reflects how well its predicted probabilities match the true probabilities of the outcomes. Confidence intervals are a way of reporting the range of values that are likely to contain the true value of a parameter or a prediction, with a certain level of confidence.

By learning about these concepts, you will be able to:

- Understand what calibration and confidence intervals are and why they are important for handling uncertainty in machine learning.
- Apply different methods and metrics to calibrate your machine learning models and evaluate their calibration performance.
- Report confidence intervals for your predictions and interpret them correctly.
- Improve the reliability and trustworthiness of your machine learning models and their outputs.

Ready to dive in? Let’s get started!

## 2. What is Calibration and Why is it Important?

Calibration is the property of a machine learning model that reflects how well its predicted probabilities match the true probabilities of the outcomes. For example, if a model predicts that an event has a 70% chance of happening, then we expect that event to happen 70% of the time in reality. A model is said to be **well-calibrated** if its predictions are consistent with the actual frequencies of the outcomes.

Why is calibration important for handling uncertainty in machine learning? There are several reasons:

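- Calibration can help us assess the reliability and trustworthiness of our models and their predictions. A well-calibrated model lets us take its predicted probabilities at face value, because they are neither overconfident nor underconfident. Note that calibration is not the same as accuracy: a model can be accurate but poorly calibrated, or well-calibrated but not very accurate.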
- Calibration can help us make better decisions based on our predictions. For example, if we are using a model to diagnose a disease, we may want to take different actions depending on the predicted probability of having the disease. A well-calibrated model can help us choose the optimal action that minimizes the expected cost or risk.
- Calibration can help us communicate our predictions and their uncertainty to others. For example, if we are presenting our predictions to a client or a stakeholder, we may want to provide them with a measure of how confident we are in our predictions. A well-calibrated model can help us convey our uncertainty in a meaningful and interpretable way.

How can we measure and improve the calibration of our models? In the next sections, we will explore different types of calibration, metrics, and methods that can help us answer these questions.

### 2.1. Definition and Types of Calibration

As defined above, a model is **well-calibrated** if its predicted probabilities are consistent with the actual frequencies of the outcomes: among all the cases where it predicts a 70% chance, the event should happen about 70% of the time.

However, not all models are well-calibrated, and some may exhibit different types of **miscalibration**. Miscalibration can be classified into two main categories: **overconfidence** and **underconfidence**.

A model is **overconfident** if its predicted probabilities are more extreme than the observed frequencies justify. For example, if a model predicts that an event has a 90% chance of happening, but it only happens 60% of the time in reality, then the model is overconfident. Overconfidence leads us to trust predictions more than we should and to overestimate how reliable the model is.

A model is **underconfident** if its predicted probabilities are more cautious than the observed frequencies justify. For example, if a model predicts that an event has a 60% chance of happening, but it happens 90% of the time in reality, then the model is underconfident. Underconfidence leads us to discount predictions that are in fact reliable.
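A rough first check for telling the two cases apart is to compare a model's average confidence with its accuracy. The sketch below is only an illustration: the function name `confidence_gap` is hypothetical, the 0.5 decision threshold is an assumption, and the labels and probabilities are invented toy data.

```python
import numpy as np

def confidence_gap(y_true, y_prob):
    """Mean confidence minus accuracy for a binary classifier.

    Positive values suggest overconfidence, negative values underconfidence.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    # Predict the positive class when its probability is at least 0.5 (assumed threshold).
    y_pred = (y_prob >= 0.5).astype(int)
    # Confidence is the probability assigned to the predicted class.
    confidence = np.where(y_pred == 1, y_prob, 1.0 - y_prob)
    accuracy = np.mean(y_pred == y_true)
    return float(np.mean(confidence) - accuracy)

# Toy data (invented): every prediction is made with 90% confidence,
# but only 6 of the 10 predictions turn out to be correct.
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_prob = np.full(10, 0.9)

gap = confidence_gap(y_true, y_prob)
print(f"confidence - accuracy = {gap:+.2f}")
```

A single number like this hides where the miscalibration occurs, so it is best treated as a quick sanity check rather than a full evaluation.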

How can we detect and quantify the degree of miscalibration in our models? In the next section, we will introduce some metrics and methods that can help us measure the calibration of our models and compare them with other models.

### 2.2. Calibration Metrics and Methods

In this section, we introduce some common metrics and methods for evaluating and improving the calibration of our machine learning models.

One of the most widely used metrics for calibration is the **expected calibration error (ECE)**. To compute it, we divide the predictions into bins based on their predicted probabilities, and in each bin we compare the mean predicted probability with the observed frequency of the outcome. The ECE is the weighted average of the absolute differences across the bins, where each bin is weighted by the fraction of predictions it contains. A lower ECE indicates better calibration, and an ECE of 0 means that predicted probabilities and observed frequencies agree in every bin. Note that the value depends on the number of bins and on how they are chosen, so ECE values are only comparable when they are computed with the same binning.
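To make the binning procedure concrete, here is a minimal sketch of the ECE for a binary classifier, following the description above. The function name `expected_calibration_error`, the choice of 10 equal-width bins, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE for binary labels and predicted positive-class probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Equal-width bins on [0, 1]; the interior edges decide the bin of each prediction.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        mean_pred = y_prob[mask].mean()   # mean predicted probability in the bin
        observed = y_true[mask].mean()    # observed frequency of the positive class
        ece += mask.mean() * abs(mean_pred - observed)  # weight = share of predictions
    return float(ece)

# Toy data (invented): four predictions at 0.25 with one positive (no gap),
# and four predictions at 0.85 with two positives (gap of 0.35).
y_prob = np.array([0.25, 0.25, 0.25, 0.25, 0.85, 0.85, 0.85, 0.85])
y_true = np.array([1, 0, 0, 0, 1, 1, 0, 0])

ece = expected_calibration_error(y_true, y_prob)
print(f"ECE = {ece:.3f}")
```

Here half of the predictions sit in a perfectly calibrated bin and half in a bin that is off by 0.35, so the weighted average is 0.5 × 0 + 0.5 × 0.35.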

A complementary tool is the **reliability diagram**. Strictly speaking, it is a diagnostic plot rather than a metric: for each bin, it plots the mean predicted probability (on the horizontal axis) against the observed frequency of the outcome (on the vertical axis). A well-calibrated model should have a reliability diagram that follows the diagonal line, indicating that the predictions are consistent with the actual frequencies. Points below the diagonal mean that the predicted probabilities are higher than the observed frequencies, and points above the diagonal mean that they are lower.
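The points of a reliability diagram can be computed with a few lines of NumPy. The sketch below is an illustration under stated assumptions: `reliability_points` is a hypothetical helper, the data are simulated, and the "overconfident model" is created artificially by pushing the true probabilities away from 0.5.

```python
import numpy as np

def reliability_points(y_true, y_prob, n_bins=10):
    """Return (mean predicted probability, observed frequency) per non-empty bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])
    mean_pred, observed = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            mean_pred.append(y_prob[mask].mean())
            observed.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(observed)

# Simulated data: outcomes are drawn from the true probabilities p_true.
rng = np.random.default_rng(0)
p_true = rng.uniform(0.0, 1.0, size=20_000)
y_true = rng.binomial(1, p_true)
# Simulate an overconfident model by pushing probabilities away from 0.5.
y_prob = p_true**2 / (p_true**2 + (1.0 - p_true) ** 2)

mean_pred, observed = reliability_points(y_true, y_prob)
for p, f in zip(mean_pred, observed):
    print(f"predicted {p:.2f} -> observed {f:.2f}")
```

Plotting `mean_pred` against `observed` together with the diagonal gives the diagram itself. scikit-learn's `calibration_curve` function returns the same kind of per-bin points, if you prefer a library routine.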

How can we improve the calibration of our models if they are miscalibrated? One of the most popular methods for calibration is **temperature scaling**. Temperature scaling is a simple and effective technique that divides the logits (the inputs to the softmax function) of the model by a single positive number, the temperature, and then recalculates the probabilities using the softmax function. The temperature is fitted on a held-out validation set after training, typically by minimizing the negative log-likelihood. A temperature above 1 softens the probabilities and reduces the confidence of the predictions, while a temperature below 1 sharpens them and increases the confidence. Because dividing all logits by the same number does not change which class has the largest probability, temperature scaling leaves the predicted classes, and therefore the accuracy, unchanged. It can be applied to any model that outputs probabilities using the softmax function, such as neural networks.
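The following sketch shows the idea of temperature scaling on synthetic data. It is not a reference implementation: the helper names (`softmax`, `nll`, `fit_temperature`) are hypothetical, a simple grid search stands in for the gradient-based optimizer that is normally used, and the "overconfident" logits are produced by multiplying calibrated logits by 3.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the labels after dividing logits by T."""
    probs = softmax(logits / temperature)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.25, 5.0, 96)):
    """Pick the temperature on the grid that minimizes validation NLL."""
    losses = [nll(val_logits, val_labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic validation set: labels are sampled from calibrated probabilities,
# then the logits are inflated so that the "model" becomes overconfident.
rng = np.random.default_rng(0)
n, k = 5000, 3
calibrated_logits = rng.normal(size=(n, k))
true_probs = softmax(calibrated_logits)
labels = np.array([rng.choice(k, p=p) for p in true_probs])
overconfident_logits = 3.0 * calibrated_logits

T = fit_temperature(overconfident_logits, labels)
print(f"fitted temperature: {T:.2f}")
print(f"NLL before: {nll(overconfident_logits, labels, 1.0):.3f}")
print(f"NLL after:  {nll(overconfident_logits, labels, T):.3f}")
```

Since the logits were inflated by a factor of 3, the fitted temperature should land close to 3, undoing the inflation. In practice the temperature is fitted on validation data and then applied unchanged to test-time logits.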

In the next section, we will learn how to report confidence intervals for our predictions, which is another way of handling uncertainty in machine learning.

## 3. How to Report Confidence Intervals for Predictions?

Another way of handling uncertainty in machine learning is to report confidence intervals for our predictions. A confidence interval is a range of values that are likely to contain the true value of a parameter or a prediction, with a certain level of confidence. For example, if we predict that the average height of a population is 170 cm, with a 95% confidence interval of [168, 172], then we can say that we are 95% confident that the true average height is between 168 and 172 cm.

Why are confidence intervals useful for handling uncertainty in machine learning? There are several benefits:

- Confidence intervals can help us quantify the uncertainty and variability of our predictions. A narrower confidence interval indicates a higher precision and a lower uncertainty, while a wider confidence interval indicates a lower precision and a higher uncertainty.
- Confidence intervals can help us compare and evaluate different models and methods. For example, if two models make similar predictions, but one has a narrower confidence interval at the same confidence level, then its estimate is more precise, which may be a reason to prefer it, provided that both intervals are computed with a valid method.
- Confidence intervals can help us communicate our predictions and their uncertainty to others. For example, if we are presenting our predictions to a client or a stakeholder, we may want to provide them with a measure of how confident we are in our predictions. A confidence interval can help us convey our uncertainty in a meaningful and interpretable way.

How can we calculate and report confidence intervals for our predictions? In the next sections, we will explore different methods and examples of confidence intervals for different types of predictions, such as point estimates, classification, and regression.

### 3.1. Definition and Interpretation of Confidence Intervals

As introduced above, a confidence interval is a range of values that is likely to contain the true value of a parameter or a prediction, with a certain level of confidence, such as the 95% confidence interval of [168, 172] cm for an average height of 170 cm.

How can we interpret a confidence interval correctly? There are some common misconceptions and pitfalls that we should avoid when dealing with confidence intervals. Here are some of them:

- A confidence interval is not a probability statement about the true value of the parameter or the prediction. It does not mean that the true value has a 95% chance of being in the interval. Rather, it means that if we repeat the experiment or the prediction many times, 95% of the intervals that we obtain will contain the true value.
- A confidence interval is not a measure of the accuracy or the quality of the prediction. It only reflects the uncertainty or the variability of the prediction. A narrow confidence interval does not necessarily mean that the prediction is accurate or good, and a wide confidence interval does not necessarily mean that the prediction is inaccurate or bad.
- A confidence interval is not a fixed or a unique interval. It depends on the data, the method, and the level of confidence that we choose. Different data sets, methods, or confidence levels may produce different confidence intervals for the same parameter or prediction.
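The repeated-sampling interpretation in the first point can be checked with a small simulation. This is a sketch under assumptions: the population is a hypothetical normal distribution with a known mean of 170 cm and standard deviation of 7 cm, and the interval is the usual normal-approximation interval with z = 1.96.

```python
import numpy as np

rng = np.random.default_rng(42)
true_mean, true_sd = 170.0, 7.0    # hypothetical population parameters
n, n_experiments = 100, 2000       # sample size and number of repeated experiments
z = 1.96                           # normal quantile for a 95% interval

# Each row is one "experiment": a fresh sample of n heights.
samples = rng.normal(true_mean, true_sd, size=(n_experiments, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)   # standard error of each mean
lower, upper = means - z * sems, means + z * sems

# Fraction of the intervals that contain the true mean.
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"{coverage:.1%} of the {n_experiments} intervals contain the true mean")
```

Each individual interval either contains the true mean or it does not. The 95% refers to how often the procedure succeeds across repetitions, which is why the printed fraction should be close to 95%.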

How can we choose the appropriate level of confidence for our confidence intervals? There is no definitive answer to this question, as it depends on the context and the purpose of the analysis. However, some general guidelines are:

- The level of confidence should reflect the degree of certainty that we want to have about our parameter or prediction. A higher level of confidence means a higher degree of certainty, but also a wider confidence interval. A lower level of confidence means a lower degree of certainty, but also a narrower confidence interval.
- The level of confidence should be consistent with the significance level that we use for hypothesis testing or statistical inference. A common choice is to use a 95% confidence level, which corresponds to a significance level of α = 0.05, that is, rejecting a null hypothesis when the p-value falls below 0.05.
- The level of confidence should be reported along with the confidence interval, so that the readers can understand the meaning and the limitations of the interval. For example, we can write: “The average height of the population is 170 cm, with a 95% confidence interval of [168, 172].”
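The trade-off between the confidence level and the width of the interval is easy to see numerically. The sketch below uses a normal-approximation interval for a mean, with a hypothetical helper `normal_ci` and a simulated sample of heights. Both are assumptions chosen for illustration.

```python
import numpy as np
from statistics import NormalDist

def normal_ci(sample, level):
    """Normal-approximation confidence interval for the mean of a sample."""
    sample = np.asarray(sample, dtype=float)
    alpha = 1.0 - level                             # significance level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)     # two-sided normal quantile
    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(len(sample))
    return mean - z * sem, mean + z * sem

# Simulated sample of 100 heights in cm (invented data).
rng = np.random.default_rng(1)
sample = rng.normal(170.0, 7.0, size=100)

widths = []
for level in (0.90, 0.95, 0.99):
    lo, hi = normal_ci(sample, level)
    widths.append(hi - lo)
    print(f"{level:.0%} confidence interval: [{lo:.2f}, {hi:.2f}] (width {hi - lo:.2f})")
```

The same sample yields a wider interval as the confidence level increases, which is the price paid for the higher degree of certainty.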

In the next section, we will see some examples of how to calculate and report confidence intervals for different types of predictions, such as point estimates, classification, and regression.

### 3.2. Confidence Interval Methods and Examples

In this section, we will see some examples of how to calculate and report confidence intervals for different types of predictions, such as point estimates, classification, and regression. We will use Python code snippets to illustrate the methods and the results.

Let’s start with point estimates. A point estimate is a single value that estimates a parameter or a prediction, such as the mean, the median, or the mode. For example, if we have a sample of 100 heights from a population, we can calculate the sample mean as a point estimate of the population mean. However, the point estimate may not be very accurate or precise, as it depends on the sample size and the variability of the data. Therefore, we may want to report a confidence interval along with the point estimate, to indicate the range of values that are likely to contain the true value of the parameter or the prediction.

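One of the most common methods for calculating confidence intervals for point estimates is the **bootstrap method**. The bootstrap is a resampling technique: we repeatedly draw random samples with replacement from the original data, each of the same size as the original sample, and calculate the point estimate for each resample. The spread of these bootstrap estimates approximates the sampling variability of the point estimate, and the confidence interval is read off from their percentiles (for a 95% interval, the 2.5th and 97.5th percentiles). The bootstrap can be applied to many kinds of point estimates and does not require assuming a particular distribution for the data, although it does assume that the sample is representative of the population.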

Here is an example of how to use the bootstrap method to calculate a 95% confidence interval for the mean height of a population, using a sample of 100 heights from the data set Weight and Height.

```python
# Import libraries
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Load the data
df = pd.read_csv("weight-height.csv")

# Extract the heights of males
heights = df[df["Gender"] == "Male"]["Height"].values

# Draw a random sample of 100 heights
sample = np.random.choice(heights, size=100, replace=False)

# Define the bootstrap function
def bootstrap(sample, n_bootstraps, alpha):
    # Initialize an empty array to store the bootstrap estimates
    bootstraps = np.zeros(n_bootstraps)
    # Loop over the number of bootstraps
    for i in range(n_bootstraps):
        # Draw a random sample with replacement from the original sample
        bootstrap_sample = resample(sample, replace=True)
        # Calculate the point estimate (mean) for the bootstrap sample
        bootstrap_estimate = np.mean(bootstrap_sample)
        # Store the bootstrap estimate in the array
        bootstraps[i] = bootstrap_estimate
    # Calculate the lower and upper bounds of the confidence interval
    lower = np.percentile(bootstraps, alpha/2 * 100)
    upper = np.percentile(bootstraps, (1 - alpha/2) * 100)
    # Return the bootstrap estimates and the confidence interval
    return bootstraps, lower, upper

# Apply the bootstrap function to the sample with 1000 bootstraps and 0.05 alpha
bootstraps, lower, upper = bootstrap(sample, 1000, 0.05)

# Print the results
print(f"The point estimate (mean) of the sample is {np.mean(sample):.2f} cm.")
print(f"The 95% confidence interval of the mean height of the population is [{lower:.2f}, {upper:.2f}] cm.")
```

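One run of the code produced the following output. Because the sample and the resamples are drawn at random without a fixed seed, the exact numbers will vary from run to run: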

```
The point estimate (mean) of the sample is 175.79 cm.
The 95% confidence interval of the mean height of the population is [174.34, 177.18] cm.
```

We can interpret the results as follows: We are 95% confident that the true mean height of the male population is between 174.34 and 177.18 cm, based on our sample of 100 heights.

The same bootstrap procedure extends to other quantities, such as the accuracy of a classifier or the error of a regression model: we replace the mean with the statistic of interest and resample in the same way.
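For classification, a minimal sketch of the bootstrap idea applied to a classifier's accuracy on a test set might look as follows. The helper `bootstrap_accuracy_ci` is hypothetical, and the labels and predictions are invented toy data (200 test examples, 30 of them misclassified).

```python
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_bootstraps=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for classification accuracy."""
    rng = np.random.default_rng(seed)
    # 1.0 where the prediction is correct, 0.0 where it is wrong.
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    n = len(correct)
    # Each row holds the indices of one resample drawn with replacement.
    idx = rng.integers(0, n, size=(n_bootstraps, n))
    accuracies = correct[idx].mean(axis=1)
    lower, upper = np.percentile(accuracies, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(correct.mean()), float(lower), float(upper)

# Toy test set (invented): 200 examples, 30 of which are misclassified.
y_true = np.array([1] * 100 + [0] * 100)
y_pred = y_true.copy()
y_pred[:30] = 0

acc, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
print(f"Accuracy: {acc:.3f}, 95% confidence interval: [{lower:.3f}, {upper:.3f}]")
```

Note that this interval only reflects the uncertainty from having a finite test set. It does not account for the variability that would come from retraining the model on different training data.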

## 4. Conclusion and Future Directions

In this blog, we have learned how to handle uncertainty in machine learning by using two concepts: calibration and confidence intervals. We have seen what calibration and confidence intervals are, why they are important, and how to measure and report them for different types of predictions. We have also used some Python code snippets to illustrate the methods and the results.

By applying these concepts, we can improve the reliability and trustworthiness of our machine learning models and their outputs. We can also communicate our predictions and their uncertainty to others in a meaningful and interpretable way. We can also make better decisions based on our predictions, by taking into account the uncertainty and the variability of the data and the models.

However, there are still some challenges and limitations that we need to be aware of when handling uncertainty in machine learning. For example:

- Calibration and confidence intervals are not the only ways of handling uncertainty in machine learning. There are other methods and frameworks that can also capture and quantify the uncertainty of the data and the models, such as Bayesian inference, probabilistic programming, and deep generative models.
- Calibration and confidence intervals are not always easy or feasible to calculate or report. Depending on the type and the complexity of the model and the prediction, we may need to use different methods and assumptions to obtain the confidence intervals. Some methods may be computationally expensive or require additional data or information.
- Calibration and confidence intervals are not always sufficient or accurate to represent the uncertainty of the data and the models. There may be other sources or types of uncertainty that are not captured by the calibration or the confidence intervals, such as model misspecification, data quality, or human factors.

Therefore, we need to be careful and critical when handling uncertainty in machine learning, and always check the validity and the robustness of our methods and results. We also need to keep exploring and learning new ways of handling uncertainty in machine learning, as the field is constantly evolving and advancing.

We hope that this blog has been useful and informative for you, and that you have gained some insights and skills on how to handle uncertainty in machine learning. Thank you for reading, and happy learning!