F1 Machine Learning Essentials: Optimizing F1 Score with Threshold Tuning

Learn how to optimize F1 score for binary classification models by finding the best threshold using ROC curve and precision-recall curve.

1. Introduction

In this tutorial, you will learn how to optimize the F1 score for binary classification models by finding the best threshold using the ROC curve and the precision-recall curve.

F1 score is a popular metric for evaluating binary classification models, especially when the data is imbalanced. It is the harmonic mean of precision and recall, two measures that reflect how trustworthy the model's positive predictions are and how many of the actual positive instances it finds.

However, F1 score is not a fixed value for a given model. It depends on the threshold that the model uses to make predictions. The threshold is the probability value that determines whether an instance is classified as positive or negative. By default, most models use a threshold of 0.5, but this may not be the optimal choice for maximizing the F1 score.

So, how can you find the optimal threshold for your model? One way is to use the ROC curve and the precision-recall curve, two graphical tools that show the trade-off between different metrics at different thresholds. By plotting these curves and finding the point that maximizes the F1 score, you can improve your model's performance on the metric that matters for your problem.
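
To make this concrete, here is a minimal sketch (using made-up probability values) of how the choice of threshold changes which instances are labeled positive:

# Made-up predicted probabilities for five instances
import numpy as np

y_prob = np.array([0.20, 0.35, 0.48, 0.55, 0.90])

# Default threshold of 0.5: only the last two instances are labeled positive
print((y_prob >= 0.5).astype(int))  # [0 0 0 1 1]

# Lowering the threshold to 0.3 labels four instances as positive
print((y_prob >= 0.3).astype(int))  # [0 1 1 1 1]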

In this tutorial, you will learn how to:

  • Calculate the F1 score for binary classification models
  • Plot the ROC curve and the precision-recall curve using Python
  • Find the optimal threshold using the F1 score

Before you begin, make sure you have the following prerequisites:

  • A basic understanding of binary classification and its metrics
  • A Python environment with scikit-learn, matplotlib, and numpy installed
  • A sample dataset for binary classification (you can use the one provided in this tutorial or your own data)

Ready to optimize your F1 score? Let’s get started!

2. What is F1 Score and Why is it Important?

F1 score is a metric that combines two important aspects of binary classification: precision and recall. Precision measures how accurate the model is when it predicts the positive class, while recall measures how completely the model captures the actual positive instances.

Precision is calculated as the ratio of true positives (TP) to the sum of true positives and false positives (FP). Recall is calculated as the ratio of true positives to the sum of true positives and false negatives (FN).
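
Written as formulas:

$$precision = \frac{TP}{TP + FP} \qquad recall = \frac{TP}{TP + FN}$$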

F1 score is the harmonic mean of precision and recall, which means it gives more weight to the lower value. It is calculated as follows:

$$F1 = \frac{2 \times precision \times recall}{precision + recall}$$
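
For example, a model with a precision of 0.9 but a recall of only 0.5 gets an F1 score of about 0.64, much closer to the lower of the two values than the arithmetic mean of 0.7:

$$F1 = \frac{2 \times 0.9 \times 0.5}{0.9 + 0.5} = \frac{0.9}{1.4} \approx 0.643$$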

Why is F1 score important? F1 score is especially useful when the data is imbalanced, meaning there are more instances of one class than the other. For example, in a fraud detection problem, the positive class (fraud) is much less frequent than the negative class (non-fraud). In this case, accuracy (the ratio of correct predictions to the total number of predictions) is not a good metric, because it can be high even if the model predicts most instances as negative.
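
To see why, here is a toy sketch with made-up labels: a data set with 95 negatives and 5 positives, and a "model" that simply predicts every instance as negative.

# A toy illustration with made-up labels: 95 negatives and 5 positives
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # a "model" that always predicts negative

print(accuracy_score(y_true, y_pred))                 # 0.95 -- high accuracy despite finding no positives
print(f1_score(y_true, y_pred, zero_division=0))      # 0.0  -- the F1 score exposes the problem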

F1 score, on the other hand, takes into account both precision and recall, which are far more informative for imbalanced data. A high F1 score means that the model finds most of the positive instances without producing many false positives, while a low F1 score means that the model either misses positive instances (low recall) or flags too many negative instances as positive (low precision).

Therefore, F1 score is a good metric to optimize when you want to balance precision and recall for binary classification problems. However, as we will see in the next section, F1 score is not a fixed value for a given model. It depends on the threshold that the model uses to make predictions.

3. How to Calculate F1 Score for Binary Classification

In this section, you will learn how to calculate the F1 score for binary classification models using Python. You will use the scikit-learn library, which provides various functions and classes for machine learning tasks.

To calculate the F1 score, you need two inputs: the true labels and the predicted labels of the instances. The true labels are the actual class values of the instances, while the predicted labels are the class values that the model assigns to the instances based on the threshold.

You can obtain the true labels from the data set that you use for testing or validation. You can obtain the predicted labels by applying the model to the test or validation data and using the predict method. Alternatively, you can use the predict_proba method to get the probability values for each class and then apply a custom threshold to get the predicted labels.
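
As a minimal sketch (assuming a fitted scikit-learn classifier called model and test features X_test, as in the full example below), the two routes look like this:

# Route 1: let the model apply its built-in threshold of 0.5
y_pred_default = model.predict(X_test)

# Route 2: take the probabilities for the positive class and apply your own threshold
y_prob = model.predict_proba(X_test)[:, 1]
y_pred_custom = (y_prob >= 0.4).astype(int)  # 0.4 is just an example value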

Once you have the true labels and the predicted labels, you can use the f1_score function from the sklearn.metrics module to calculate the F1 score. The function takes the true labels and the predicted labels as arguments and returns the F1 score as a floating-point number.

Here is an example of how to calculate the F1 score for a binary classification model using scikit-learn:

# Import the necessary modules
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Load the data set
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Get the predicted labels using the default threshold of 0.5
y_pred = model.predict(X_test)

# Calculate the F1 score
f1 = f1_score(y_test, y_pred)
print(f"The F1 score using the default threshold is {f1:.3f}")

The output of the code is:

The F1 score using the default threshold is 0.957

This means that the model has a high F1 score using the default threshold of 0.5. However, this may not be the optimal threshold for maximizing the F1 score. In the next section, you will learn how to plot the ROC curve and the precision-recall curve to explore the effect of different thresholds on the F1 score.

4. How to Plot ROC Curve and Precision-Recall Curve

In this section, you will learn how to plot the ROC curve and the precision-recall curve for binary classification models using Python. These curves are graphical tools that show the trade-off between different metrics at different thresholds. They can help you visualize how the model performs across the range of possible thresholds and find the optimal threshold for maximizing the F1 score.

The ROC curve stands for the receiver operating characteristic curve. It plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold values. The TPR is the same as recall, while the FPR is the ratio of false positives to the sum of false positives and true negatives (TN). The ROC curve shows how well the model can distinguish between the positive and negative classes. A good model has a high TPR and a low FPR, which means it can identify most of the positive instances and avoid most of the negative instances. The ROC curve also has an area under the curve (AUC) value, which measures the overall performance of the model. A higher AUC value means a better model.
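
Expressed as formulas:

$$TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN}$$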

The precision-recall curve plots the precision against the recall at various threshold values. It shows how well the model can balance precision and recall. A good model has a high precision and a high recall, which means it can accurately and completely identify the positive class. The precision-recall curve also has an average precision (AP) value, which measures the average precision across all thresholds. A higher AP value means a better model.
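
scikit-learn computes the AP value as a weighted sum of the precision at each threshold, where the weight is the increase in recall from the previous threshold:

$$AP = \sum_{n} (R_n - R_{n-1}) \times P_n$$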

You can use the roc_curve and precision_recall_curve functions from the sklearn.metrics module to calculate the TPR, FPR, precision, and recall values for different thresholds. You can also use the auc and average_precision_score functions to calculate the AUC and AP values. To plot the curves, you can use the matplotlib.pyplot module, which provides various functions for creating and customizing graphs.

Here is an example of how to plot the ROC curve and the precision-recall curve for a binary classification model using scikit-learn and matplotlib:

# Import the necessary modules
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, precision_recall_curve, auc, average_precision_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load the data set
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Get the probability values for the positive class
y_prob = model.predict_proba(X_test)[:, 1]

# Calculate the TPR, FPR, precision, and recall values for different thresholds
fpr, tpr, thresholds1 = roc_curve(y_test, y_prob)
precision, recall, thresholds2 = precision_recall_curve(y_test, y_prob)

# Calculate the AUC and AP values
auc_value = auc(fpr, tpr)
ap_value = average_precision_score(y_test, y_prob)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc_value:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC curve for breast cancer classification")
plt.legend()
plt.show()

# Plot the precision-recall curve
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, label=f"Precision-recall curve (AP = {ap_value:.3f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-recall curve for breast cancer classification")
plt.legend()
plt.show()

As you can see, the ROC curve and the precision-recall curve show the trade-off between different metrics at different thresholds. The ROC curve shows that the model achieves a high TPR at a low FPR across most thresholds, which means it distinguishes the positive and negative classes well, and the AUC value close to 1 confirms this. The precision-recall curve shows that the model keeps both precision and recall high over most of the range, but precision starts to drop as recall approaches 1, which happens once the threshold is lowered far enough that more negative instances are misclassified as positive. In other words, the model becomes less accurate as it becomes more complete. The AP value is also high, which indicates a good model.

However, these curves do not tell us which threshold is the best for maximizing the F1 score. In the next section, you will learn how to find the optimal threshold using the F1 score.

5. How to Find the Optimal Threshold Using F1 Score

In this section, you will learn how to find the optimal threshold for maximizing the F1 score using Python. You will use the same data and model from the previous section, and you will use the f1_score function from the sklearn.metrics module to calculate the F1 score for different thresholds.

The idea is to loop through the possible threshold values and calculate the F1 score for each one. Then, you can compare the F1 scores and find the threshold that gives the highest F1 score. This will be the optimal threshold for your model.

Here is an example of how to find the optimal threshold using the F1 score:

# Import the necessary modules
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Load the data set
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Get the probability values for the positive class
y_prob = model.predict_proba(X_test)[:, 1]

# Initialize the variables to store the best threshold and F1 score
best_threshold = 0
best_f1 = 0

# Loop through the possible threshold values from 0 to 1 with a step of 0.01
for threshold in np.arange(0, 1, 0.01):
    # Apply the threshold to the probability values to get the predicted labels
    y_pred = (y_prob >= threshold).astype(int)
    # Calculate the F1 score for the current threshold
    f1 = f1_score(y_test, y_pred)
    # Update the best threshold and F1 score if the current F1 score is higher
    if f1 > best_f1:
        best_threshold = threshold
        best_f1 = f1

# Print the best threshold and F1 score
print(f"The optimal threshold is {best_threshold:.2f}")
print(f"The F1 score using the optimal threshold is {best_f1:.3f}")

The output of the code is:

The optimal threshold is 0.29
The F1 score using the optimal threshold is 0.962

This means that the optimal threshold for maximizing the F1 score is 0.29, which is lower than the default threshold of 0.5. Using this threshold, the F1 score rises to 0.962, compared with 0.957 at the default threshold. This shows that by tuning the threshold you can squeeze additional performance out of the same model.
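
If you prefer to avoid the explicit loop, a common alternative is to compute the F1 score directly from the arrays returned by precision_recall_curve, which evaluates exactly the thresholds the model produces. A minimal sketch, reusing y_test and y_prob from the code above:

# Compute F1 for every threshold returned by the precision-recall curve
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

# precision and recall have one more element than thresholds, so drop the final point
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)

best_index = np.argmax(f1_scores)
print(f"Best threshold: {thresholds[best_index]:.3f}")
print(f"Best F1 score: {f1_scores[best_index]:.3f}")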

The next and final section summarizes the main points of this tutorial and offers some suggestions for further learning.

6. Conclusion

In this tutorial, you have learned how to optimize the F1 score for binary classification models by finding the best threshold using the ROC curve and the precision-recall curve. You have learned how to:

  • Explain what F1 score is and why it is important for imbalanced data
  • Calculate the F1 score for binary classification models using scikit-learn
  • Plot the ROC curve and the precision-recall curve using scikit-learn and matplotlib
  • Find the threshold that maximizes the F1 score using a simple loop

By tuning the threshold, you can improve the performance of your model and balance precision and recall for binary classification problems. You can also use the same approach for other metrics, such as accuracy, specificity, or sensitivity.
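
For example, here is a minimal sketch of the same search optimizing accuracy instead of the F1 score, reusing y_test and y_prob from the earlier code:

# The same threshold search, optimizing accuracy instead of F1
import numpy as np
from sklearn.metrics import accuracy_score

best_threshold, best_acc = 0.0, 0.0
for threshold in np.arange(0, 1, 0.01):
    y_pred = (y_prob >= threshold).astype(int)
    acc = accuracy_score(y_test, y_pred)
    if acc > best_acc:
        best_threshold, best_acc = threshold, acc

print(f"Threshold that maximizes accuracy: {best_threshold:.2f} (accuracy = {best_acc:.3f})")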

We hope you enjoyed this tutorial and found it useful. If you want to learn more about machine learning, classification metrics, and the F1 score, the scikit-learn documentation on model evaluation is a good place to continue.

Thank you for reading and happy learning!
