1. Introduction
In this tutorial, you will learn how to calculate F1 score in Python using sklearn.metrics, a module that provides various performance metrics for machine learning tasks. F1 score is one of the most widely used metrics for evaluating the quality of classification models, especially when dealing with imbalanced data sets.
But what is F1 score exactly, and why is it important? How does it relate to other metrics such as precision and recall? How can you use it to compare different models and improve your model performance? These are some of the questions that we will answer in this tutorial.
By the end of this tutorial, you will be able to:
- Explain what F1 score is and how it is calculated
- Use sklearn.metrics to compute F1 score for binary and multiclass classification problems
- Interpret F1 score and use it to evaluate and improve your model performance
To follow along, you will need a basic understanding of Python and machine learning concepts, such as classification, confusion matrix, and accuracy. You will also need to install scikit-learn, a popular machine learning library for Python. You can use the following command to install it:

```bash
pip install scikit-learn
```
Ready to dive in? Let’s get started!
2. What is F1 Score and Why is it Important?
F1 score is a metric that combines two other metrics: precision and recall. Precision measures how accurate your predictions are, while recall measures how complete your predictions are. F1 score is the harmonic mean of precision and recall, which means it gives more weight to low values. As a result, F1 score is a good measure of how balanced your model is between precision and recall.
But why do we need F1 score in the first place? Why can’t we just use accuracy, which is the ratio of correct predictions to total predictions? The answer is that accuracy can be misleading when dealing with imbalanced data sets, where one class is much more frequent than the other. For example, suppose you have a data set of 1000 emails, where 900 are spam and 100 are not. If you build a model that always predicts spam, you will get an accuracy of 90%, which sounds impressive. However, your model is actually very poor, because it fails to identify any of the non-spam emails. In this case, accuracy is not a good indicator of model quality.
This is where F1 score comes in handy. F1 score takes into account both precision and recall, which are more sensitive to imbalanced data sets. Precision tells you how many of your spam predictions are actually spam, while recall tells you how many of the actual spam emails you have identified. F1 score is the harmonic mean of these two metrics, which means it penalizes low values more than high values. Therefore, F1 score will be low if either precision or recall is low, which reflects the poor performance of the model.
So, how do you calculate F1 score? And how do you use it to evaluate and improve your model performance? These are the questions that we will answer in the next sections. Stay tuned!
2.1. Precision and Recall
Before we can understand F1 score, we need to understand two other metrics: precision and recall. These metrics are based on the concept of a confusion matrix, which is a table that shows the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) for a given classification problem. Here is an example of a confusion matrix for a binary classification problem, where the positive class is spam and the negative class is not spam:
| | Predicted Spam | Predicted Not Spam |
|---|---|---|
| Actual Spam | TP | FN |
| Actual Not Spam | FP | TN |
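In scikit-learn you can build this table directly with the confusion_matrix function. Here is a minimal sketch using a tiny made-up label list; note that by default the rows follow sorted label order, so we pass labels=[1, 0] to put the positive class (spam) first and match the layout above:

```python
from sklearn.metrics import confusion_matrix

# Tiny made-up example: 1 = spam, 0 = not spam
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

# labels=[1, 0] orders rows/columns as [spam, not spam],
# so the result is laid out as [[TP, FN], [FP, TN]]
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
# [[3 1]
#  [1 1]]
```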
Precision is the ratio of true positives to all predicted positives, which means it measures how accurate your predictions are. The formula for precision is:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall is the ratio of true positives to all actual positives, which means it measures how complete your predictions are. The formula for recall is:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Let’s see an example of how to calculate precision and recall for a binary classification problem. Suppose you have a data set of 1000 emails, where 900 are spam and 100 are not. You build a model that predicts 800 emails as spam and 200 as not spam. Out of the 800 predicted spam emails, 700 are actually spam and 100 are not. Out of the 200 predicted not spam emails, all 200 are actually spam, so the model never correctly identifies a non-spam email. The confusion matrix for this problem is:
| | Predicted Spam | Predicted Not Spam |
|---|---|---|
| Actual Spam | 700 | 200 |
| Actual Not Spam | 100 | 0 |
The precision and recall for this problem are:
$$\text{Precision} = \frac{700}{700 + 100} = 0.875$$
$$\text{Recall} = \frac{700}{700 + 200} = 0.778$$
This means that 87.5% of the emails your model flags as spam really are spam, and that it catches 77.8% of all the spam emails. But how can you combine these two metrics into one? This is where F1 score comes in. In the next section, we will see how to calculate F1 score using the harmonic mean of precision and recall.
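Before moving on, here is a minimal pure-Python sketch that reproduces these numbers from the raw confusion-matrix counts:

```python
# Confusion-matrix counts from the example above
tp, fp, fn, tn = 700, 100, 200, 0

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f'Precision: {precision:.3f}')  # 0.875
print(f'Recall: {recall:.3f}')        # 0.778
```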
2.2. Harmonic Mean and F1 Score
Now that we know what precision and recall are, how can we combine them into a single metric? One way to do that is to use the arithmetic mean, which is the simple average of the two numbers. However, the arithmetic mean hides imbalance between them: a very high value can mask a very low one. For example, if precision is 0.9 and recall is 0.1, the arithmetic mean is 0.5, exactly the same as if precision and recall were both 0.5, so it fails to capture the fact that the model has a very high precision but a very low recall.
A better way to combine precision and recall is to use the harmonic mean, which is a type of average that gives more weight to low values. The harmonic mean of two numbers is the inverse of the arithmetic mean of their inverses. The formula for the harmonic mean of precision and recall is:
$$\text{Harmonic Mean} = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}}$$
The harmonic mean of precision and recall is also known as the F1 score, which is the metric that we are interested in. The formula for F1 score is:
$$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
F1 score is a good measure of how balanced the model is between precision and recall, because it penalizes low values more than high values. Therefore, F1 score will be low if either precision or recall is low, which reflects the poor performance of the model. F1 score ranges from 0 to 1, where 0 is the worst and 1 is the best.
Let’s see an example of how to calculate F1 score using the harmonic mean of precision and recall. Suppose you have the same data set and model as in the previous section, where precision is 0.875 and recall is 0.778. The F1 score for this problem is:
$$\text{F1 Score} = \frac{2 \times 0.875 \times 0.778}{0.875 + 0.778} = 0.824$$
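The same arithmetic in a couple of lines of Python (plain arithmetic, no library needed):

```python
precision = 0.875
recall = 0.778

# F1 score is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f'F1 score: {f1:.3f}')  # 0.824
```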
This means that your model achieves a reasonably balanced trade-off between precision and recall, with an F1 score of 0.824. But how can you calculate F1 score in Python using sklearn.metrics? And how can you use F1 score to compare different models and improve your model performance? These are the questions that we will answer in the next sections. Keep reading!
3. How to Calculate F1 Score in Python using sklearn.metrics
In this section, we will see how to calculate F1 score in Python using sklearn.metrics, a module that provides various performance metrics for machine learning tasks. sklearn.metrics has a function called f1_score that takes the true labels and the predicted labels of a classification problem as inputs and returns the F1 score as output. The syntax of the f1_score function is:
```python
from sklearn.metrics import f1_score

# Default values shown; other parameters such as sample_weight are omitted here
f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', zero_division='warn')
```
The parameters of the f1_score function are:
- y_true: The true labels of the classification problem, as a 1D array or list.
- y_pred: The predicted labels of the classification problem, as a 1D array or list.
- average: The method of averaging the F1 scores for each class, if the problem is multiclass or multilabel. The possible values are:
- None: Returns the F1 scores for each class as an array.
- ‘binary’: Returns the F1 score for the positive class only, if the problem is binary. The positive class is the label given by the pos_label parameter, which defaults to 1.
- ‘micro’: Returns the F1 score calculated globally by counting the total true positives, false negatives, and false positives.
- ‘macro’: Returns the F1 score calculated as the unweighted mean of the F1 scores for each class.
- ‘weighted’: Returns the F1 score calculated as the weighted mean of the F1 scores for each class, where the weight is the number of true instances for each class.
- labels: The list of labels to include in the F1 score calculation, if the problem is multiclass or multilabel. If None, all labels in y_true and y_pred are used.
- zero_division: The value to return when a division by zero occurs, for example when there are no positive labels and no positive predictions at all. The possible values are 0, 1, or ‘warn’ (the default); ‘warn’ behaves like 0 but also emits a warning. A short illustration follows this list.
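Here is a minimal sketch of the zero_division behaviour, relying on the documented rule that the fallback value is used when there are no positive labels and no positive predictions:

```python
from sklearn.metrics import f1_score

# No positive labels and no positive predictions: the F1 denominator is zero
y_true = [0, 0, 0, 0]
y_pred = [0, 0, 0, 0]

print(f1_score(y_true, y_pred, zero_division=0))  # 0.0
print(f1_score(y_true, y_pred, zero_division=1))  # 1.0
```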
Let’s see some examples of how to use the f1_score function for different types of classification problems.
3.1. Binary Classification Example
In this section, we will see how to use the f1_score function for a binary classification problem, where the positive class is spam (label 1) and the negative class is not spam (label 0). We will use a small toy data set of 20 emails, so you can check every number by hand. We will also compare the F1 score with the accuracy, which is the ratio of correct predictions to total predictions.
To use the f1_score function, we need to import it from sklearn.metrics and pass the true labels and the predicted labels as inputs. We also need to specify the average parameter as ‘binary’, since we are dealing with a binary classification problem. The code for calculating the F1 score is:
```python
from sklearn.metrics import f1_score

# The true labels of the classification problem
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # 10 spam emails
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # 10 not spam emails

# The predicted labels of the classification problem
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0,  # 7 predicted as spam, 3 as not spam
          1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 5 predicted as spam, 5 as not spam

# Calculate the F1 score for the positive class (spam)
f1 = f1_score(y_true, y_pred, average='binary')

# Print the F1 score
print(f'F1 score: {f1:.3f}')
```
The output of the code is:
F1 score: 0.636
This means that the F1 score for the positive class (spam) is 0.636. For these labels, precision is 7/12 ≈ 0.583 (only 7 of the 12 emails predicted as spam really are spam) and recall is 7/10 = 0.700 (7 of the 10 actual spam emails were found). The F1 score lies between the two, pulled toward the lower value, reflecting the fact that the model produces more false positives (5) than false negatives (3).
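If you want to verify those precision and recall numbers, sklearn.metrics also provides precision_score and recall_score, which take the same inputs (reusing y_true and y_pred from above):

```python
from sklearn.metrics import precision_score, recall_score

# Precision and recall for the positive class (spam)
precision = precision_score(y_true, y_pred)  # 7 / 12
recall = recall_score(y_true, y_pred)        # 7 / 10

print(f'Precision: {precision:.3f}')  # 0.583
print(f'Recall: {recall:.3f}')        # 0.700
```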
To calculate the accuracy, we can use the accuracy_score function from sklearn.metrics, which takes the same inputs as the f1_score function. The code for calculating the accuracy is:
```python
from sklearn.metrics import accuracy_score

# The true labels and the predicted labels are the same as before

# Calculate the accuracy
acc = accuracy_score(y_true, y_pred)

# Print the accuracy
print(f'Accuracy: {acc:.3f}')
```
The output of the code is:
Accuracy: 0.600
This means that the accuracy is 0.600, which is lower than the F1 score (0.636). The two metrics disagree because accuracy also rewards the correctly predicted non-spam emails (the true negatives), while the F1 score only looks at how well the positive class (spam) is handled. On imbalanced data sets this difference becomes much more pronounced, which is why F1 score, which takes both precision and recall into account, is usually the more informative metric to report.
In the next section, we will see how to use the f1_score function for a multiclass classification problem, where there are more than two classes to predict.
3.2. Multiclass Classification Example
In this section, we will see how to use the f1_score function for a multiclass classification problem, where there are more than two classes to predict. We will use a data set of iris flowers, where the task is to classify each flower into one of three species: setosa, versicolor, or virginica. We will use a simple logistic regression model to make the predictions and compare the F1 score with the accuracy.
To use the f1_score function, we need to import it from sklearn.metrics and pass the true labels and the predicted labels as inputs. We also need to specify the average parameter as one of the possible values: None, ‘micro’, ‘macro’, or ‘weighted’. The choice of the average parameter depends on how we want to aggregate the F1 scores for each class. We will see the difference between these options in the following examples.
First, we need to load the data set and split it into training and testing sets. We can use the load_iris function from sklearn.datasets to load the data set, and the train_test_split function from sklearn.model_selection to split the data set. The code for loading and splitting the data set is:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris data set
iris = load_iris()

# Get the features and the labels
X = iris.data
y = iris.target

# Split the data set into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Next, we need to train a logistic regression model on the training set and make predictions on the testing set. We can use the LogisticRegression class from sklearn.linear_model to create and fit the model, and the predict method to make predictions. The code for training and testing the model is:
```python
from sklearn.linear_model import LogisticRegression

# Create and fit a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)
```
Now, we can use the f1_score function to calculate the F1 score for the multiclass classification problem. We will try different values of the average parameter and see how they affect the result.
If we set the average parameter to None, the f1_score function will return an array of F1 scores for each class. The code for calculating the F1 score with average=None is:
```python
from sklearn.metrics import f1_score

# Calculate the F1 score for each class
f1 = f1_score(y_test, y_pred, average=None)

# Print the F1 score
print(f'F1 score: {f1}')
```
The output of the code is:
F1 score: [1. 0.94117647 0.91666667]
This means that the F1 score for the setosa class is 1, which is perfect. The F1 score for the versicolor class is 0.941, which is very good. The F1 score for the virginica class is 0.917, which is also good. However, this output does not give us a single value to compare the model performance with other models or metrics.
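To see which score belongs to which species, you can pair the array with iris.target_names; the per-class scores are returned in sorted label order, which matches the order of target_names. A small sketch reusing iris and f1 from above:

```python
# Map each per-class F1 score back to its species name
for name, score in zip(iris.target_names, f1):
    print(f'{name}: {score:.3f}')
```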
If we set the average parameter to ‘micro’, the f1_score function will return the F1 score calculated globally by counting the total true positives, false negatives, and false positives. The code for calculating the F1 score with average=’micro’ is:
```python
from sklearn.metrics import f1_score

# Calculate the F1 score globally
f1 = f1_score(y_test, y_pred, average='micro')

# Print the F1 score
print(f'F1 score: {f1:.3f}')
```
The output of the code is:
F1 score: 0.967
This means that the F1 score for the multiclass classification problem is 0.967, which is very high. Keep in mind that for single-label multiclass problems, the micro-averaged F1 score is equal to the overall accuracy, so it is dominated by the most frequent classes. On an imbalanced data set, a model that handles only the majority class well can still obtain a high F1 score with average=’micro’, even though it performs poorly on the rare classes.
If we set the average parameter to ‘macro’, the f1_score function will return the F1 score calculated as the unweighted mean of the F1 scores for each class. The code for calculating the F1 score with average=’macro’ is:
```python
from sklearn.metrics import f1_score

# Calculate the F1 score as the unweighted mean over classes
f1 = f1_score(y_test, y_pred, average='macro')

# Print the F1 score
print(f'F1 score: {f1:.3f}')
```
The output of the code is:
F1 score: 0.953
This means that the F1 score for the multiclass classification problem is 0.953, which is slightly lower than the F1 score with average=’micro’. This is because average=’macro’ gives equal weight to each class, regardless of its frequency, so the lower scores of versicolor and virginica pull the average down. For the same reason, macro averaging is stricter on imbalanced data sets: a model that does well on the frequent classes but poorly on a rare class will get a low F1 score with average=’macro’, which is often exactly the behaviour you want the metric to expose.
If we set the average parameter to ‘weighted’, the f1_score function will return the F1 score calculated as the weighted mean of the F1 scores for each class, where the weight is the number of true instances for each class. The code for calculating the F1 score with average=’weighted’ is:
```python
from sklearn.metrics import f1_score

# Calculate the F1 score as the support-weighted mean over classes
f1 = f1_score(y_test, y_pred, average='weighted')

# Print the F1 score
print(f'F1 score: {f1:.3f}')
```
The output of the code is:
F1 score: 0.967
This means that the F1 score for the multiclass classification problem is 0.967, which is the same as the F1 score with average=’micro’. This is because average=’weighted’ weights each class’s F1 score by its number of true instances, and the iris classes are roughly equally represented, so the weighted mean ends up very close to the micro-averaged score. However, this averaging may not be appropriate for imbalanced data sets, where the minority classes may be more important or relevant than the majority classes: a model that predicts the majority class very well and the minority classes very poorly can still obtain a high F1 score with average=’weighted’, even though it is a bad model.
As you can see, the choice of the average parameter can affect the result of the F1 score calculation, and there is no definitive answer to which one is the best. It depends on the nature of the problem, the data set, and the objective of the model. You should choose the average parameter that best suits your needs and goals.
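If you find yourself computing several of these averages at once, sklearn.metrics also provides classification_report, which prints per-class precision, recall, and F1 together with the macro and weighted averages in a single call. A small sketch reusing y_test, y_pred, and iris from above:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, F1 and the macro/weighted averages in one table
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```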
In the next section, we will see how to interpret F1 score and use it to compare different models and improve model performance.
4. How to Interpret F1 Score and Improve Model Performance
In this section, we will see how to interpret F1 score and use it to compare different models and improve model performance. F1 score is a metric that combines precision and recall, which are two important aspects of classification quality. Precision measures how accurate your predictions are, while recall measures how complete your predictions are. F1 score is the harmonic mean of precision and recall, which means it gives more weight to low values. As a result, F1 score is a good measure of how balanced your model is between precision and recall.
But how can you use F1 score to evaluate and improve your model performance? Here are some tips and tricks:
- Compare F1 score with accuracy: Accuracy is the ratio of correct predictions to total predictions, which is a simple and intuitive metric. However, accuracy can be misleading when dealing with imbalanced data sets, where one class is much more frequent than the other. In this case, F1 score is a better metric to use, as it takes into account both precision and recall, which are more sensitive to imbalanced data sets. You should compare F1 score with accuracy to see if your model is biased towards the majority class or not.
- Compare F1 score with other models: F1 score is a useful metric to compare different models and choose the best one for your problem. You can use F1 score to compare models with different algorithms, parameters, or features, and see which one has the highest F1 score. You can also use F1 score to compare models with different data sets, and see which one generalizes better to new data. However, you should also consider other factors, such as the complexity, speed, and interpretability of the models, before making the final decision.
- Improve F1 score by tuning precision and recall: F1 score is a function of precision and recall, which means you can improve F1 score by improving either precision or recall, or both. However, there is often a trade-off between precision and recall, which means increasing one may decrease the other. Therefore, you should tune precision and recall according to your needs and goals. For example, if you want to minimize false positives, you should increase precision. If you want to minimize false negatives, you should increase recall. You can use different techniques, such as feature selection, regularization, thresholding, or resampling, to tune precision and recall; the sketch after this list illustrates the thresholding idea.
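Here is a minimal, hypothetical sketch of the thresholding technique. It assumes a fitted binary classifier clf that implements predict_proba, together with held-out data X_val and y_val (these names are placeholders, not objects defined earlier in this tutorial), and shows how moving the decision threshold trades precision against recall:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# clf, X_val, y_val are hypothetical placeholders: any fitted binary classifier
# with predict_proba, plus its held-out features and 0/1 labels.
proba = clf.predict_proba(X_val)[:, 1]  # predicted probability of the positive class

for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)  # apply the custom decision threshold
    p = precision_score(y_val, y_pred, zero_division=0)
    r = recall_score(y_val, y_pred, zero_division=0)
    f = f1_score(y_val, y_pred, zero_division=0)
    print(f'threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}  F1={f:.3f}')
```

Lowering the threshold typically raises recall at the cost of precision, while raising it does the opposite; picking the threshold that maximizes F1 on a validation set is a simple way to rebalance the model without retraining it.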
By using these tips and tricks, you can use F1 score to evaluate and improve your model performance. F1 score is a powerful and versatile metric that can help you achieve a balanced and robust classification model.
In the next and final section, we will summarize the main points of this tutorial and provide some resources for further learning.
5. Conclusion
In this tutorial, you have learned how to calculate F1 score in Python using sklearn.metrics, a module that provides various performance metrics for machine learning tasks. F1 score is one of the most widely used metrics for evaluating the quality of classification models, especially when dealing with imbalanced data sets.
You have learned what F1 score is and how it is calculated, how to use sklearn.metrics to compute F1 score for binary and multiclass classification problems, how to interpret F1 score and use it to compare different models and improve model performance, and some tips and tricks to use F1 score effectively.
By following this tutorial, you have gained a valuable skill that can help you achieve a balanced and robust classification model. F1 score is a powerful and versatile metric that can help you measure and optimize your model performance.
We hope you enjoyed this tutorial and found it useful. If you want to learn more about F1 score and other performance metrics, you can check out the following resources:
- Classification metrics in scikit-learn: A comprehensive guide to the different metrics available in sklearn.metrics for classification problems.
- F1 score on Wikipedia: A detailed explanation of the F1 score, its properties, and its variations.
- Accuracy, Precision, Recall or F1?: A blog post that explains the differences and trade-offs between these common metrics.
Thank you for reading this tutorial and happy coding!