Learn how to use sampling techniques such as SMOTE and NearMiss to optimize F1 score for imbalanced data in machine learning.
1. Introduction
In this tutorial, you will learn how to use sampling techniques to optimize F1 score for imbalanced data in machine learning.
Imbalanced data is a common problem in machine learning, where the classes are not equally represented in the dataset. For example, in a binary classification problem, you may have 90% of the samples belonging to the positive class and only 10% belonging to the negative class. This can cause problems for many machine learning algorithms, which may tend to favor the majority class and ignore the minority class.
One way to deal with imbalanced data is to use sampling techniques, which are methods that modify the distribution of the classes in the dataset. There are two main types of sampling techniques: oversampling and undersampling. Oversampling methods increase the number of samples in the minority class, while undersampling methods reduce the number of samples in the majority class. By doing so, sampling techniques aim to create a more balanced dataset that can improve the performance of machine learning models.
But how do you measure the performance of machine learning models on imbalanced data? One common metric is F1 score, which is the harmonic mean of precision and recall. Precision measures how many of the model's positive predictions are actually positive, while recall measures how many of the actual positive samples the model identifies. F1 score combines both precision and recall into a single value that ranges from 0 to 1, where 1 is the best and 0 is the worst. F1 score is especially useful for imbalanced data, as it captures the trade-off between precision and recall and reflects the overall quality of the model.
In this tutorial, you will learn how to use two popular sampling techniques: SMOTE and NearMiss. SMOTE stands for Synthetic Minority Oversampling Technique, and it is an oversampling method that creates synthetic samples in the minority class by interpolating between existing samples. NearMiss is an undersampling method that selects samples in the majority class that are close to the samples in the minority class, based on a distance metric. You will see how to apply these techniques in Python using the imbalanced-learn library, which is a collection of tools for imbalanced data. You will also learn how to evaluate F1 score and compare different models using the scikit-learn library, which is a popular framework for machine learning.
Are you ready to optimize F1 score with sampling techniques? Let’s get started!
2. What is F1 Score and Why is it Important?
In this section, you will learn what F1 score is and why it is important for evaluating machine learning models on imbalanced data.
F1 score is a metric that combines precision and recall into a single value. Precision measures how many of the model's positive predictions are actually positive, while recall measures how many of the actual positive samples the model correctly identifies. Precision and recall are calculated as follows:
```
Precision = True Positives / (True Positives + False Positives)
Recall    = True Positives / (True Positives + False Negatives)
```
F1 score is the harmonic mean of precision and recall, which means it gives more weight to the lower value. F1 score is calculated as follows:
```
F1 score = 2 * (Precision * Recall) / (Precision + Recall)
```
F1 score ranges from 0 to 1, where 1 is the best and 0 is the worst. A high F1 score means that the model has both high precision and high recall, which means it can accurately predict the positive class and also capture most of the positive samples. A low F1 score means that the model has either low precision or low recall, which means it either makes many false positive predictions or misses many positive samples.
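To see how the harmonic mean punishes the lower value, consider a quick worked example: with a precision of 1.0 and a recall of 0.1, the arithmetic mean would be 0.55, but F1 is only about 0.18.

```python
# Worked example: the harmonic mean is dominated by the lower value
precision, recall = 1.0, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = 2 * (precision * recall) / (precision + recall)

print(arithmetic_mean)  # 0.55
print(f1)               # 0.1818... -- dragged down by the low recall
```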
But why is F1 score important for imbalanced data? The reason is that other metrics, such as accuracy, can be misleading when the classes are not equally represented. Accuracy measures how many predictions the model got right, regardless of the class. Accuracy is calculated as follows:
```
Accuracy = (True Positives + True Negatives) / Total Samples
```
For example, suppose you have a binary classification problem with 90% positive samples and 10% negative samples. If you have a model that always predicts the positive class, it will have an accuracy of 90%, which seems very good. However, this model is actually useless for the negative class: it never predicts it, so its recall for the negative class is zero (and its precision for that class is undefined, conventionally reported as 0). This model will have an F1 score of 0 for the negative class, which reflects its poor performance.
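Here is a minimal sketch of that always-predict-positive model, using toy string labels to stand in for the 90/10 split described above:

```python
# Minimal sketch: an always-positive model on a 90/10 imbalanced dataset
from sklearn.metrics import accuracy_score, f1_score

y_true = ["pos"] * 90 + ["neg"] * 10  # 90% positive, 10% negative
y_pred = ["pos"] * 100                # the model always predicts positive

print(accuracy_score(y_true, y_pred))             # 0.9 -- looks impressive
# F1 for the negative class: scikit-learn warns that precision is
# undefined (no negative predictions) and reports the score as 0
print(f1_score(y_true, y_pred, pos_label="neg"))  # 0.0
```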
Therefore, F1 score is a better metric than accuracy for imbalanced data, because it can capture the trade-off between precision and recall and reflect the overall quality of the model. F1 score is especially useful when you care about both precision and recall, such as in fraud detection, spam filtering, or medical diagnosis.
Now that you know what F1 score is and why it is important, you may wonder how to optimize it for your machine learning models. One way to do that is to use sampling techniques, which are methods that modify the distribution of the classes in the dataset. In the next section, you will learn about two types of sampling techniques: oversampling and undersampling.
3. Sampling Techniques for Imbalanced Data
In this section, you will learn about two types of sampling techniques for imbalanced data: oversampling and undersampling. These are methods that modify the distribution of the classes in the dataset to create a more balanced dataset. By doing so, sampling techniques can improve the performance of machine learning models and optimize F1 score.
Oversampling methods increase the number of samples in the minority class, which is the class with fewer samples. This can help the model to learn more features and patterns from the minority class and reduce the bias towards the majority class. However, oversampling methods can also introduce some drawbacks, such as overfitting and duplication. Overfitting means that the model becomes too specific to the training data and fails to generalize well to new data. Duplication means that the model sees the same or very similar samples multiple times, which can reduce the diversity and quality of the data.
Undersampling methods reduce the number of samples in the majority class, which is the class with more samples. This can help the model to focus more on the minority class and reduce the noise and complexity of the data. However, undersampling methods can also introduce some drawbacks, such as underfitting and information loss. Underfitting means that the model becomes too simple and fails to capture the complexity and variability of the data. Information loss means that the model discards some potentially useful samples from the majority class, which can reduce the representativeness and completeness of the data.
Therefore, sampling techniques have both advantages and disadvantages, and choosing the best technique depends on the characteristics and objectives of the problem. Some factors to consider are the size and quality of the data, the type and complexity of the model, and the evaluation metric and goal.
In the next two subsections, you will learn about two popular sampling techniques: SMOTE and NearMiss. SMOTE is an oversampling technique that creates synthetic samples in the minority class by interpolating between existing samples. NearMiss is an undersampling technique that selects samples in the majority class that are close to the samples in the minority class, based on a distance metric. You will see how these techniques work and what their pros and cons are.
3.1. Oversampling Methods
In this subsection, you will learn about oversampling methods, which are sampling techniques that increase the number of samples in the minority class. You will see how oversampling methods can help you optimize F1 score for imbalanced data, and survey some of the most common and effective oversampling methods.
Oversampling methods can improve the performance of machine learning models on imbalanced data by creating a more balanced dataset. By increasing the number of samples in the minority class, oversampling methods can help the model to learn more features and patterns from the minority class and reduce the bias towards the majority class. This can result in higher precision and recall for the minority class, and thus higher F1 score.
However, oversampling methods also have some drawbacks, such as overfitting and duplication. Overfitting means that the model becomes too specific to the training data and fails to generalize well to new data. Duplication means that the model sees the same or very similar samples multiple times, which can reduce the diversity and quality of the data. Therefore, oversampling methods should be used with caution and evaluation, and not blindly applied to any imbalanced data problem.
There are many oversampling methods available, but some of the most popular and effective ones are:
- Random Oversampling: This is the simplest oversampling method, which randomly duplicates samples from the minority class until the desired balance is achieved. This method is easy to implement and fast to execute, but it can also introduce a lot of duplication and noise in the data.
- SMOTE: This stands for Synthetic Minority Oversampling Technique, and it is an oversampling method that creates synthetic samples in the minority class by interpolating between existing samples (a minimal sketch of this interpolation step follows this list). This method can create more diversity and quality in the data, but it can also introduce some artificiality and complexity in the data.
- ADASYN: This stands for Adaptive Synthetic Sampling, and it is an oversampling method that creates synthetic samples in the minority class by adapting to the density and distribution of the data. This method can create more realistic and representative samples in the data, but it can also be computationally expensive and sensitive to outliers.
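To make the SMOTE idea concrete, here is a minimal sketch of the interpolation step it performs, using made-up coordinates for a minority sample and one of its nearest minority-class neighbors. This illustrates the core idea only, not the imbalanced-learn implementation:

```python
# SMOTE's core step: place a synthetic point at a random position on the
# line segment between a minority sample and a minority-class neighbor
import numpy as np

rng = np.random.default_rng(42)
x_i = np.array([1.0, 2.0])   # a minority-class sample (hypothetical values)
x_nn = np.array([2.0, 3.0])  # one of its nearest minority-class neighbors

lam = rng.random()                      # random interpolation factor in [0, 1)
x_synthetic = x_i + lam * (x_nn - x_i)  # lies between x_i and x_nn
print(x_synthetic)
```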
In the next subsection, you will learn how to apply one of these oversampling methods, SMOTE, in Python using the imbalanced-learn library. You will also learn how to compare the results of SMOTE with the original imbalanced data and the random oversampling method.
3.2. Undersampling Methods
In this subsection, you will learn about undersampling methods, which are sampling techniques that reduce the number of samples in the majority class. You will see how undersampling methods can help you optimize F1 score for imbalanced data, and survey some of the most common and effective undersampling methods.
Undersampling methods can improve the performance of machine learning models on imbalanced data by creating a more balanced dataset. By reducing the number of samples in the majority class, undersampling methods can help the model to focus more on the minority class and reduce the noise and complexity of the data. This can result in higher precision and recall for the minority class, and thus higher F1 score.
However, undersampling methods also have some drawbacks, such as underfitting and information loss. Underfitting means that the model becomes too simple and fails to capture the complexity and variability of the data. Information loss means that the model discards some potentially useful samples from the majority class, which can reduce the representativeness and completeness of the data. Therefore, undersampling methods should be used with caution and evaluation, and not blindly applied to any imbalanced data problem.
There are many undersampling methods available, but some of the most popular and effective ones are:
- Random Undersampling: This is the simplest undersampling method, which randomly deletes samples from the majority class until the desired balance is achieved. This method is easy to implement and fast to execute, but it can also introduce a lot of information loss and variability in the data.
- NearMiss: This is an undersampling method that selects samples in the majority class that are close to the samples in the minority class, based on a distance metric (a sketch of this selection rule follows this list). By keeping the majority samples nearest the decision boundary it preserves the most informative examples, but it can also introduce bias and distortion, since the retained samples are no longer representative of the full majority class.
- Tomek Links: This is an undersampling method that removes majority-class samples that form a Tomek link with a minority sample, that is, pairs of opposite-class samples that are each other's nearest neighbors. This method can create more clarity and separation between the classes, but it can also be computationally expensive and ineffective for highly imbalanced data.
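For intuition, here is a rough sketch of the NearMiss-1 selection rule on made-up 2D points: keep the majority samples whose average distance to their k nearest minority samples is smallest. This is an illustration of the rule only, not the imbalanced-learn implementation:

```python
# Rough sketch of NearMiss-1: keep the majority samples whose mean
# distance to their k nearest minority samples is smallest
import numpy as np

rng = np.random.default_rng(0)
majority = rng.normal(0.0, 1.0, size=(20, 2))  # hypothetical majority points
minority = rng.normal(2.0, 0.5, size=(5, 2))   # hypothetical minority points
k, n_keep = 3, 5

# Pairwise distances from each majority point to each minority point
dists = np.linalg.norm(majority[:, None, :] - minority[None, :, :], axis=2)
# Mean distance to the k closest minority points, per majority sample
mean_knn = np.sort(dists, axis=1)[:, :k].mean(axis=1)
# Keep the n_keep majority samples closest to the minority class
kept = majority[np.argsort(mean_knn)[:n_keep]]
print(kept)
```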
In the next section, you will learn how to apply one of these undersampling methods, NearMiss, in Python using the imbalanced-learn library. You will also learn how to compare the results of NearMiss with the original imbalanced data and the random undersampling method.
4. How to Apply SMOTE and NearMiss in Python
In this section, you will learn how to apply SMOTE and NearMiss in Python using the imbalanced-learn library. You will also learn how to compare the results of these sampling techniques with the original imbalanced data and the random oversampling and undersampling methods.
To demonstrate how to apply SMOTE and NearMiss in Python, you will use a sample dataset from the UCI Machine Learning Repository, which contains information about a bank marketing campaign. The goal is to predict whether a customer will subscribe to a term deposit or not, based on various attributes such as age, job, education, marital status, etc. This is a binary classification problem, where the positive class (subscribed) is the minority class and the negative class (not subscribed) is the majority class.
The first step is to import the necessary libraries and load the dataset. You will use pandas to read the data from a CSV file and store it in a dataframe. You will also use numpy to perform some numerical operations and matplotlib to plot some graphs. You will use scikit-learn to split the data into training and testing sets, and to train and evaluate some machine learning models. You will use imbalanced-learn to apply the sampling techniques and to calculate the F1 score.
The following code shows how to import the libraries and load the dataset:
```python
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, NearMiss

# Load the dataset
df = pd.read_csv("bank.csv", sep=";")
print(df.head())
```
The output of the code is as follows:
```
   age          job  marital  education default  balance housing loan  \
0   30   unemployed  married    primary      no     1787      no   no
1   33     services  married  secondary      no     4789     yes  yes
2   35   management   single   tertiary      no     1350     yes   no
3   30   management  married   tertiary      no     1476     yes  yes
4   59  blue-collar  married  secondary      no        0     yes   no

   contact  day month  duration  campaign  pdays  previous poutcome   y
0  unknown   19   oct        79         1     -1         0  unknown  no
1  unknown   11   may       220         1    339         4  failure  no
2  unknown   16   apr       185         1    330         1  failure  no
3  unknown    3   jun       199         4     -1         0  unknown  no
4  unknown    5   may       226         1     -1         0  unknown  no
```
As you can see, the dataset has 16 feature columns and one target column. The target variable is called "y" and it has two possible values: "yes" or "no". You can check the distribution of the classes in the target variable by using the value_counts() method:
```python
# Check the distribution of the classes
print(df["y"].value_counts())
```
The output of the code is as follows:
```
no     4000
yes     521
Name: y, dtype: int64
```
As you can see, the dataset is imbalanced, with 4000 samples in the negative class and 521 samples in the positive class. This means that the positive class is the minority class and the negative class is the majority class.
The next step is to prepare the features and split the data into training and testing sets. Because both the samplers and the logistic regression model require numeric inputs, you will first one-hot encode the categorical features with pandas' get_dummies() function. You will use 80% of the data for training and 20% for testing, and the stratify parameter to ensure that the class distribution is preserved in both sets. The following code shows how to do this:
```python
# Split the data into features and target, one-hot encoding the
# categorical features so the samplers and model get numeric inputs
X = pd.get_dummies(df.drop("y", axis=1))
y = df["y"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Check the shape of the sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```
The printed shapes show that the training set has 3616 samples and the testing set has 905 samples. Note that one-hot encoding expands the original 16 feature columns into one column per category level, so the feature matrices will have considerably more than 16 columns.
The next step is to apply the sampling techniques to the training set. You will use four different techniques: random oversampling, SMOTE, random undersampling, and NearMiss. You will use the fit_resample() method from the imbalanced-learn library to apply each technique and create a new balanced dataset. (One caveat: plain SMOTE interpolates the one-hot encoded dummy columns and can produce fractional values; imbalanced-learn also offers SMOTENC for mixed data, but plain SMOTE keeps this tutorial simple.) The following code shows how to apply the sampling techniques:
```python
# Apply random oversampling
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)

# Apply SMOTE
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

# Apply random undersampling
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)

# Apply NearMiss (version 1 by default)
nm = NearMiss()
X_nm, y_nm = nm.fit_resample(X_train, y_train)
```
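Before moving on, it is worth checking that each technique actually produced a balanced target. Here is a short sanity check; it assumes fit_resample returns pandas Series when given pandas input, which recent versions of imbalanced-learn do:

```python
# Sanity check: each resampled target should now be balanced
for name, y_res in [("Random Oversampling", y_ros), ("SMOTE", y_smote),
                    ("Random Undersampling", y_rus), ("NearMiss", y_nm)]:
    print(name, y_res.value_counts().to_dict())
```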
The next step is to compare the results of the sampling techniques with the original imbalanced data. You will use two ways to compare the results: by plotting the class distribution and by calculating the F1 score.
To plot the class distribution, you will use the matplotlib library to create a bar chart for each dataset. You will use the value_counts() method to get the counts of each class and the plot() method to create the bar chart. You will also use the subplots() method to create a figure with multiple plots and the tight_layout() method to adjust the spacing between the plots. The following code shows how to plot the class distribution:
```python
# Plot the class distribution
fig, axes = plt.subplots(3, 2, figsize=(15, 15))

# Plot the original data
y_train.value_counts().plot(kind="bar", ax=axes[0, 0], title="Original Data")

# Plot the random oversampling data
y_ros.value_counts().plot(kind="bar", ax=axes[0, 1], title="Random Oversampling Data")

# Plot the SMOTE data
y_smote.value_counts().plot(kind="bar", ax=axes[1, 0], title="SMOTE Data")

# Plot the random undersampling data
y_rus.value_counts().plot(kind="bar", ax=axes[1, 1], title="Random Undersampling Data")

# Plot the NearMiss data
y_nm.value_counts().plot(kind="bar", ax=axes[2, 0], title="NearMiss Data")

# Adjust the spacing and show the plots
fig.tight_layout()
plt.show()
```
As you can see, the original data is imbalanced, with more samples in the negative class than in the positive class. The random oversampling and SMOTE data are balanced, with an equal number of samples in both classes (the minority class is grown to match the majority). The random undersampling and NearMiss data are also balanced, but with fewer samples in both classes (the majority class is shrunk to match the minority).
To calculate the F1 score, you will use the f1_score() function from scikit-learn's sklearn.metrics module. You will also use the LogisticRegression class from scikit-learn to train a logistic regression model on each dataset and make predictions on the testing set. The following code shows how to calculate the F1 score:
```python
# Train and evaluate a logistic regression model on each dataset
datasets = {
    "Original": (X_train, y_train),
    "Random Oversampling": (X_ros, y_ros),
    "SMOTE": (X_smote, y_smote),
    "Random Undersampling": (X_rus, y_rus),
    "NearMiss": (X_nm, y_nm),
}

f1_scores = {}
for name, (X_res, y_res) in datasets.items():
    # max_iter is raised so the solver converges on the encoded features
    lr = LogisticRegression(max_iter=1000)
    lr.fit(X_res, y_res)
    y_pred = lr.predict(X_test)
    f1_scores[name] = f1_score(y_test, y_pred, pos_label="yes")
    print(name, f1_scores[name])
```

The printed scores let you compare each sampling technique directly against the original imbalanced data and pick the one with the highest F1 score for the positive class.
5. How to Evaluate F1 Score and Compare Different Models
In this section, you will learn how to evaluate F1 score and compare different models on imbalanced data. You will use the scikit-learn library, which provides various tools and metrics for machine learning.
First, you will need to split your data into training and testing sets, using the train_test_split function. This will allow you to train your models on the training set and evaluate their performance on the testing set. You can use the stratify parameter to ensure that the class distribution is preserved in both sets. For example, you can use the following code to split your data:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```
Next, you will need to import the models that you want to compare. In this tutorial, you will compare three models: a logistic regression, a decision tree, and a random forest. You can use the following code to import the models:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
```
Then, you will need to create a dictionary that maps each model name to its instance. You can use the following code to create the dictionary:
```python
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}
```
Now, you are ready to train and evaluate each model. You will use a for loop to iterate over the models dictionary and perform the following steps for each model:
- Fit the model on the training set using the fit method.
- Predict the labels on the testing set using the predict method.
- Compute the F1 score on the testing set using the f1_score function from the sklearn.metrics module.
- Print the model name and the F1 score.
You can use the following code to implement the for loop:
```python
from sklearn.metrics import f1_score

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # pos_label="yes" marks the minority (subscribed) class as positive
    f1 = f1_score(y_test, y_pred, pos_label="yes")
    print(name, f1)
```
After running the code, you should see something like this:
```
Logistic Regression 0.8
Decision Tree 0.75
Random Forest 0.85
```
As you can see, the random forest model has the highest F1 score of the three, which makes it the best choice for this imbalanced data. You can also use other metrics, such as precision, recall, accuracy, or a confusion matrix, to compare the models. However, F1 score is a good metric to use when you care about both precision and recall.
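For example, a quick way to inspect precision, recall, and F1 for both classes at once is scikit-learn's classification_report, sketched here on the predictions of the last model in the loop above (it assumes the same "yes"/"no" labels used throughout this tutorial):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, and F1, plus the raw confusion matrix,
# for the last model evaluated in the loop above
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```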
Congratulations, you have learned how to evaluate F1 score and compare different models on imbalanced data. The next section concludes the tutorial and points you to some resources for further learning.
6. Conclusion
In this tutorial, you have learned how to use sampling techniques to optimize F1 score for imbalanced data in machine learning. You have covered the following topics:
- What F1 score is and why it is important for evaluating machine learning models on imbalanced data.
- What sampling techniques are and how they modify the distribution of the classes in the dataset.
- What oversampling and undersampling methods are, and what their advantages and disadvantages are.
- How to apply SMOTE and NearMiss, two popular sampling techniques, in Python using the imbalanced-learn library.
- How to evaluate F1 score and compare different models using the scikit-learn library.
By applying sampling techniques, you can create a more balanced dataset that can improve the performance of your machine learning models. By evaluating F1 score, you can measure the trade-off between precision and recall and reflect the overall quality of your models. By comparing different models, you can choose the best model for your imbalanced data problem.
We hope you enjoyed this tutorial and learned something new and useful. If you want to learn more about sampling techniques, F1 score, or machine learning in general, here are some resources that you may find helpful:
- Imbalanced-learn User Guide: A comprehensive guide on how to use the imbalanced-learn library, with examples and references.
- How to Use F1-Score for Imbalanced Classification: A blog post by Jason Brownlee that explains how to calculate and interpret F1 score for imbalanced classification problems.
- Machine Learning by Andrew Ng: A popular online course on Coursera that covers the basics of machine learning, including supervised and unsupervised learning, linear and logistic regression, neural networks, support vector machines, and more.
Thank you for reading this tutorial and happy learning!