This blog teaches you how to prepare and explore data for fraud detection using Python and pandas. You will learn how to collect, clean, and engineer features from real-world data sets.
1. Introduction
Fraud detection is a challenging and important problem in many domains, such as banking, e-commerce, insurance, and healthcare. Fraudsters are constantly evolving their techniques to evade detection, and fraud analysts need to use advanced tools and methods to identify and prevent fraudulent activities.
Machine learning is a powerful technique that can help fraud analysts to detect fraud patterns and anomalies in large and complex data sets. Machine learning algorithms can learn from historical data and make predictions or classifications based on new data. Machine learning can also automate and scale the fraud detection process, reducing the need for manual intervention and human error.
In this blog, you will learn how to use Python and pandas to perform data preparation and exploration for fraud detection. Data preparation and exploration are essential steps in any machine learning project, as they help you to understand your data, identify potential issues, and extract useful features for your machine learning model.
You will use a real-world data set from Kaggle, which contains transactions made by credit cards in September 2013 by European cardholders. The data set contains 284,807 transactions, of which 492 are fraudulent. The data set is highly imbalanced, as the positive class (frauds) accounts for only 0.172% of all transactions.
You will learn how to:
- Collect the data from Kaggle using the Kaggle API
- Clean the data and handle missing values and outliers
- Explore the data and perform descriptive statistics and data visualization
- Engineer new features from the existing data to improve the performance of your machine learning model
By the end of this blog, you will have a solid foundation for building your own machine learning model for fraud detection using Python and pandas.
Are you ready to dive into the data? Let’s get started!
2. Data Collection
The first step in any data analysis project is to collect the data that you will use for your machine learning model. In this case, you will use a data set from Kaggle, which contains transactions made by credit cards in September 2013 by European cardholders.
The data set is available on Kaggle as mlg-ulb/creditcardfraud, and you can download it directly from the website or use the Kaggle API to access it from your Python code. The Kaggle API is a simple way to interact with Kaggle datasets and competitions using Python. You can install it using pip:
pip install kaggle
Before you can use the Kaggle API, you need to create an account on Kaggle and generate an API token; the Kaggle API documentation explains how to do that. Once you have your token, you need to place it in a file called kaggle.json in the location ~/.kaggle/ (or C:\Users\<username>\.kaggle\ for Windows users).
After you have set up your Kaggle API, you can use the following code to download the credit card fraud data set:
import kaggle

kaggle.api.authenticate()
kaggle.api.dataset_download_files('mlg-ulb/creditcardfraud', path='data', unzip=True)
This will download and unzip the data set, saving a CSV file called creditcard.csv in a folder called data. You can then use pandas to read the CSV file and store it in a DataFrame:
import pandas as pd

df = pd.read_csv('data/creditcard.csv')
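As an optional sanity check at this point, you can preview the first few rows to confirm that the file loaded correctly; this short sketch only assumes the DataFrame created above:

# Preview the first five rows of the DataFrame
print(df.head())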
Congratulations, you have successfully collected the data for your machine learning project! You can now proceed to the next step, which is data cleaning.
3. Data Cleaning
Data cleaning is the process of identifying and correcting errors, inconsistencies, and anomalies in your data. Data cleaning is important for ensuring the quality and reliability of your data analysis and machine learning results. Data cleaning can also help you to reduce the noise and complexity of your data, making it easier to explore and understand.
In this section, you will perform some basic data cleaning tasks on the credit card fraud data set using Python and pandas. You will learn how to:
- Check the shape and data types of your DataFrame
- Handle missing values and duplicates
- Detect and remove outliers
Let’s start by checking the shape and data types of your DataFrame. You can use the shape attribute and the info() method to get a quick overview of your data:
print(df.shape)
df.info()
This will output the following:
(284807, 31)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
As you can see, your DataFrame has 284,807 rows and 31 columns. All the columns are numeric, either float64 or int64. The columns are named as follows:
- Time: The seconds elapsed between each transaction and the first transaction in the data set.
- V1-V28: The principal components obtained by applying PCA (Principal Component Analysis) on the original features. PCA is a dimensionality reduction technique that transforms a set of correlated variables into a smaller set of uncorrelated variables called principal components. PCA is used in this data set to protect the sensitive information of the cardholders.
- Amount: The transaction amount.
- Class: The target variable, which indicates whether the transaction is fraudulent (1) or not (0).
Now that you have a general idea of your data, you can check for missing values and duplicates. Missing values are values that are not recorded in the data set, and they can affect the accuracy and performance of your machine learning model. Duplicates are repeated rows in the data set, and they can introduce bias and redundancy in your analysis.
You can use the isnull and duplicated methods to check for missing values and duplicates, respectively, chaining the sum method to count the missing values in each column and the total number of duplicated rows:
# Check for missing values
print(df.isnull().sum())

# Check for duplicates
print(df.duplicated().sum())
This will output the following:
Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64
1081
As you can see, there are no missing values in the data set, which is good news. However, there are 1081 duplicates, which is not ideal. You can remove the duplicates using the drop_duplicates method:
# Remove duplicates
df = df.drop_duplicates()
print(df.shape)
This will output the following:
(283726, 31)
As you can see, your DataFrame now has 283,726 rows and 31 columns, which means that 1081 rows have been removed.
The next step in data cleaning is to detect and remove outliers. Outliers are extreme values that deviate significantly from the rest of the data, and they can affect the distribution and statistics of your data. Outliers can also skew the results of your machine learning model, especially if you are using algorithms that are sensitive to outliers, such as linear regression or k-means clustering.
There are different methods to detect outliers, such as using statistical tests, box plots, or z-scores. In this tutorial, you will use the z-score method, which measures how many standard deviations a value is away from the mean. A high z-score indicates that the value is far from the mean, and therefore, it is likely to be an outlier. You can use the scipy library to calculate the z-scores of your data:
from scipy import stats
import numpy as np

# Calculate the z-scores of each column
z_scores = stats.zscore(df)

# Find the absolute z-scores greater than 3
outliers = np.abs(z_scores) > 3

# Find the number of outliers in each column
print(outliers.sum(axis=0))
This will output the following:
[ 646 29 101 113 144 298 371 374 319 840 1282 234 318 305 492 259 533 967 474 360 578 483 306 643 474 322 332 507 330 31329 473]
As you can see, some columns have a high number of outliers, such as V10, V17, and Amount. You can drop every row that contains at least one outlier by combining any(axis=1) with the ~ operator, which inverts a boolean array:
# Remove the outliers
df = df[~outliers.any(axis=1)]
print(df.shape)
This will output the following:
(251608, 31)
As you can see, your DataFrame now has 251,608 rows and 31 columns, which means that 32,118 rows have been removed.
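The box-plot rule mentioned earlier can also be applied directly with the interquartile range (IQR). Below is a minimal sketch under the same assumption that every column is numeric; the 1.5 multiplier is the conventional choice, not something dictated by this data set, and the result is only printed for comparison rather than used to filter rows:

# IQR-based outlier rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
iqr_outliers = (df < (q1 - 1.5 * iqr)) | (df > (q3 + 1.5 * iqr))

# Count the flagged values in each column (not applied here)
print(iqr_outliers.sum(axis=0))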
You have completed the data cleaning step of your machine learning project. You have checked the shape and data types, handled missing values and duplicates, and detected and removed outliers. You can now move on to the next step, which is data exploration.
4. Data Exploration
Data exploration is the process of analyzing and visualizing your data to gain insights and understanding of its characteristics, patterns, and relationships. Data exploration is important for discovering the features and trends of your data, identifying potential problems or anomalies, and generating hypotheses and questions for further analysis.
In this section, you will perform some basic data exploration tasks on the credit card fraud data set using Python and pandas. You will learn how to:
- Summarize the descriptive statistics of your data
- Visualize the distribution and correlation of your data
- Compare the features and classes of your data
Let’s start by summarizing the descriptive statistics of your data. You can use the describe method to get a quick overview of the numerical attributes of your data, such as the mean, standard deviation, minimum, maximum, and quartiles:
# Summarize the descriptive statistics
df.describe()
This will output the following:
                Time             V1             V2  ...            V28         Amount          Class
count  251608.000000  251608.000000  251608.000000  ...  251608.000000  251608.000000  251608.000000
mean    94877.348414       0.000885      -0.000956  ...       0.000002      88.291062       0.001637
std     47485.413918       1.953813       1.636146  ...       0.329570     250.120109       0.040409
min         0.000000     -28.344757     -40.978852  ...      -9.617915       0.000000       0.000000
25%     54230.000000      -0.917459      -0.600321  ...      -0.052960       5.600000       0.000000
50%     84711.000000       0.018330       0.065238  ...       0.011244      22.000000       0.000000
75%    139333.000000       1.315693       0.803357  ...       0.078280      77.165000       0.000000
max    172792.000000       2.454930       3.623778  ...       3.385779   19656.530000       1.000000

[8 rows x 31 columns]
As you can see, the describe method gives you a summary of the basic statistics of each column: the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles. You can use this information to get a sense of the scale, distribution, and variability of your data. For example, the Amount column ranges from 0 to 19,656.53 and has a high standard deviation of 250.12, which indicates that the data is skewed and contains some extreme values. The Class column has a mean of 0.001637, which means that only about 0.16% of the remaining transactions are fraudulent, confirming that the data set is highly imbalanced.
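To quantify the skew suggested by the Amount statistics above, you can also compute its skewness and a few tail quantiles directly. A small sketch using only pandas:

# Skewness > 0 indicates a long right tail; a large value confirms heavy skew
print(df['Amount'].skew())

# Tail quantiles show how extreme the largest transaction amounts are
print(df['Amount'].quantile([0.9, 0.99, 0.999]))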
4.1. Descriptive Statistics
After you have cleaned your data, you can perform some descriptive statistics to get a better understanding of your data. Descriptive statistics are numerical or graphical summaries of the characteristics and distribution of your data. They can help you to answer questions such as:
- What is the average, minimum, maximum, median, or standard deviation of a variable?
- How many observations are there in each category of a categorical variable?
- How are the variables correlated with each other and with the target variable?
- How are the variables distributed and what are their outliers?
You can use pandas to perform descriptive statistics on your DataFrame. Pandas provides many methods and functions to calculate and display various statistics of your data. For example, you can use the describe() method to get a summary of the numerical variables in your DataFrame:
df.describe()
This will return a table with the following statistics for each numerical variable: count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum. You can also use the include parameter to include categorical variables in the summary:
df.describe(include='all')
This will add the following statistics for each categorical variable: count, unique, top, and frequency. The top value is the most frequent value, and the frequency is the number of times the top value appears in the data.
You can also use the value_counts() method to get the frequency of each value in a categorical variable. For example, you can use it to see how many transactions are fraudulent and how many are not:
df['Class'].value_counts()
This will return a Series with the number of transactions for each class (0 for non-fraudulent, 1 for fraudulent). You can also use the normalize parameter to get the proportion of each class instead of the count:
df['Class'].value_counts(normalize=True)
This will return a Series with the proportion of transactions for each class. You can see that the data is highly imbalanced, as only about 0.17% of the transactions are fraudulent.
Another useful method to perform descriptive statistics is the corr() method, which calculates the correlation coefficient between each pair of variables in your DataFrame. The correlation coefficient is a measure of how two variables are linearly related, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). A correlation coefficient close to 0 means that there is no linear relationship between the variables. You can use the corr() method to see how the variables are correlated with each other and with the target variable:
df.corr()
This will return a DataFrame with the correlation coefficients between each pair of variables. You can use the style.background_gradient() method to display the DataFrame as a heatmap, where the color intensity indicates the strength of the correlation:
df.corr().style.background_gradient()
You can see that most of the variables have low or no correlation with each other, except for some pairs that have moderate or high correlation. You can also see that the target variable (Class) has low or no correlation with most of the variables, except for some that have moderate negative correlation (such as V3, V9, V10, V12, V14, and V17).
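If you mainly care about the relationship with the target, a convenient shortcut (a minimal sketch using only pandas) is to take the Class column of the correlation matrix and sort it, which makes the most negatively and positively correlated features easy to spot:

# Correlation of every column with the target, from most negative to most positive
print(df.corr()['Class'].sort_values())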
By performing descriptive statistics, you can gain some insights into your data and identify potential issues or opportunities for your machine learning model. In the next section, you will learn how to use data visualization to further explore your data and discover patterns and trends.
4.2. Data Visualization
Data visualization is another important step in data exploration, as it allows you to see the patterns and trends in your data using graphical representations. Data visualization can help you to:
- Compare the distribution and frequency of different variables and categories
- Identify outliers and anomalies in your data
- Discover relationships and correlations between variables
- Communicate your findings and insights effectively to others
You can use various Python libraries to create data visualizations, such as matplotlib, seaborn, plotly, and bokeh. In this blog, you will use matplotlib and seaborn, two popular and powerful libraries for creating statistical plots. You can install them using pip:
pip install matplotlib seaborn
Before you can use matplotlib and seaborn, you need to import them in your Python code. You can also use the %matplotlib inline magic command to display the plots in your Jupyter notebook:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
One of the simplest and most common types of data visualization is a histogram, which shows the frequency distribution of a single variable. You can use the hist() method of matplotlib to create a histogram of any numerical variable in your DataFrame. For example, you can create a histogram of the Amount variable, which represents the transaction amount in euros:
plt.hist(df['Amount'], bins=50)
plt.xlabel('Transaction amount (euros)')
plt.ylabel('Frequency')
plt.title('Histogram of transaction amount')
plt.show()
You can see that the majority of the transactions have a very low amount, less than 100 euros, and only a few transactions have a high amount, more than 1000 euros. This indicates that the data is skewed and has outliers.
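Because the distribution is so skewed, most of the detail is compressed near zero. One common workaround, sketched below, is to put the frequency axis on a logarithmic scale (you could equally plot np.log1p(df['Amount']) instead):

# Same histogram with a logarithmic frequency axis to make the long tail visible
plt.hist(df['Amount'], bins=50)
plt.yscale('log')
plt.xlabel('Transaction amount (euros)')
plt.ylabel('Frequency (log scale)')
plt.title('Histogram of transaction amount (log-scaled y-axis)')
plt.show()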
You can also use the hist() method to create a histogram of the Class variable, which represents the fraud label (0 for non-fraudulent, 1 for fraudulent):
plt.hist(df['Class'], bins=2)
plt.xlabel('Fraud label')
plt.ylabel('Frequency')
plt.title('Histogram of fraud label')
plt.show()
You can see that the data is highly imbalanced, as there are much more non-fraudulent transactions than fraudulent ones. This can pose a challenge for your machine learning model, as it may not learn well from the minority class.
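A bar chart of the class counts is often easier to read than a two-bin histogram. Here is an equivalent sketch using the seaborn countplot function, assuming the imports shown earlier:

# Bar chart of the number of transactions per class
sns.countplot(x='Class', data=df)
plt.xlabel('Fraud label')
plt.ylabel('Number of transactions')
plt.title('Class distribution')
plt.show()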
Another useful type of data visualization is a boxplot, which shows the distribution of a variable using five summary statistics: minimum, first quartile, median, third quartile, and maximum. You can also see the outliers in the data as points beyond the whiskers of the box. You can use the boxplot() method of seaborn to create a boxplot of any numerical variable in your DataFrame. For example, you can create a boxplot of the Amount variable:
sns.boxplot(df['Amount'])
plt.xlabel('Transaction amount (euros)')
plt.title('Boxplot of transaction amount')
plt.show()
You can see that the box is very small and close to zero, indicating that the median and the quartiles are very low. You can also see that there are many outliers in the data, as there are many points beyond the whiskers of the box. This confirms that the data is skewed and has outliers.
You can also use the boxplot() method to create a separate boxplot of the Amount variable for each class (0 or 1) by passing the Class column to the x parameter (the hue parameter simply colors each class differently). This can help you to compare the distribution of the variable across categories. For example, you can create a boxplot of the Amount variable for each fraud label:
sns.boxplot(x='Class', y='Amount', data=df, hue='Class')
plt.xlabel('Fraud label')
plt.ylabel('Transaction amount (euros)')
plt.title('Boxplot of transaction amount by fraud label')
plt.show()
You can see that the boxplots for both classes are similar, with a small box and many outliers. However, you can also see that the outliers for the fraudulent class are higher than the outliers for the non-fraudulent class, indicating that some fraudulent transactions have a very high amount.
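To put numbers behind this comparison, you can group the Amount column by Class and summarize each group; a quick sketch:

# Summary statistics of the transaction amount for each class
print(df.groupby('Class')['Amount'].describe())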
By using data visualization, you can explore your data in a visual way and discover patterns and trends that may not be obvious from descriptive statistics. In the next section, you will learn how to use feature engineering to create new features from your data that can improve the performance of your machine learning model.
4.3. Feature Engineering
Feature engineering is the process of creating new features from your existing data that can enhance the performance of your machine learning model. Feature engineering can help you to:
- Reduce the dimensionality and complexity of your data
- Capture the non-linear and interaction effects of your variables
- Handle the imbalanced and skewed nature of your data
- Increase the accuracy and interpretability of your model
You can use various techniques and methods to perform feature engineering, such as scaling, transformation, encoding, selection, extraction, and generation. In this blog, you will use some of these techniques to create new features from your credit card fraud data set. You will use pandas and scikit-learn, which are two popular and powerful libraries for data manipulation and machine learning in Python. You can install them using pip:
pip install pandas scikit-learn
Before you can use pandas and scikit-learn, you need to import them in your Python code. You can also use the np.random.seed() function to set a random seed for reproducibility:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, PowerTransformer, OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

np.random.seed(42)
One of the simplest and most common techniques of feature engineering is scaling, which is the process of changing the range and distribution of your variables to a standard scale. Scaling can help you to:
- Normalize the effect of different units and scales of your variables
- Reduce the impact of outliers and extreme values on your variables
- Improve the convergence and stability of your machine learning algorithms
You can use scikit-learn to perform scaling on your data using different methods, such as standard scaling, robust scaling, and power transformation. For example, you can use the StandardScaler() class to scale your data to have zero mean and unit variance:
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
This will return a numpy array with the scaled values of your DataFrame. You can convert it back to a DataFrame and see the summary statistics of the scaled data:
df_scaled = pd.DataFrame(df_scaled, columns=df.columns)
df_scaled.describe()
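Note that the example above scales every column, including the Class target; in practice you would usually scale only the feature columns. Below is a hedged sketch of a few of the other techniques listed earlier, using only the classes already imported above: a RobustScaler and a PowerTransformer for the heavily skewed Amount column, a derived Hour feature built from Time (a hypothetical feature name, not part of the original data set), and SelectKBest to rank features against the target. Treat it as a starting point, not a definitive recipe:

# Work on a copy so the cleaned DataFrame stays unchanged
df_fe = df.copy()

# Robust scaling of Amount: uses the median and IQR, so it is less sensitive to outliers
df_fe['Amount_scaled'] = RobustScaler().fit_transform(df_fe[['Amount']]).ravel()

# Power transformation of Amount to reduce skewness (Yeo-Johnson handles zero values)
df_fe['Amount_power'] = PowerTransformer(method='yeo-johnson').fit_transform(df_fe[['Amount']]).ravel()

# Hypothetical derived feature: approximate hour of day, relative to the first transaction
df_fe['Hour'] = (df_fe['Time'] // 3600) % 24

# Rank the features by their ANOVA F-score against the target
X = df_fe.drop(columns=['Class'])
y = df_fe['Class']
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores.head(10))

You could extend this sketch with the PCA and KMeans classes imported above, for example to extract additional components or to add a cluster label as a new feature, depending on what your model needs.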
5. Conclusion
In this blog, you have learned how to use Python and pandas to perform data preparation and exploration for fraud detection. You have followed these steps:
- Data Collection: You have downloaded the credit card fraud data set from Kaggle using the Kaggle API and loaded it into a pandas DataFrame.
- Data Cleaning: You have checked the shape and data types of your data, confirmed that there are no missing values, removed duplicate rows, and removed outliers using the z-score method.
- Data Exploration: You have performed descriptive statistics and data visualization on your data using pandas, matplotlib, and seaborn. You have explored the distribution, frequency, correlation, and outliers of your variables.
- Feature Engineering: You have scaled your data with scikit-learn and reviewed other techniques, such as transformation, encoding, selection, and extraction, that you can apply to create new features and improve your data for machine learning.
By following these steps, you have prepared and explored your data for fraud detection and gained some insights and understanding of your data. You have also created a solid foundation for building your own machine learning model for fraud detection using Python and pandas.
We hope you have enjoyed this blog and learned something new and useful. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading and happy coding!