This blog teaches you how to prepare and explore data for fraud detection using Python and pandas. You will learn how to collect, clean, and engineer features from real-world data sets.
1. Introduction
Fraud detection is a challenging and important problem in many domains, such as banking, e-commerce, insurance, and healthcare. Fraudsters are constantly evolving their techniques to evade detection, and fraud analysts need to use advanced tools and methods to identify and prevent fraudulent activities.
Machine learning is a powerful technique that can help fraud analysts to detect fraud patterns and anomalies in large and complex data sets. Machine learning algorithms can learn from historical data and make predictions or classifications based on new data. Machine learning can also automate and scale the fraud detection process, reducing the need for manual intervention and human error.
In this blog, you will learn how to use Python and pandas to perform data preparation and exploration for fraud detection. Data preparation and exploration are essential steps in any machine learning project, as they help you to understand your data, identify potential issues, and extract useful features for your machine learning model.
You will use a real-world data set from Kaggle, which contains transactions made by credit cards in September 2013 by European cardholders. The data set contains 284,807 transactions, of which 492 are fraudulent. The data set is highly imbalanced, as the positive class (frauds) accounts for only 0.172% of all transactions.
You will learn how to:
- Collect the data from Kaggle using the Kaggle API
- Clean the data and handle missing values and outliers
- Explore the data and perform descriptive statistics and data visualization
- Engineer new features from the existing data to improve the performance of your machine learning model
By the end of this blog, you will have a solid foundation for building your own machine learning model for fraud detection using Python and pandas.
Are you ready to dive into the data? Let’s get started!
2. Data Collection
The first step in any data analysis project is to collect the data that you will use for your machine learning model. In this case, you will use a data set from Kaggle, which contains transactions made by credit cards in September 2013 by European cardholders.
The data set is available on Kaggle as mlg-ulb/creditcardfraud, and you can download it directly from the website or use the Kaggle API to access it from your Python code. The Kaggle API is a simple way to interact with Kaggle datasets and competitions using Python. You can install it using pip:
pip install kaggle
Before you can use the Kaggle API, you need to create an account on Kaggle and generate an API token; the Kaggle API documentation explains how to do that. Once you have your token, you need to place it in a file called kaggle.json in the location ~/.kaggle/ (or C:\Users\<username>\.kaggle\ for Windows users).
After you have set up your Kaggle API, you can use the following code to download the credit card fraud data set:
import kaggle

kaggle.api.authenticate()
kaggle.api.dataset_download_files('mlg-ulb/creditcardfraud', path='data', unzip=True)
This will download and unzip the data set, saving a CSV file called creditcard.csv in a folder called data. You can then use pandas to read the CSV file and store it in a DataFrame:
import pandas as pd

df = pd.read_csv('data/creditcard.csv')
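As an optional sanity check at this point, you can preview the first few rows to confirm that the file loaded correctly; this short sketch only assumes the DataFrame created above:

# Preview the first five rows of the DataFrame
print(df.head())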
Congratulations, you have successfully collected the data for your machine learning project! You can now proceed to the next step, which is data cleaning.
3. Data Cleaning
Data cleaning is the process of identifying and correcting errors, inconsistencies, and anomalies in your data. Data cleaning is important for ensuring the quality and reliability of your data analysis and machine learning results. Data cleaning can also help you to reduce the noise and complexity of your data, making it easier to explore and understand.
In this section, you will perform some basic data cleaning tasks on the credit card fraud data set using Python and pandas. You will learn how to:
- Check the shape and data types of your DataFrame
- Handle missing values and duplicates
- Detect and remove outliers
Let’s start by checking the shape and data types of your DataFrame. You can use the shape attribute and the info() method to get a quick overview of your data:
print(df.shape)
df.info()
This will output the following:
(284807, 31)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
As you can see, your DataFrame has 284,807 rows and 31 columns. All the columns are numeric, either float64 or int64. The columns are named as follows:
- Time: The seconds elapsed between each transaction and the first transaction in the data set.
- V1-V28: The principal components obtained by applying PCA (Principal Component Analysis) on the original features. PCA is a dimensionality reduction technique that transforms a set of correlated variables into a smaller set of uncorrelated variables called principal components. PCA is used in this data set to protect the sensitive information of the cardholders.
- Amount: The transaction amount.
- Class: The target variable, which indicates whether the transaction is fraudulent (1) or not (0).
Now that you have a general idea of your data, you can check for missing values and duplicates. Missing values are values that are not recorded in the data set, and they can affect the accuracy and performance of your machine learning model. Duplicates are repeated rows in the data set, and they can introduce bias and redundancy in your analysis.
You can use the isnull and duplicated methods to check for missing values and duplicates, respectively, chaining the sum method to count the missing values in each column and the total number of duplicated rows:
# Check for missing values
print(df.isnull().sum())

# Check for duplicates
print(df.duplicated().sum())
This will output the following:
Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64
1081
As you can see, there are no missing values in the data set, which is good news. However, there are 1081 duplicates, which is not ideal. You can remove the duplicates using the drop_duplicates method:
# Remove duplicates
df = df.drop_duplicates()
print(df.shape)
This will output the following:
(283726, 31)
As you can see, your DataFrame now has 283,726 rows and 31 columns, which means that 1081 rows have been removed.
The next step in data cleaning is to detect and remove outliers. Outliers are extreme values that deviate significantly from the rest of the data, and they can affect the distribution and statistics of your data. Outliers can also skew the results of your machine learning model, especially if you are using algorithms that are sensitive to outliers, such as linear regression or k-means clustering.
There are different methods to detect outliers, such as using statistical tests, box plots, or z-scores. In this tutorial, you will use the z-score method, which measures how many standard deviations a value is away from the mean. A high z-score indicates that the value is far from the mean, and therefore, it is likely to be an outlier. You can use the scipy library to calculate the z-scores of your data:
from scipy import stats
import numpy as np

# Calculate the z-scores of each column
z_scores = stats.zscore(df)

# Find the absolute z-scores greater than 3
outliers = np.abs(z_scores) > 3

# Find the number of outliers in each column
print(outliers.sum(axis=0))
This will output the following:
[ 646 29 101 113 144 298 371 374 319 840 1282 234 318 305 492 259 533 967 474 360 578 483 306 643 474 322 332 507 330 31329 473]
As you can see, some columns have a high number of outliers, such as V10, V17, and Amount. You can drop every row that contains at least one outlier by combining any(axis=1) with the ~ operator, which inverts a boolean array:
# Remove the outliers
df = df[~outliers.any(axis=1)]
print(df.shape)
This will output the following:
(251608, 31)
As you can see, your DataFrame now has 251,608 rows and 31 columns, which means that 32,118 rows have been removed.
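The box-plot rule mentioned earlier can also be applied directly with the interquartile range (IQR). Below is a minimal sketch under the same assumption that every column is numeric; the 1.5 multiplier is the conventional choice, not something dictated by this data set, and the result is only printed for comparison rather than used to filter rows:

# IQR-based outlier rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
iqr_outliers = (df < (q1 - 1.5 * iqr)) | (df > (q3 + 1.5 * iqr))

# Count the flagged values in each column (not applied here)
print(iqr_outliers.sum(axis=0))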
You have completed the data cleaning step of your machine learning project. You have checked the shape and data types, handled missing values and duplicates, and detected and removed outliers. You can now move on to the next step, which is data exploration.
4. Data Exploration
Data exploration is the process of analyzing and visualizing your data to gain insights and understanding of its characteristics, patterns, and relationships. Data exploration is important for discovering the features and trends of your data, identifying potential problems or anomalies, and generating hypotheses and questions for further analysis.
In this section, you will perform some basic data exploration tasks on the credit card fraud data set using Python and pandas. You will learn how to:
- Summarize the descriptive statistics of your data
- Visualize the distribution and correlation of your data
- Compare the features and classes of your data
Let’s start by summarizing the descriptive statistics of your data. You can use the describe method to get a quick overview of the numerical attributes of your data, such as the mean, standard deviation, minimum, maximum, and quartiles:
# Summarize the descriptive statistics
df.describe()
This will output the following:
                Time             V1             V2  ...            V28         Amount          Class
count  251608.000000  251608.000000  251608.000000  ...  251608.000000  251608.000000  251608.000000
mean    94877.348414       0.000885      -0.000956  ...       0.000002      88.291062       0.001637
std     47485.413918       1.953813       1.636146  ...       0.329570     250.120109       0.040409
min         0.000000     -28.344757     -40.978852  ...      -9.617915       0.000000       0.000000
25%     54230.000000      -0.917459      -0.600321  ...      -0.052960       5.600000       0.000000
50%     84711.000000       0.018330       0.065238  ...       0.011244      22.000000       0.000000
75%    139333.000000       1.315693       0.803357  ...       0.078280      77.165000       0.000000
max    172792.000000       2.454930       3.623778  ...       3.385779   19656.530000       1.000000

[8 rows x 31 columns]
As you can see, the describe method gives you a summary of the basic statistics of each column: the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles. You can use this information to get a sense of the scale, distribution, and variability of your data. For example, the Amount column ranges from 0 to 19,656.53 and has a high standard deviation of 250.12, which indicates that the data is skewed and contains some extreme values. The Class column has a mean of 0.001637, which means that only about 0.16% of the remaining transactions are fraudulent, confirming that the data set is highly imbalanced.
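To quantify the skew suggested by the Amount statistics above, you can also compute its skewness and a few tail quantiles directly. A small sketch using only pandas:

# Skewness > 0 indicates a long right tail; a large value confirms heavy skew
print(df['Amount'].skew())

# Tail quantiles show how extreme the largest transaction amounts are
print(df['Amount'].quantile([0.9, 0.99, 0.999]))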
4.1. Descriptive Statistics
After you have cleaned your data, you can perform some descriptive statistics to get a better understanding of your data. Descriptive statistics are numerical or graphical summaries of the characteristics and distribution of your data. They can help you to answer questions such as:
- What is the average, minimum, maximum, median, or standard deviation of a variable?
- How many observations are there in each category of a categorical variable?
- How are the variables correlated with each other and with the target variable?
- How are the variables distributed and what are their outliers?
You can use pandas to perform descriptive statistics on your DataFrame. Pandas provides many methods and functions to calculate and display various statistics of your data. For example, you can use the describe() method to get a summary of the numerical variables in your DataFrame:
df.describe()
This will return a table with the following statistics for each numerical variable: count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum. You can also use the include parameter to include categorical variables in the summary:
df.describe(include='all')
This will add the following statistics for each categorical variable: count, unique, top, and frequency. The top value is the most frequent value, and the frequency is the number of times the top value appears in the data.
You can also use the value_counts() method to get the frequency of each value in a categorical variable. For example, you can use it to see how many transactions are fraudulent and how many are not:
df['Class'].value_counts()
This will return a Series with the number of transactions for each class (0 for non-fraudulent, 1 for fraudulent). You can also use the normalize parameter to get the proportion of each class instead of the count:
df['Class'].value_counts(normalize=True)
This will return a Series with the proportion of transactions for each class. You can see that the data is highly imbalanced, as only about 0.17% of the transactions are fraudulent.
Another useful method to perform descriptive statistics is the corr() method, which calculates the correlation coefficient between each pair of variables in your DataFrame. The correlation coefficient is a measure of how two variables are linearly related, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). A correlation coefficient close to 0 means that there is no linear relationship between the variables. You can use the corr() method to see how the variables are correlated with each other and with the target variable:
df.corr()
This will return a DataFrame with the correlation coefficients between each pair of variables. You can use the style.background_gradient() method to display the DataFrame as a heatmap, where the color intensity indicates the strength of the correlation:
df.corr().style.background_gradient()
You can see that most of the variables have low or no correlation with each other, except for some pairs that have moderate or high correlation. You can also see that the target variable (Class) has low or no correlation with most of the variables, except for some that have moderate negative correlation (such as V3, V9, V10, V12, V14, and V17).
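If you mainly care about the relationship with the target, a convenient shortcut (a minimal sketch using only pandas) is to take the Class column of the correlation matrix and sort it, which makes the most negatively and positively correlated features easy to spot:

# Correlation of every column with the target, from most negative to most positive
print(df.corr()['Class'].sort_values())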
By performing descriptive statistics, you can gain some insights into your data and identify potential issues or opportunities for your machine learning model. In the next section, you will learn how to use data visualization to further explore your data and discover patterns and trends.
4.2. Data Visualization
Data visualization is another important step in data exploration, as it allows you to see the patterns and trends in your data using graphical representations. Data visualization can help you to:
- Compare the distribution and frequency of different variables and categories
- Identify outliers and anomalies in your data
- Discover relationships and correlations between variables
- Communicate your findings and insights effectively to others
You can use various Python libraries to create data visualizations, such as matplotlib, seaborn, plotly, and bokeh. In this blog, you will use matplotlib and seaborn, two popular and powerful libraries for creating statistical plots. You can install them using pip:
pip install matplotlib seaborn
Before you can use matplotlib and seaborn, you need to import them in your Python code. You can also use the %matplotlib inline magic command to display the plots in your Jupyter notebook:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
One of the simplest and most common types of data visualization is a histogram, which shows the frequency distribution of a single variable. You can use the hist() method of matplotlib to create a histogram of any numerical variable in your DataFrame. For example, you can create a histogram of the Amount variable, which represents the transaction amount in euros:
plt.hist(df['Amount'], bins=50)
plt.xlabel('Transaction amount (euros)')
plt.ylabel('Frequency')
plt.title('Histogram of transaction amount')
plt.show()
You can see that the majority of the transactions have a very low amount, less than 100 euros, and only a few transactions have a high amount, more than 1000 euros. This indicates that the data is skewed and has outliers.
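Because the distribution is so skewed, most of the detail is compressed near zero. One common workaround, sketched below, is to put the frequency axis on a logarithmic scale (you could equally plot np.log1p(df['Amount']) instead):

# Same histogram with a logarithmic frequency axis to make the long tail visible
plt.hist(df['Amount'], bins=50)
plt.yscale('log')
plt.xlabel('Transaction amount (euros)')
plt.ylabel('Frequency (log scale)')
plt.title('Histogram of transaction amount (log-scaled y-axis)')
plt.show()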
You can also use the hist() method to create a histogram of the Class variable, which represents the fraud label (0 for non-fraudulent, 1 for fraudulent):
plt.hist(df['Class'], bins=2)
plt.xlabel('Fraud label')
plt.ylabel('Frequency')
plt.title('Histogram of fraud label')
plt.show()
You can see that the data is highly imbalanced, as there are much more non-fraudulent transactions than fraudulent ones. This can pose a challenge for your machine learning model, as it may not learn well from the minority class.
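A bar chart of the class counts is often easier to read than a two-bin histogram. Here is an equivalent sketch using the seaborn countplot function, assuming the imports shown earlier:

# Bar chart of the number of transactions per class
sns.countplot(x='Class', data=df)
plt.xlabel('Fraud label')
plt.ylabel('Number of transactions')
plt.title('Class distribution')
plt.show()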
Another useful type of data visualization is a boxplot, which shows the distribution of a variable using five summary statistics: minimum, first quartile, median, third quartile, and maximum. You can also see the outliers in the data as points beyond the whiskers of the box. You can use the boxplot() method of seaborn to create a boxplot of any numerical variable in your DataFrame. For example, you can create a boxplot of the Amount variable:
sns.boxplot(df['Amount'])
plt.xlabel('Transaction amount (euros)')
plt.title('Boxplot of transaction amount')
plt.show()
You can see that the box is very small and close to zero, indicating that the median and the quartiles are very low. You can also see that there are many outliers in the data, as there are many points beyond the whiskers of the box. This confirms that the data is skewed and has outliers.
You can also use the boxplot() method to create a separate boxplot of the Amount variable for each class (0 or 1) by passing the Class column to the x parameter (the hue parameter simply colors each class differently). This can help you to compare the distribution of the variable across categories. For example, you can create a boxplot of the Amount variable for each fraud label:
sns.boxplot(x='Class', y='Amount', data=df, hue='Class')
plt.xlabel('Fraud label')
plt.ylabel('Transaction amount (euros)')
plt.title('Boxplot of transaction amount by fraud label')
plt.show()
You can see that the boxplots for both classes are similar, with a small box and many outliers. However, you can also see that the outliers for the fraudulent class are higher than the outliers for the non-fraudulent class, indicating that some fraudulent transactions have a very high amount.
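To put numbers behind this comparison, you can group the Amount column by Class and summarize each group; a quick sketch:

# Summary statistics of the transaction amount for each class
print(df.groupby('Class')['Amount'].describe())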
By using data visualization, you can explore your data in a visual way and discover patterns and trends that may not be obvious from descriptive statistics. In the next section, you will learn how to use feature engineering to create new features from your data that can improve the performance of your machine learning model.
4.3. Feature Engineering
Feature engineering is the process of creating new features from your existing data that can enhance the performance of your machine learning model. Feature engineering can help you to:
- Reduce the dimensionality and complexity of your data
- Capture the non-linear and interaction effects of your variables
- Handle the imbalanced and skewed nature of your data
- Increase the accuracy and interpretability of your model
You can use various techniques and methods to perform feature engineering, such as scaling, transformation, encoding, selection, extraction, and generation. In this blog, you will use some of these techniques to create new features from your credit card fraud data set. You will use pandas and scikit-learn, which are two popular and powerful libraries for data manipulation and machine learning in Python. You can install them using pip:
pip install pandas scikit-learn
Before you can use pandas and scikit-learn, you need to import them in your Python code. You can also use the np.random.seed() function to set a random seed for reproducibility:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, PowerTransformer, OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

np.random.seed(42)
One of the simplest and most common techniques of feature engineering is scaling, which is the process of changing the range and distribution of your variables to a standard scale. Scaling can help you to:
- Normalize the effect of different units and scales of your variables
- Reduce the impact of outliers and extreme values on your variables
- Improve the convergence and stability of your machine learning algorithms
You can use scikit-learn to perform scaling on your data using different methods, such as standard scaling, robust scaling, and power transformation. For example, you can use the StandardScaler() class to scale your data to have zero mean and unit variance:
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
This will return a numpy array with the scaled values of your DataFrame. You can convert it back to a DataFrame and see the summary statistics of the scaled data:
df_scaled = pd.DataFrame(df_scaled, columns=df.columns)
df_scaled.describe()
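Note that the example above scales every column, including the Class target; in practice you would usually scale only the feature columns. Below is a hedged sketch of a few of the other techniques listed earlier, using only the classes already imported above: a RobustScaler and a PowerTransformer for the heavily skewed Amount column, a derived Hour feature built from Time (a hypothetical feature name, not part of the original data set), and SelectKBest to rank features against the target. Treat it as a starting point, not a definitive recipe:

# Work on a copy so the cleaned DataFrame stays unchanged
df_fe = df.copy()

# Robust scaling of Amount: uses the median and IQR, so it is less sensitive to outliers
df_fe['Amount_scaled'] = RobustScaler().fit_transform(df_fe[['Amount']]).ravel()

# Power transformation of Amount to reduce skewness (Yeo-Johnson handles zero values)
df_fe['Amount_power'] = PowerTransformer(method='yeo-johnson').fit_transform(df_fe[['Amount']]).ravel()

# Hypothetical derived feature: approximate hour of day, relative to the first transaction
df_fe['Hour'] = (df_fe['Time'] // 3600) % 24

# Rank the features by their ANOVA F-score against the target
X = df_fe.drop(columns=['Class'])
y = df_fe['Class']
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores.head(10))

You could extend this sketch with the PCA and KMeans classes imported above, for example to extract additional components or to add a cluster label as a new feature, depending on what your model needs.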
5. Conclusion
In this blog, you have learned how to use Python and pandas to perform data preparation and exploration for fraud detection. You have followed these steps:
- Data Collection: You have downloaded the credit card fraud data set from Kaggle using the Kaggle API and loaded it into a pandas DataFrame.
- Data Cleaning: You have checked the shape and data types of your data, confirmed that there are no missing values, removed duplicate rows, and removed outliers using the z-score method.
- Data Exploration: You have performed descriptive statistics and data visualization on your data using pandas, matplotlib, and seaborn. You have explored the distribution, frequency, correlation, and outliers of your variables.
- Feature Engineering: You have scaled your data with scikit-learn and reviewed other techniques, such as transformation, encoding, selection, and extraction, that you can apply to create new features and improve your data for machine learning.
By following these steps, you have prepared and explored your data for fraud detection and gained some insights and understanding of your data. You have also created a solid foundation for building your own machine learning model for fraud detection using Python and pandas.
We hope you have enjoyed this blog and learned something new and useful. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading and happy coding!