Credit Card Fraud Detection

By Bruno Ferrari | March 26, 2020

Objective

Our goal is to train a neural network to detect fraudulent credit card transactions in a dataset covering two days of transactions by European cardholders.

Source: https://www.kaggle.com/mlg-ulb/creditcardfraud/data

Data

credit = read.csv(path)  # path to the creditcard.csv file from Kaggle

The dataset contains credit card transactions made by European cardholders in September 2013, over a period of two days.

As we can see, this dataset consists of thirty explanatory variables and a response variable that indicates whether a transaction was fraudulent. Due to confidentiality issues, it contains only numerical input variables, which are the result of a PCA transformation; the only variables that have not been transformed with PCA are ‘Time’ and ‘Amount’. (The ‘Time’ variable is not relevant for us, because it only marks which transactions happened first.)

str(credit)
## 'data.frame':    284807 obs. of  31 variables:
##  $ Time  : num  0 0 1 1 2 2 4 7 7 9 ...
##  $ V1    : num  -1.36 1.192 -1.358 -0.966 -1.158 ...
##  $ V2    : num  -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
##  $ V3    : num  2.536 0.166 1.773 1.793 1.549 ...
##  $ V4    : num  1.378 0.448 0.38 -0.863 0.403 ...
##  $ V5    : num  -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
##  $ V6    : num  0.4624 -0.0824 1.8005 1.2472 0.0959 ...
##  $ V7    : num  0.2396 -0.0788 0.7915 0.2376 0.5929 ...
##  $ V8    : num  0.0987 0.0851 0.2477 0.3774 -0.2705 ...
##  $ V9    : num  0.364 -0.255 -1.515 -1.387 0.818 ...
##  $ V10   : num  0.0908 -0.167 0.2076 -0.055 0.7531 ...
##  $ V11   : num  -0.552 1.613 0.625 -0.226 -0.823 ...
##  $ V12   : num  -0.6178 1.0652 0.0661 0.1782 0.5382 ...
##  $ V13   : num  -0.991 0.489 0.717 0.508 1.346 ...
##  $ V14   : num  -0.311 -0.144 -0.166 -0.288 -1.12 ...
##  $ V15   : num  1.468 0.636 2.346 -0.631 0.175 ...
##  $ V16   : num  -0.47 0.464 -2.89 -1.06 -0.451 ...
##  $ V17   : num  0.208 -0.115 1.11 -0.684 -0.237 ...
##  $ V18   : num  0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
##  $ V19   : num  0.404 -0.146 -2.262 -1.233 0.803 ...
##  $ V20   : num  0.2514 -0.0691 0.525 -0.208 0.4085 ...
##  $ V21   : num  -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
##  $ V22   : num  0.27784 -0.63867 0.77168 0.00527 0.79828 ...
##  $ V23   : num  -0.11 0.101 0.909 -0.19 -0.137 ...
##  $ V24   : num  0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
##  $ V25   : num  0.129 0.167 -0.328 0.647 -0.206 ...
##  $ V26   : num  -0.189 0.126 -0.139 -0.222 0.502 ...
##  $ V27   : num  0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
##  $ V28   : num  -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
##  $ Amount: num  149.62 2.69 378.66 123.5 69.99 ...
##  $ Class : int  0 0 0 0 0 0 0 0 0 0 ...

Another aspect of this dataset is that its classes are highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions.
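This imbalance can be verified directly from the Class column. A quick sketch, assuming `credit` has been loaded as above:

```r
# Count each class and compute the fraud fraction
table(credit$Class)       # 0: 284315, 1: 492
mean(credit$Class == 1)   # fraction of frauds, about 0.0017 (0.172%)
```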

Exploiting the fact that PCA has already been applied to the data, we can make some visual inspections. If we recall the characteristics of the PCA technique, the first components can be used to summarize the dataset.

Although we don’t have the original data, it is possible to know how much of its variance is explained by these components. For the first three components this amounts to around 28.84%, as we see below.

credit_2 = credit[, -c(1,30,31)]  # drop Time, Amount and Class
pca_credit = prcomp(credit_2)
summary(pca_credit)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6
## Standard deviation     1.9587 1.65131 1.51626 1.41587 1.38025 1.33227
## Proportion of Variance 0.1248 0.08873 0.07481 0.06523 0.06199 0.05776
## Cumulative Proportion  0.1248 0.21357 0.28838 0.35361 0.41560 0.47335
##                           PC7     PC8     PC9    PC10   PC11    PC12
## Standard deviation     1.2371 1.19435 1.09863 1.08885 1.0207 0.99920
## Proportion of Variance 0.0498 0.04642 0.03927 0.03858 0.0339 0.03249
## Cumulative Proportion  0.5232 0.56957 0.60884 0.64742 0.6813 0.71381
##                           PC13   PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.99527 0.9586 0.91532 0.87625 0.84934 0.83818
## Proportion of Variance 0.03223 0.0299 0.02726 0.02498 0.02347 0.02286
## Cumulative Proportion  0.74605 0.7760 0.80321 0.82819 0.85167 0.87453
##                           PC19    PC20    PC21    PC22    PC23    PC24
## Standard deviation     0.81404 0.77093 0.73452 0.72570 0.62446 0.60565
## Proportion of Variance 0.02156 0.01934 0.01756 0.01714 0.01269 0.01194
## Cumulative Proportion  0.89609 0.91543 0.93298 0.95012 0.96281 0.97474
##                           PC25    PC26   PC27    PC28
## Standard deviation     0.52128 0.48223 0.4036 0.33008
## Proportion of Variance 0.00884 0.00757 0.0053 0.00355
## Cumulative Proportion  0.98359 0.99115 0.9964 1.00000

Here we can see that the fraudulent and non-fraudulent transactions are, in general, quite similar. In the plot there are indeed some red cases that separate from the blue ones, but considering how unbalanced the data is, this may not be representative. This makes our job harder, because it is not clear which characteristics make a transaction fraudulent.
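The plot described above is not reproduced here, but a minimal sketch of it, plotting the scores of the first two principal components colored by class (red for fraud, blue otherwise), could look like this, using `pca_credit` from the chunk above:

```r
# Scores of the first two principal components; frauds are ordered last
# so the few red points are not hidden under the majority class
idx = order(credit$Class)
plot(pca_credit$x[idx, 1], pca_credit$x[idx, 2],
     col = ifelse(credit$Class[idx] == 1, "red", "blue"),
     pch = 20, cex = 0.5, xlab = "PC1", ylab = "PC2")
```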

Model Fitting

As discussed above, the data has two main characteristics:

1. Highly unbalanced classes.

2. Homogeneity, i.e. no clear separation between the classes, at least in low dimensions.

Because of that, we are going to use a neural network to identify fraudulent transactions. Neural networks have a huge capacity to adapt to many kinds of raw data and therefore (in general) do not need data transformations. This is important because we do not have access to the original data.

Packages used in this analysis:

library("caTools")
library("caret")
library("keras")
library("ROCR")

Continuing, let’s split the dataset into training and test sets. We also drop the time feature from the dataset.

credit = credit[, -c(1)]  # drop the Time column

set.seed(42)

split = sample.split(credit$Class, SplitRatio = 0.75)
train = subset(credit, split==TRUE)
test = subset(credit, split==FALSE)

Using the keras package, we create our neural network with a 29-feature input (the dimensionality of the dataset), one hidden layer with 32 neurons, and 1 neuron in the output layer.

model <- keras_model_sequential(name = "credit_NN") 
model %>% 
  layer_dense(units = 32, activation = 'relu', input_shape = 29, kernel_initializer = 'uniform', name = "NN_IN") %>%
  layer_dense(units = 1 , activation = 'sigmoid', kernel_initializer = 'uniform', name = "NN_OUT")
model
## Model
## Model: "credit_NN"
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## NN_IN (Dense)                    (None, 32)                    960         
## ___________________________________________________________________________
## NN_OUT (Dense)                   (None, 1)                     33          
## ===========================================================================
## Total params: 993
## Trainable params: 993
## Non-trainable params: 0
## ___________________________________________________________________________

Compile the network, choosing binary cross-entropy loss, the Adam optimizer, and accuracy as the evaluation metric.

model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = "adam",
  metrics = c('accuracy')
)

Fit the network to the training data.

history <- model %>% fit(
  x = as.matrix(train[, -c(30)]), y = train[, c(30)], 
  epochs = 30, batch_size = 128, 
  validation_split = 0.2
)

plot(history)

Results

To evaluate how good our network is, we set up a dummy model that predicts every transaction as the majority class (0, not fraud). As we see below, the accuracy of this model is around 99.83%, so it is desirable that our neural network achieve better results.

y_dummy = replicate(nrow(credit), 0)

confusionMatrix(data = as.factor(y_dummy), reference = as.factor(credit$Class), positive = "1", mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 284315    492
##          1      0      0
##                                           
##                Accuracy : 0.9983          
##                  95% CI : (0.9981, 0.9984)
##     No Information Rate : 0.9983          
##     P-Value [Acc > NIR] : 0.512           
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##               Precision :       NA        
##                  Recall : 0.000000        
##                      F1 :       NA        
##              Prevalence : 0.001727        
##          Detection Rate : 0.000000        
##    Detection Prevalence : 0.000000        
##       Balanced Accuracy : 0.500000        
##                                           
##        'Positive' Class : 1               
## 

Here we can see the results of our model, which has an accuracy of 99.94%, slightly above the dummy model. But other statistics are also important, such as the recall rate, which measures how well the model correctly identifies the fraud class, our main objective. Considering again how unbalanced the classes are, a recall above 70% is a good rate.

y_pred = model %>% predict_classes(as.matrix(test[,-c(30)]))
confusionMatrix(as.factor(y_pred), as.factor(test$Class), mode = "prec_recall", positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 71068    29
##          1    11    94
##                                           
##                Accuracy : 0.9994          
##                  95% CI : (0.9992, 0.9996)
##     No Information Rate : 0.9983          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8243          
##                                           
##  Mcnemar's Test P-Value : 0.00719         
##                                           
##               Precision : 0.895238        
##                  Recall : 0.764228        
##                      F1 : 0.824561        
##              Prevalence : 0.001727        
##          Detection Rate : 0.001320        
##    Detection Prevalence : 0.001475        
##       Balanced Accuracy : 0.882036        
##                                           
##        'Positive' Class : 1               
## 
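As a sanity check, the precision, recall and F1 reported above follow directly from the confusion matrix counts (TP = 94, FP = 11, FN = 29):

```r
# Recompute the headline metrics from the confusion matrix counts
TP = 94; FP = 11; FN = 29
precision = TP / (TP + FP)                           # 94 / 105 ~ 0.8952
recall    = TP / (TP + FN)                           # 94 / 123 ~ 0.7642
f1 = 2 * precision * recall / (precision + recall)   # ~ 0.8246
```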

Another metric that measures how good the model is, is the AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve. An AUC from 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 excellent, and above 0.9 outstanding. The ROC curve shows the tradeoff between the true positive rate and the false positive rate.

y_pred2 = model %>% predict_proba(as.matrix(test[,-c(30)]))
ROCRpred = prediction(y_pred2, test$Class)
ROCRperf = performance(ROCRpred, "tpr", "fpr")
auc_ROCR = performance(ROCRpred, measure = "auc")
auc_ROCR = auc_ROCR@y.values[[1]]
plot(ROCRperf, colorize=TRUE)
legend(0.8, 0.2, legend=c("AUC Area:", round(auc_ROCR, 2)), cex=0.8)
abline(a = 0, b = 1)

Conclusions

We can observe that neural networks are powerful structures. Despite the highly unbalanced classes, no tuning of the model hyperparameters, and no data manipulation, the generated model shows good results in identifying the fraudulent transactions in the dataset.
