
Machine Learning Imbalanced Data Part 1

This project focuses on a machine learning exercise based on an imbalanced dataset. The project is split into two parts, and this publication covers Part 1.

Part 1 focuses on introducing the main challenges of an imbalanced dataset, such as selecting a proper scoring method and performing a model evaluation:

Part 2 focuses on cross validation and grid search using a pipeline to optimize AUC for model tuning, and on interpreting the results:

As a reference, the work is done with Python and scikit-learn. The dataset is an open dataset from Kaggle. The code and dataset can be found in the links below.


The dataset is very extensive, containing information for almost 300k transactions with a total of 30 features, plus a target output class indicating whether the transaction was a fraud or not. However, there are two particularities in the dataset:
     1. There are only 492 fraud transactions, which corresponds to 0.172% of the cases
     2. Among the 30 features, 28 are masked by a PCA transformation due to confidentiality issues; the other two features, Time and Amount, are not masked

Both particularities are very important and will be discussed later, the imbalanced dataset being the key topic of the present project.

First of all, let's generate a histogram of the input data. The PCA features were standardized prior to applying PCA, since all their means are null. In general, the data contains a high degree of outliers, since the margins of the histograms extend far beyond the visible bulk of the data. However, these outliers will not be removed: when removal was tested, a significant percentage of the already limited fraud cases was dropped and the final scores decreased.
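As a reference, a minimal sketch of the histogram generation could look as follows, assuming the Kaggle file is named creditcard.csv and the target column is Class (names not stated above).

# Minimal sketch: load the Kaggle file and plot one histogram per feature
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('creditcard.csv')          # assumed file name
features = data.drop(columns='Class')         # assumed target column name

# Wide histogram margins compared to the visible bulk hint at outliers
features.hist(bins=50, figsize=(20, 15))
plt.tight_layout()
plt.show()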
SCALING
The dataset is partly affected by leakage between the training and testing sets, since PCA (preceded by a standardization fit) was applied to the whole dataset. This can be checked by computing the mean of each of the 28 PCA features, which is null. That is not correct: the standardization fit should always be applied after splitting into training and testing sets, and fitted only on the training set. Otherwise, if both sets are scaled together, the future training set absorbs some information leaked from the future testing set, meaning that the final model confirmation against the testing data is not performed on completely new data. That might lead to potential overfitting and worse predictions than expected on genuinely new data. For full isolation between training and testing sets, the scaling should be fitted exclusively on the training set. However, the change cannot be undone, so let's proceed with this data.
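The null means can be verified with a quick check, assuming the PCA features are named V1 to V28 as in the Kaggle file.

# Quick check of the leakage remark: the 28 PCA features have (numerically) null means
pca_columns = [f'V{i}' for i in range(1, 29)]    # assumed column names V1..V28
print(data[pca_columns].mean().round(6))          # all values are ~0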

Another remark is that PCA was applied to 28 features, leaving Time and Amount unscaled. If a tree algorithm is used, the results are not impacted by scaling, but the rest of the algorithms will be impacted if these two features are not scaled. Prior to scaling the data, let's modify the Time feature, which currently measures the time in seconds from the initial transaction to the current transaction. This feature does not look relevant because it ends up being a kind of scaled, irregular transaction counter. Instead, it makes sense to focus on the time referred to the previous transaction. For example, fraud transactions might be performed in a row or even simultaneously with some automatic process. Thus, the pandas function diff() is applied to the Time column. Overall, that change improves the final score under the particular project conditions from 0.88 to 0.95 in Logistic Regression and from 0.85 to 0.88 in Decision Tree, but it reduces it from 0.96 to 0.95 in Naive Bayes.
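A sketch of that transformation is shown below; filling the first row with 0 is an assumption, since the first transaction has no previous one.

# Replace Time (seconds since the first transaction) by the time elapsed
# since the previous transaction; the first row has no previous transaction,
# so the resulting NaN is filled with 0
data['Time'] = data['Time'].diff().fillna(0)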

Now let's scale the columns. Standardization is done independently on each feature, so it is possible to fit and transform only the features Time and Amount, leaving aside the PCA features which were already scaled by the data owner. Note the scaling is applied after splitting the data into training and testing sets with an 80-20% distribution. Scaling Time and Amount improves the score under the particular project conditions in Logistic Regression from 0.95 to 0.97, with no effect on the other two algorithms. However, it is worth highlighting that Logistic Regression improved from 0.88 to 0.97 by applying these data scrubbing techniques.
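A minimal sketch of the split and scaling could look as follows; the stratify option and random_state are assumptions, since the exact split settings are not shown above.

# Split 80-20 and scale only Time and Amount, fitting the scaler
# on the training set alone to avoid any further leakage
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = data.drop(columns='Class')
y = data['Class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

scaler = StandardScaler()
X_train[['Time', 'Amount']] = scaler.fit_transform(X_train[['Time', 'Amount']])
X_test[['Time', 'Amount']] = scaler.transform(X_test[['Time', 'Amount']])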
IMBALANCED DATA
The next plot shows the clearly marked imbalance in the dataset.
As an example of the risk behind an imbalanced dataset, let's create a dummy classifier which classifies all cases as the most frequent output (in this case, not fraud). As previously mentioned, data is split between training and testing sets with an 80-20% distribution, leading to around 230k training samples and 60k testing samples. The code would be as follows:
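(A sketch of what that code could look like with scikit-learn's DummyClassifier; the exact settings used in the project are not reproduced here.)

# Dummy classifier that always predicts the most frequent class (not fraud)
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
pred_dummy = dummy.predict(X_test)
print('Accuracy:', dummy.score(X_test, y_test))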
Reaching the following result:
Accuracy, defined as the ratio between the cases in which the model classified the sample correctly and the total number of samples, does not work as a score on an imbalanced dataset, as the previous example depicts. There are so few fraud cases that classifying all cases as not fraud leads to an accuracy of practically 1. Therefore, a different scoring method is needed when working with an imbalanced dataset. Note that imbalanced dataset exercises are always supervised; otherwise it would not be possible to know whether the dataset is imbalanced or not.

The confusion matrix is a tool which provides some helpful information about how the model works for imbalanced datasets. The purpose of the confusion matrix is to capture the four following values:
     1. Cases in which the model predicted positive when the sample was negative (false positive)
     2. Cases in which the model predicted positive when the sample was positive (true positive)
     3. Cases in which the model predicted negative when the sample was negative (true negative)
     4. Cases in which the model predicted negative when the sample was positive (false negative)
When calculating the confusion matrix for our dummy example, it is easy to detect that something is wrong in spite of an accuracy of practically 1. As the model predicts everything to be negative, there are no FP or TP. The model matches 56864 negative cases, but it misses the 98 fraud transactions.
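As a sketch, the confusion matrix of the dummy example can be obtained with the scikit-learn function confusion_matrix.

# Confusion matrix of the dummy example; with labels ordered [0, 1], the rows
# are the true classes and the columns the predictions
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, pred_dummy))
# Layout: [[TN FP]
#          [FN TP]]  ->  [[56864 0], [98 0]]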
The confusion matrix is a good visual tool to extract information about how the model performs, and from the confusion matrix data (FP, TP, FN and TN), significant parameters can be derived to characterize the model operation, such as specificity, sensitivity, precision and recall (or the f1-score, which combines recall and precision). Here, the focus will be on the true positive rate (TPR, which is the same as recall) and the false positive rate (FPR).

TPR is the ratio between the positive samples which were properly predicted and the total number of positive samples, i.e. TPR = TP / (TP + FN), and FPR is the ratio between the negative samples which were not properly predicted and the total number of negative samples, i.e. FPR = FP / (FP + TN). For example:

TPR = 1 means that all positive samples were properly predicted
TPR = 0 means that the model did not predict properly any real positive sample
FPR = 0 means that all negative samples were properly predicted
FPR = 1 means that the model did not predict properly any negative sample

This means the ideal operating point corresponds to FPR = 0 and TPR = 1.
The confusion matrix of the dummy example can be used to calculate TPR and FPR and further clarify the concepts.
TN = 56864
FN = 98
TP = 0
FP = 0

TPR = 0 / (0 + 98) = 0 --> The dummy model did not capture any of the 98 positive cases
FPR = 0 / (0 + 56864) = 0 --> The dummy model perfectly captured all 56864 negative cases

This is an extreme example, but the conclusion is that the dummy model offers perfect negative sample matching in exchange for non-existent positive sample detection. It is possible to anticipate that imbalanced classification is linked to trade-off decisions: what is the downside of predicting as negative a real positive case? And what is the downside of predicting as positive a real negative case? In this case, not detecting a positive case would imply a fraud going undetected, while flagging a real negative case as positive would imply extra cost, time and potential customer trouble to investigate a transaction only to realize everything was correct. Therefore, the trade-off between TPR and FPR is business-case related and must be assessed for each case under study.

Let's now run real supervised machine learning models, calculate the confusion matrices and compare the results.
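A minimal sketch of that step is shown below; default hyperparameters (plus max_iter and random_state) are an assumption, since the exact parametrization used in the project is not reproduced here.

# Train the three models discussed in the text and print each confusion matrix
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

models = {'Logistic Regression': LogisticRegression(max_iter=1000),
          'Naive Bayes': GaussianNB(),
          'Decision Tree': DecisionTreeClassifier(random_state=0)}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(confusion_matrix(y_test, model.predict(X_test)))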
The confusion matrices now make more sense; let's calculate the TPR and FPR for each case.

TPR Logistic = 60 / 98 = 0.61
TPR Naive Bayes = 80 / 98 = 0.82
TPR Decision Tree = 71 / 98 = 0.72
FPR Logistic = 11 / 56864 = 0
FPR Naive Bayes = 1299 / 56864 = 0.02
FPR Decision Tree = 11 / 56864 = 0

With the used model parametrization, it would look like Naive Bayes might be the best choice if an FPR of 0.02 can be accepted, which will be determined by the final application. If it cannot be accepted, the Decision Tree would appear to work better than Logistic Regression.

However, apart from the model parameters, there is an important tuning to consider: how certain is a model when classifying a sample as positive or negative? All models have a decision function and/or a predict probability function to take that decision. Let's focus on the predict probability function, which is easier to understand. This function returns an array with one row per sample and one column per class. In this case, the testing set has 56962 samples and the exercise is a binary classification. Each row sums to 1: the first column refers to the likelihood of the sample being negative and the second column to the likelihood of being positive. By default the decision threshold is 0.5, but it might be tuned under a high degree of imbalance.
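The sketch below illustrates this with predict_proba and a custom threshold; the threshold value 0.3 is purely illustrative, not the tuned one.

# predict_proba: one row per testing sample, one column per class, rows sum to 1
from sklearn.metrics import confusion_matrix

proba = models['Logistic Regression'].predict_proba(X_test)
print(proba.shape)             # (56962, 2) for this testing set

threshold = 0.3                # illustrative value, not the tuned optimum
pred_custom = (proba[:, 1] >= threshold).astype(int)
print(confusion_matrix(y_test, pred_custom))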
The above TPR/FPR values for Logistic Regression, Naive Bayes and Decision Tree refer to the default threshold of 0.5, but it would be good to sweep the threshold between 0 and 1, calculate the TPR/FPR at each step, and then plot them to assess the result visually. That is known as the receiver operating characteristic curve, more commonly called the ROC curve. The ideal operating point in a ROC curve is the top left corner, meaning TPR = 1 and FPR = 0. The ROC curve for the case under analysis is depicted below.
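A sketch of the ROC curve generation with scikit-learn's roc_curve, which performs the threshold sweep internally, could look like this (plot styling is an assumption).

# ROC curve per model: roc_curve sweeps the thresholds and returns FPR/TPR pairs
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

plt.figure()
for name, model in models.items():
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=name)
plt.plot([0, 1], [0, 1], 'k--', label='Random guess')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()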
There is one ROC curve per model considered in the analysis, and the default threshold of 0.5 is marked with a circle. The rest of the curve corresponds to different threshold values, meaning each point on a curve is feasible for that model and parametrization if the corresponding threshold is selected for predictions.

As discussed earlier, the scoring method for comparison cannot be accuracy when operating on imbalanced datasets. Instead, the most common score is AUC (area under the curve, where the curve intrinsically refers to the ROC curve). The name itself explains what AUC is, and the algorithm with the highest AUC value will likely be the best at capturing the trends and variances of the dataset under analysis. AUC is invariant to class imbalance and to the threshold, meaning that AUC aggregates all thresholds; once a model is selected for having the highest AUC, a final tuning of the optimal threshold is still required.

In our case, among the selected algorithms, Logistic Regression has the highest area under the curve, which can be confirmed with the scikit-learn function roc_auc_score.
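For example, a short sketch of that confirmation:

# AUC per model, computed from the positive-class probabilities
from sklearn.metrics import roc_auc_score

for name, model in models.items():
    scores = model.predict_proba(X_test)[:, 1]
    print(name, roc_auc_score(y_test, scores))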
Logistic Regression captures the dataset trends better than Naive Bayes and Decision Tree, so among the analyzed models, Logistic Regression would be the preferred one.

Note that AUC is a global score, so a model with a higher AUC might still work worse than a model with a lower AUC under very specific conditions. For example, if in the above ROC curve graph the requirement for FPR is to be lower than 0.6, Naive Bayes could provide TPR = 1, being better than Logistic Regression in that particular area only, but in general Logistic Regression captures the dataset trends and variances better. In the same way, if the requirement is FPR = 0, the Decision Tree would work better than Naive Bayes in spite of a worse global fit. However, following the model with the highest AUC is a good practice and leads to a promising model for the dataset under analysis.
Taking a new look at the ROC curve, there is also a triangle on each curve corresponding to the optimal threshold, maximizing the match between the dataset and the model. The method used to define the optimum is Youden's J statistic, which maximizes the difference TPR minus FPR to find the point closest to the ideal top left corner, as depicted below from Wikipedia. Another way to reach the same optimal point is to select the point with minimum distance to the top left corner, but without using the Pythagorean theorem, since that would intrinsically weight TPR and FPR differently. In both cases, the conclusion is that the optimal threshold lies around the top left corner, just before the ROC curve starts flattening.
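A minimal sketch of the Youden's J calculation for one model (Logistic Regression is used here only as an example):

# Optimal threshold per Youden's J statistic: J = TPR - FPR, maximized over
# the thresholds returned by roc_curve
import numpy as np
from sklearn.metrics import roc_curve

scores = models['Logistic Regression'].predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
best = np.argmax(tpr - fpr)
print('Optimal threshold:', thresholds[best], 'TPR:', tpr[best], 'FPR:', fpr[best])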
The optimal threshold calculation leads to a different confusion matrix and thus different TPR/FPR values, which are depicted below. In all three cases, the TPR increases in exchange for an increase in FPR too, but overall the model captures the dataset trend in a better way.
These optimal points were calculated assuming that a deviation in TPR and a deviation in FPR have the same weight, meaning that FPR = 0.1 (classifying 10% of the negative samples as positive) has the same consequences as TPR = 0.9 (classifying 10% of the positive samples as negative). However, if the weights are not equal in our application, Youden's J statistic equation can be modified to include a TPR weight, assuming the FPR weight to be 1, as depicted next.
It means that if the TPR weight is set to 2, the criticality of false negatives is twice that of false positives. Instead, if it is set to 0.5, the criticality of false positives is twice that of false negatives. The proper weight is purely dependent on the business case under analysis. For example, a false negative when classifying spam emails is not the same as a false negative when detecting a cancer disease. The TPR weight in the cancer case should be much higher.
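A sketch of the weighted version, reusing the scores from the previous snippet; the helper function name is an illustrative choice, not part of any library.

# Weighted Youden's J: J_w = w * TPR - FPR, with the FPR weight fixed to 1.
# w > 1 penalizes false negatives more; w < 1 penalizes false positives more.
def weighted_optimal_threshold(y_true, scores, tpr_weight=1.0):
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    best = np.argmax(tpr_weight * tpr - fpr)
    return thresholds[best], tpr[best], fpr[best]

for w in (2.0, 0.5):
    print(w, weighted_optimal_threshold(y_test, scores, tpr_weight=w))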

For example, the next plots show the ROC curves for a TPR weight of 2 and 0.5, respectively.
As expected, the optimal threshold for the case TPR weight = 2 is closer to the horizontal TPR = 1 line, in exchange for a significant increase in FPR. Instead, the optimal threshold for the case TPR weight = 0.5 is closer to the vertical FPR = 0 line.

For the case TPR weight = 2, the change has been more aggressive, since the TPR increase came in exchange for a significant increase in FPR. That is caused by the flattening of the ROC curve: each step increase in TPR comes in exchange for a more significant increase in FPR, so it is a trade-off up to the point where the benefit of a minor TPR increase no longer compensates the drawback of a major FPR increase. For the case TPR weight = 0.5, the change is more subtle, leading to a minor reduction in both TPR and FPR; Naive Bayes even reaches the same optimal point as the uniform case.

As discussed earlier, the criterion for making the most suitable decision about the model threshold cannot be summarized in a single equation or technique, since it is business-case dependent and will be defined according to the potential consequences of having false positives and false negatives.

A summary of the TPR/FPR for all three models, assessed with the default threshold and with the optimized thresholds for the different weights, is depicted below.
The next step would be to search for the optimal model tuning, as well as to test different algorithms and apply a more robust validation with the cross validation technique. That work will be captured in the second part of this project, which will be covered in a different publication (see the introduction).
I appreciate your attention and I hope you find this work interesting.

Luis Caballero