Prediction of Car Loan Default Results Based on Multi-Model Fusion

Abstract: With the prosperity and development of the asset management industry and of various financial derivatives, many micro-loans and online loans have gradually entered the public view. How to predict the default probability of customer loans is a hot topic in the market. In this paper, the authors collect the data profiles of more than 10,000 car loan borrowers and fit a fusion of four models (logistic, decision tree, random forest, and KNN) to the data, examining the behavioral data of borrowers to predict whether they will default in the future and to find the threshold that minimizes cost. The findings indicate that the final prediction can reduce costs by 38.9%. This result suggests that the model can be applied in the real market to help lending institutions predict default outcomes and, guided by the model coefficients, formulate strategies to avoid default risk when evaluating borrowers.


Literature Review
With the deep development of big data and programming tools, classification and regression models are used to predict credit crises [1]. Galindo used machine learning to study credit risk and concluded that, among decision trees, neural networks, and k-nearest neighbor methods, the decision tree performed best in terms of classification results for credit default prediction [2]. Malekipirbazari et al. conducted an empirical study on the Lending Club dataset and concluded that, in predicting borrower defaults, the random forest model outperformed FICO scores and Lending Club's own credit rating methodology [3]. Some scholars have improved loan default prediction by fusing different models [4,5,6], including GBDT, decision tree, XGBoost, and other models. All of the above studies provide the basis for this paper to integrate four models (logistic, decision tree, random forest, and KNN) to predict car loan default probability. The results are presented in the following sections.
Related models and methods: a random forest is a classifier that uses multiple decision trees to train on and predict samples. During classification, for each node of a base decision tree, a subset containing k attributes is first randomly selected from that node's attribute set, and then an optimal attribute is selected from this subset for splitting; the parameter k controls the degree of randomness and is generally set to k = log2(d) for d attributes [7]. In a random forest, each decision tree is trained independently, and the final classification result is determined by a plurality vote over the results of the individual trees.
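The per-node attribute subsetting described above can be illustrated with a minimal sketch (the function name and feature names are invented for illustration; only the k = log2(d) rule comes from the text):

```python
import math
import random

def candidate_attributes(attributes, k=None):
    """At each node, a random forest considers only a random subset of
    k attributes when choosing the best split; k = log2(d) is the
    common default mentioned above."""
    d = len(attributes)
    if k is None:
        k = max(1, int(math.log2(d)))
    return random.sample(attributes, k)

features = [f"x{i}" for i in range(8)]
print(candidate_attributes(features))  # 3 of the 8 feature names, since log2(8) = 3
```

Because each tree sees a different random subset at every split, the trees are decorrelated, which is what makes the plurality vote effective.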

Introduction: Business Goal
Lending is an essential business for many financial institutions. Revenue can be earned from the commission fee (interest) on any outstanding loan, while a loss is incurred if a loan defaults. Thus, whether to approve a loan of a certain amount for a certain customer becomes important for such institutions to maximize profit.
We hope to build a machine learning model that predicts whether a customer will default in the future, given some key information, so that the institution can decide whether to approve or reject a loan application according to the prediction (approve if non-default is predicted, and reject otherwise).
Our models aim to make predictions that maximize profit. We would also like to conduct further analysis of our models to gain insight into feature importance and data selection. Based on our findings, we attempt to propose some explanations and business suggestions for the financial institution.

Dataset Description
Source: https://www.kaggle.com/saurabhbagchi/dishnetwork-hackathon?select=Train_Dataset.csv
Label: 0 (not default), 1 (default); the distribution is shown in the pie chart (Exhibit 1).
Features: there are 38 features (meanings shown in the data dictionary/progress report) with different missing percentages as listed below (Exhibit 2). Considering the large missing proportions and each feature's contribution, we dropped some variables at the beginning, such as 'Own_House_Age' and 'Social_Circle_Default'. As 'Score_Source_1' is relatively highly correlated with the label, we decided to keep it for now.
Constructing a new dataset: we found that 'default' records account for around 8.08% of the raw data (Exhibit 4). To help the model better recognize default patterns, we dropped the non-default records with 'NA' features and obtained a dataset with 80.37% defaults (12,249 records in total).

Process Methodology
[please refer to Data _Processing_Version1 and Data Processing_Version2(D)] We build each type of model on both datasets: (1) the original one with all features: train; (2) the one with non-default NA records dropped: train_1.

Cost and Benefit Analysis
Assumptions:
Commission fee: 6% (earned on non-default customers)
Exposure at Default (EAD), i.e., the loan amount per account: the median of the credit amount
Loss Given Default (LGD): 55% (referring to the HKMA's 45% recovery rate for unsecured loans)
The financial institution will reject all customers with a 'Default' prediction.
Opportunity cost: Cost = #(False Positive) * Commission fee * Loan Amount + #(False Negative) * EAD * LGD
Goal: As the financial institution would like to maximize its profits from car loans, we will select the model threshold that minimizes the opportunity cost at a later stage.
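The cost formula above can be expressed as a small helper. The confusion counts and the loan amount in the example call are hypothetical; only the 6% commission and 55% LGD come from the assumptions listed:

```python
def opportunity_cost(fp, fn, loan_amount, commission=0.06, lgd=0.55):
    """Cost of wrong decisions under the report's assumptions:
    fp = non-default customers rejected (lost commission revenue);
    fn = defaulters approved (EAD * LGD loss, with EAD = loan amount)."""
    return fp * commission * loan_amount + fn * loan_amount * lgd

# Hypothetical confusion counts and a hypothetical median loan amount.
print(opportunity_cost(fp=100, fn=50, loan_amount=500_000))  # about 16.75 million
```

Note the asymmetry the formula encodes: a missed defaulter (fn) costs roughly nine times as much as a wrongly rejected good customer (fp), which is why threshold tuning matters later.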

Decision Tree Model
Source: train_cleaned.csv ("train"), train_1_cleaned.csv ("train_1"). We prepared two datasets and decided to build one model on each; we would then compare the two models and adopt the better-performing one. The 'train' dataset is the original one containing all the features, while 'train_1' is the one whose non-default examples with "NA" values have been dropped.
Specific Data Processing: Discretization: as decision tree models tend to perform better on categorical data, we examined some of the numerical features. For two features, Age_days and Credit_Amount, we conducted discretization and converted them to dummies. Intuitively, these features are likely very important for prediction and are common targets for discretization in practice; statistically, both have long-tail issues, so we wanted to process them further.
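The discretize-then-dummy step can be sketched with pandas. The bin edges and sample values below are illustrative assumptions, not the report's actual cut points:

```python
import pandas as pd

# Hypothetical ages in days; the bin edges are illustrative, not the
# report's actual cut points.
df = pd.DataFrame({"Age_Days": [8000, 12000, 16000, 20000, 24000]})
df["Age_Bin"] = pd.cut(df["Age_Days"],
                       bins=[0, 10950, 18250, 36500],  # ~30 / ~50 / ~100 yrs
                       labels=["young", "middle", "senior"])
dummies = pd.get_dummies(df["Age_Bin"], prefix="Age")
print(list(dummies.columns))  # ['Age_young', 'Age_middle', 'Age_senior']
```

`pd.cut` handles the long-tail issue by capping extreme values inside the last bin, and the resulting dummies are directly usable by tree splits.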
Hyper-tuning: our major target in hyper-tuning the decision tree is to mitigate over-fitting. We therefore picked 2 hyperparameters, 'max_depth' and 'min_samples_split', and performed a grid search with 10-fold cross-validation (cv=10).
As the major target of this model is to maximize profit, we care about the true positive rate, the false positive rate, and AUC more than accuracy alone, so when doing the grid search we set the scoring to ROC-AUC. We eventually adopted the best output parameters: max_depth=7 and min_samples_split=72 for DT_Model (train), and max_depth=8 and min_samples_split=72 for DT_Model_1 (train_1). Model Training: we built our model on two datasets, train_1_cleaned (train_1_df) and train_cleaned (train_df), and would decide which one to use according to their performance on the same test set. We split both train_1 and train into training and test sets.
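The grid search above can be sketched as follows. This is only a sketch: synthetic imbalanced data stands in for the car-loan features, and cv=5 replaces the report's cv=10 for speed; the parameter grid and ROC-AUC scoring mirror the report:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the car-loan data (~8% positives).
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.92], random_state=0)
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [5, 7, 8],
                                "min_samples_split": [32, 72]},
                    scoring="roc_auc", cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

Capping `max_depth` and raising `min_samples_split` both prune the tree, which is exactly the over-fitting control the tuning targets.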
Considering that we would need a cost-benefit analysis to determine an optimal threshold, we further split the training set of train into a sub-train set and a validation set (Exhibit 5). Due to the very poor accuracy of DT_model_1, we decided to abandon that model. We also noticed that it suffered from an inadequate data size (number of examples) relative to the number of features.
Our next attempt was to build a decision tree model directly on train_df (the dataset from train_cleaned.csv).
Recall that we previously split train_df into three parts: (1) sub-train set, (2) validation set, (3) test set. We built a second decision tree model, DT_model, on the sub-train set, then evaluated it on the test set of train_df and obtained a high accuracy of 0.92. Considering the high accuracy and an acceptable AUC above 0.5, we decided to adopt DT_model.
Threshold adjustment based on cost-benefit analysis: after obtaining DT_model, we performed a cost-benefit analysis on our validation set to find the optimal decision threshold. The optimal threshold (for a '0' prediction) was found to be 0.93. The 'Cost Curve' and 'Cost Comparison' are shown below. The cost incurred by a majority classifier (predicting all to be non-default) is about 43.9 million ("Total Cost"); the cost decreased by 23.95% after prediction with the 0.93 threshold. Evaluation: we then checked the model with the adjusted decision threshold on the test set of train_df, with the following results: accuracy 0.6861; cost reduction 22.65% compared with the majority classifier. Regarding the two decision tree models we built, DT_model_1 (non-default NA dropped) and DT_model, although we eventually adopted DT_model, we believe the feature ranking of DT_model_1 may still offer insights into the identification of default.
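The threshold search behind the cost curve can be sketched as a simple sweep. Everything in the example call (probabilities, labels, loan amount) is a toy illustration; only the 6% commission, 55% LGD, and the cost formula come from the report:

```python
def sweep_thresholds(p_nondefault, y_true, loan_amount,
                     commission=0.06, lgd=0.55):
    """Scan candidate '0'-prediction thresholds and return the one
    minimizing opportunity cost. y_true uses 1 = default; a borrower
    is predicted non-default only when the model's P(non-default)
    reaches the threshold t."""
    best = None
    for t in [x / 100 for x in range(50, 100, 5)]:
        fp = fn = 0
        for p, y in zip(p_nondefault, y_true):
            pred_default = p < t
            if pred_default and y == 0:
                fp += 1                      # good customer rejected
            elif not pred_default and y == 1:
                fn += 1                      # defaulter approved
        cost = fp * commission * loan_amount + fn * loan_amount * lgd
        if best is None or cost < best[1]:
            best = (t, cost)
    return best

# Toy probabilities and labels, purely illustrative.
probs = [0.95, 0.85, 0.60, 0.99, 0.40]      # P(non-default) per applicant
labels = [0, 0, 1, 0, 1]
t, cost = sweep_thresholds(probs, labels, loan_amount=500_000)
print(t, cost)
```

Because a missed defaulter costs far more than a rejected good customer, the sweep typically pushes the '0'-prediction threshold well above 0.5, matching the high thresholds found in the report.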
As for DT_model, although the final model has a reasonable accuracy (0.6861) and cost reduction (22.65%), its AUC is still not ideal, being only slightly above 0.5. It still needs improvement.

Logistic Model
Data Source: train_cleaned.csv ("train"), train_1_cleaned.csv ("train_1"). Hyper-tuning: after searching related research on hyperparameters for logistic models, we found that the hyperparameters with the greatest influence are the penalty type ('l1' or 'l2') and the value of C (which determines the strength of the penalty). We restricted these two parameters to penalty = ['l1', 'l2'] and C = [100, 10, 1.0, 0.1, 0.01], then used GridSearchCV to try the combinations one by one. We obtained the best values and redefined our model based on them: {'C': 100, 'penalty': 'l1'} and {'C': 0.01, 'penalty': 'l1'} in our two trials, respectively.
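The search above can be sketched as follows, with synthetic data standing in for the loan features and cv=5 assumed; the parameter grid is the report's. One practical detail worth noting: in scikit-learn the 'l1' penalty requires a compatible solver such as liblinear:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Illustrative search over the same two hyperparameters tuned in the
# report; liblinear is chosen because it supports both penalties.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
grid = GridSearchCV(LogisticRegression(solver="liblinear", max_iter=1000),
                    param_grid={"penalty": ["l1", "l2"],
                                "C": [100, 10, 1.0, 0.1, 0.01]},
                    scoring="roc_auc", cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

Small C means strong regularization, so the two winning configurations {'C': 100, 'penalty': 'l1'} and {'C': 0.01, 'penalty': 'l1'} differ mainly in how aggressively coefficients are shrunk toward zero.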

Model Training and Threshold Selection
We built our models mainly on two datasets, train_1_cleaned (train_1_df) and train_cleaned (train_df), and would decide which one to use according to their performance evaluated on the same test set.
We first built a logistic model on the train_1_cleaned set and used it to fit all the train data (without dropping the non-default items) to check accuracy. We found that the accuracy of this model was rather low on the whole train data (Exhibit 5), although it performed well on train_1_cleaned, so our first attempt failed. Our next attempt was to build a model directly on the whole processed train dataset, train_cleaned.csv. We split this dataset into a training set and a test set, further split the training set into a train set and a validation set (used later for cost analysis), then built the model on the train set and evaluated it on the test set. This time the accuracy of our model was quite high (0.919867), and the AUC (0.500804) also passed 0.5, which shows the usefulness of our second trial model.
Threshold Selection: after obtaining the model, we performed a cost-benefit analysis on the validation set. The optimal threshold (for a '0' prediction) was found to be 0.9. The 'Cost Curve' and 'Cost Comparison' are shown below. The cost on the validation set incurred by a majority classifier (predicting all to be non-default) is about 50.7 million ("Total Cost"). The cost decreased by 0.2% with our model without threshold adjustment (threshold 0.5) and by 27.06% after prediction with the 0.9 threshold.

Evaluation
We applied the model and the optimal threshold on the test set and got the following results: the cost is reduced by about 26.6% on the test set after applying the model.

Conclusion
Although our logistic model has fairly high accuracy after threshold adjustment (0.7507) and helps reduce costs by about 26.6%, its AUC (0.500804) is still not as satisfactory as we expected. So we kept trying other models in search of more precise results.

KNN Model
Data Source: train_cleaned.csv ("train"), train_1_cleaned.csv ("train_1"). Hyper-tuning: in this model, we used two parameters [neighbor count (k), distance type (p)] and weighted neighbors by distance. As we care more about the positive labels, we used AUC as the performance index in the tuning stage. For the train_1 data, 15 combinations of the two parameters were tried with cross-validation; for the train data, 11 combinations were tried. Among them, the relative AUC maximum is achieved when k = 33 or 35 and p = 1 for train_1, and when k = 71 and p = 1 for train.
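The k/p tuning above can be sketched as follows, with synthetic data and cv=5 assumed; the distance weighting, the AUC scoring, and the candidate k and p values mirror the report:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Illustrative search over the two parameters tuned in the report:
# neighbour count k and Minkowski distance order p (p=1 Manhattan,
# p=2 Euclidean), with distance-weighted voting and AUC scoring.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
grid = GridSearchCV(KNeighborsClassifier(weights="distance"),
                    param_grid={"n_neighbors": [33, 35, 71],
                                "p": [1, 2]},
                    scoring="roc_auc", cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

The preference for p = 1 found in the report means Manhattan distance separated defaulters better than Euclidean distance on these features.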
Model Training and Threshold Selection: we trained two models on the two datasets; their performance is shown below. After obtaining the model, we performed a cost-benefit analysis on the validation set. The optimal threshold (for a '0' prediction) was found to be 0.9. The 'Cost Curve' and 'Cost Comparison' are shown below (the cost graph under different thresholds, and the cost comparison among the majority prediction, the prediction without threshold adjustment, and our ideal 0.9 threshold). The cost incurred by a majority classifier (predicting all to be non-default) is about 43.9 million ("Total Cost"). The cost decreased by 6.09% with the default threshold, and further decreased by 14.81% with the 0.9 threshold.

Evaluation
We applied the model and the optimal threshold on the test set and got the following results:

The AUC & cost-reduction results on the test set based on the KNN model
The cost is reduced by about 23.26% after applying the model. The KNN model can fit the data well but also has limitations (e.g., when the actual business data volume and feature count are large).

Summary of Performance
We have tried 3 different types of models, namely the decision tree model, the logistic model, and the KNN model, and obtained 3 evaluations based on them.
Firstly, for the decision tree model, after hyperparameter tuning the AUC of the finalized model is 0.525804 on our separate test set. After the cost analysis, we used a threshold of 0.93, under which the cost can be reduced by 22.98% on the test set compared with the majority classifier (predicting all to be non-default).
Secondly, for the logistic model, after hyperparameter tuning the AUC of the finalized model is 0.500804 on our separate test set. After the cost analysis, we used a threshold of 0.9, under which the cost can be reduced by 26.6% on the test set compared with the majority classifier.
Lastly, for the KNN model, after tuning the values of k and p, the AUC of the finalized model is 0.704615 on our separate test set. After the cost analysis, we used a threshold of 0.9, under which the cost can be reduced by 23.26% on the test set compared with the majority classifier.
In conclusion, the KNN model predicted default best overall across thresholds, with the highest AUC of 0.704615 on the test set, but it did not do as well in the cost analysis under the specific 0.9 threshold. The logistic model did best in reducing cost at its ideal threshold, leading to the biggest decrease among the 3 models (26.6%), but its AUC is not as high.
So we can further ensemble the models to see whether we can achieve a model that does well both in prediction and in cost reduction.

Ensemble Model
We incorporated the 3 models elaborated above and tried 2 other models (random forest and adaptive boosting) to collectively predict 'default'. Results show that random forest works well on the data while adaptive boosting does not. Thus, we discarded the adaptive boosting model and used a majority-voting classifier to build the ensemble model in 2 versions. We also explored how it performs with and without the prediction threshold (0.9) on the test set. The test results for prediction with the 0.9 threshold (shown below) are much better than the default prediction.
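The two voting rules compared in the versions below can be stated precisely with a small sketch (the function names are invented; the rules themselves are the report's "majority prediction" and "0 only if all models predict 0"):

```python
def majority_prediction(preds):
    """Version-1 style rule: predict default (1) when at least half of
    the models vote default."""
    return 1 if sum(preds) >= len(preds) / 2 else 0

def all_clear_prediction(preds):
    """Version-2 'prediction 2' rule: predict non-default (0) only when
    every model predicts non-default."""
    return 1 if any(preds) else 0

votes = [1, 0, 0]                     # e.g. only one model flags default
print(majority_prediction(votes))     # 0
print(all_clear_prediction(votes))    # 1
```

The second rule is deliberately conservative: it rejects a loan whenever any model flags default, trading more false positives for fewer missed defaulters.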

Version 1: KNN, Logistic, Decision Tree, Random Forest
Prediction 1 (Majority Prediction)
Cost comparison table between the 2 prediction methods with our 0.9 ideal threshold for fusion model version 1
We included all the models at first and tried two types of prediction (vote share > 0.5 or >= 0.5). Results showed that prediction 2 works better and can reduce costs by 35.98%.

Version 2: KNN, Logistic, Random Forest
Prediction 1 (Majority Prediction); Prediction 2 (0 if all 3 models predict 0)
Cost comparison table between the 2 prediction methods with our 0.9 ideal threshold for fusion model version 2
We excluded the decision tree in this version because of its relatively poor individual performance. As we found that the separate models fail to predict more 'default' than 'non-default', we tried to recognize all the default predictions (prediction 2). After comparing both, we found that majority prediction works better and can reduce costs by 38.9%.

Conclusion
The ensemble model can greatly improve prediction quality and boost cost reduction compared with a single model. Among all the combinations we tested, the ensemble model (KNN, logistic, and random forest) with majority prediction and the 0.9 threshold performs best.

Business Application
Data Collection Advice
i. Logistic: the coefficients and intercept are shown in the table below (table of coefficients and intercept of our logistic model). As the table shows, the 5 features with the highest absolute coefficients (indicating a strong impact of the feature on the probability of the prediction being 1, i.e., default) are 'Mobile_Tag' (mobile number provided by the client; 1 means yes and 0 means no), 'Score_Source_2' (the second trust score from a third party), 'Score_Source_1_z' (z-score of the first trust score from a third party), 'Score_Source_3_z' (z-score of the third trust score from a third party), and 'Client_Education_Junior secondary' (whether the client has finished junior secondary education).
ii. Decision Tree: the 10 most important features are shown below (top 10 features by importance in our decision tree model). Notably, the first 5 nodes in the decision tree are 'Score_Source_3_z', 'Score_Source_2', 'Score_Source_1_z', 'Employed_Days_z' (z-score of the number of days before the application since the client started earning), and 'Client_Education_Secondary', which show large overlap with those in our logistic model. We can therefore conclude that the Score 1, 2, and 3 features and whether a client has finished junior secondary education are key features in predicting default status. Special attention should be paid to these features when collecting data, for example by setting them as required fields when collecting client information during the car loan application.
Business Strategy: we suggest the financial institution reject car loan applications predicted to be 'Default'. For applications with a 90%-95% probability of not defaulting, the FI can consider setting limits on the loan amount.
After the application stage, the FI can consider setting a 6- to 12-month observation window on customer behavior and collecting data such as 'Days Past Due' to adjust the strategy accordingly. In our original expectation, we thought models built on train_1 would better capture the essential characteristics of default cases, which we care about most for profit reasons. However, although models built on train_1 tend to have a higher AUC in general, they perform too poorly in terms of accuracy. This may be because the train_1 dataset has a very different default/non-default proportion from reality, so there would be accuracy pitfalls when using the model to predict real data.

Reflections
ii) Possible unknown interactions between features: there may exist interaction effects between features of which we are currently unaware. Exploring and utilizing those relationships may benefit model building.
b) KNN model in practice: though the KNN model individually performed best among all the models in terms of AUC, it may not be very practical for financial institutions to use in real life, as it is time-consuming and takes up too much computational power when the volume of data to be processed is huge.
c) Underlying logic and reasons for feature importance: we can only obtain importance scores for features in the decision tree and logistic models, but we are unaware of the underlying reason a feature is important. For example, 'Mobile_Tag' (mobile number provided by the client; 1 means yes and 0 means no) ranks top among the logistic model features. The real reason for its high predictivity could be that whether a mobile phone number can be provided reflects the stability and financial state of a customer; we might then also collect more information about the stability of a customer applying for a loan. If we dig further into such underlying reasons, we may collect data more effectively and achieve better prediction results.
d) Sensitivity of the cost analysis: given the practice of selecting an optimal decision threshold based on the cost analysis, the sensitivity of that analysis should be considered. Changes in parameters such as the average exposure or the commission rate may affect the result of the analysis and change the optimal model.

Summary
Given the characteristics of our data (a large size with a great number of features, and imbalanced overall), we applied 2 different data processing methods (dropping the non-default NA records or keeping them) and obtained 2 datasets.
With our existing skill sets, we built the individual models described above and selected 3 of them to adopt based on their performance in terms of accuracy, confusion matrix, and AUC. Keeping in mind our target of making predictions that maximize profit, we performed cost analysis and adjusted the threshold to be optimal for each adopted model.
According to the feature importance rankings of the decision tree model and the logistic model, we conclude that the Score 1, 2, and 3 features and whether clients have finished junior secondary education have the highest predictivity for loan default results. We suggest the institution source the above 4 features (e.g., from third parties) and pay more attention to them in its data collection.
We then went further to incorporate the 3 models and tried 2 other models (random forest and adaptive boosting) to build ensemble models with different combinations. We also tried out 2 rules: 'predict default when the majority predicts default' and 'predict non-default only when all models predict non-default'. Eventually, we conclude that the ensemble model (KNN, logistic, and random forest) with majority prediction and the 0.9 threshold performs best.