Predicting Loan Default: A Comparative Analysis of Multiple Machine Learning Models

Authors

  • Yuelin Jiang

DOI:

https://doi.org/10.54097/10dk2m95

Keywords:

Machine learning, feature importance, financial debt.

Abstract

Financial decision-making, particularly loan approval, depends on accurate risk prediction. To improve prediction accuracy, this study compares four machine learning models: Logistic Regression, XGBoost, an Artificial Neural Network (ANN), and a hybrid of XGBoost and Logistic Regression (XGB+LR). The models were selected for their differing capacities to capture complex patterns and relationships in the data. They were trained and validated on a dataset prepared with one-hot encoding, feature selection, and scaling, and their hyperparameters were tuned extensively. Each model was evaluated using the Area Under the ROC Curve (AUC) and Accuracy (ACC). XGBoost performed best, achieving an AUC of 0.798 and an ACC of 0.861 on the validation set. A feature importance analysis based on the XGBoost model identified Credit_Score and Loan_Amount as the primary factors impacting loan approval decisions. Although the models showed slight overfitting, the results confirm the potential of machine learning to improve financial decision-making. Future work may include advanced regularization techniques, further hyperparameter optimization, and a broader feature set.
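The two evaluation metrics named in the abstract answer different questions, and a small worked example may help make that concrete. The sketch below is not the author's code: it is a minimal pure-Python illustration in which the function names `accuracy` and `roc_auc`, the toy labels and scores, and the 0.5 decision threshold are all assumptions introduced here. ACC is computed as the fraction of cases whose thresholded score matches the label, and AUC is computed from its pairwise definition, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_score: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of cases where the thresholded score equals the label.

    The 0.5 default threshold is an assumption; the abstract does not state one.
    """
    if len(y_true) != len(y_score) or len(y_true) == 0:
        raise ValueError("y_true and y_score must be non-empty and equal in length")
    correct = sum(int(s >= threshold) == y for y, s in zip(y_true, y_score))
    return correct / len(y_true)


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC via its pairwise definition (ties between scores count as 0.5).

    This O(n_pos * n_neg) loop is for illustration only; library
    implementations use a sort-based method that scales better.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = default, 0 = no default.
labels = [0, 0, 1, 0, 1, 0, 0, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05, 0.60, 0.70]

acc = accuracy(labels, scores)   # 6 of 8 correct at threshold 0.5
auc = roc_auc(labels, scores)    # 13 of 15 positive/negative pairs ranked correctly
print(f"ACC = {acc:.3f}, AUC = {auc:.3f}")
```

Because ACC depends on the chosen threshold while AUC depends only on how the scores rank the cases, the two values need not agree, which is one reason to report both.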

Published

13-03-2024

How to Cite

Jiang, Y. (2024). Predicting Loan Default: A Comparative Analysis of Multiple Machine Learning Models. Highlights in Science, Engineering and Technology, 85, 169-175. https://doi.org/10.54097/10dk2m95