Predicting Loan Default: A Comparative Analysis of Multiple Machine Learning Models
DOI: https://doi.org/10.54097/10dk2m95
Keywords: Machine learning, feature importance, financial debt.
Abstract
Financial decision-making, particularly loan approval, depends on accurate risk prediction. To improve the accuracy of loan default prediction, this study compares four machine learning models: Logistic Regression, XGBoost, an Artificial Neural Network (ANN), and a hybrid of XGBoost and Logistic Regression (XGB+LR). The models were selected for their differing capacities to capture complex patterns and relationships in the data. They were trained and validated on a dataset prepared with one-hot encoding, feature selection, and scaling, and their hyperparameters were tuned extensively. Each model was evaluated with two established metrics, the Area Under the ROC Curve (AUC) and Accuracy (ACC). XGBoost performed best, achieving an AUC of 0.798 and an ACC of 0.861 on the validation set. A feature importance analysis based on the XGBoost model further showed that Credit_Score and Loan_Amount were the primary factors influencing loan approval decisions. Although the models showed slight overfitting, the results confirm the potential of machine learning to improve financial decision-making. Future work may include advanced regularization techniques, further hyperparameter optimization, and a broader feature set.
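The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the study's code: the function names `accuracy` and `roc_auc`, the 0.5 decision threshold, and the labels and scores are all assumptions made for illustration. AUC is computed here as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_score: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in y_score]
    correct = sum(int(p == t) for p, t in zip(predictions, y_true))
    return correct / len(y_true)


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for s, t in zip(y_score, y_true) if t == 1]
    negatives = [s for s, t in zip(y_score, y_true) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = default) and predicted default probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

print(f"AUC = {roc_auc(y_true, y_score):.3f}")
print(f"ACC = {accuracy(y_true, y_score):.3f}")
```

The pairwise loop is quadratic in the number of samples, which is fine for a toy example; on a real validation set a rank-based implementation such as the one in a standard library would be the more practical choice. Computing both metrics on the training and validation sets and comparing them is one simple way to observe the kind of slight overfitting the abstract mentions.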
License
Copyright (c) 2024 Highlights in Science, Engineering and Technology

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.







