Titanic Survival Prediction with Enhanced Random Forests
DOI: https://doi.org/10.54097/tfxr3t12

Keywords: Titanic dataset, Random Forest, Feature Weighting, Class Imbalance, Bayesian Optimization

Abstract
This paper proposes an enhanced random forest framework designed to address key challenges in the Titanic survival prediction task, including class imbalance, feature heterogeneity, and limited sample size. The method integrates an entropy-based adaptive feature weighting mechanism to amplify the influence of critical socio-demographic features—such as gender and passenger class—during decision tree splits, thereby improving split reliability and model interpretability. To mitigate bias arising from the underrepresentation of survivors (minority class), SMOTE is employed to synthetically balance the training data. Furthermore, Bayesian optimization is utilized for efficient and robust hyperparameter tuning, enhancing generalization performance. Extensive experiments on the Kaggle Titanic dataset demonstrate that the proposed approach consistently outperforms a range of baselines—including logistic regression, SVM, standard random forests, XGBoost, and MLP—in terms of accuracy, recall, F1-score, and AUC. Ablation studies confirm the complementary contributions of each component, while error analysis reveals systematic misclassifications in specific subgroups (e.g., male third-class passengers), offering insights into model behavior and limitations. The framework not only achieves superior predictive performance but also improves fairness and stability, presenting a principled and extensible solution for classification tasks on small, imbalanced, and heterogeneous tabular datasets.
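The paper's own implementation is not reproduced here, but the pipeline it describes (class rebalancing with SMOTE, a random forest classifier, and Bayesian hyperparameter search) can be sketched with standard tooling. The snippet below is a minimal, hypothetical illustration, assuming the Kaggle `train.csv` file, the imbalanced-learn and scikit-optimize libraries, and mutual-information scores as a rough stand-in for the entropy-based adaptive feature weighting, which is not available in off-the-shelf random forests.

```python
# Hypothetical sketch of the pipeline summarized in the abstract:
# SMOTE oversampling + Bayesian hyperparameter search over a random forest.
# The entropy-based adaptive feature weighting is only approximated here by
# mutual-information scores, since weighted-split forests are not in scikit-learn.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from skopt import BayesSearchCV
from skopt.space import Integer, Real

# Load the Kaggle Titanic training split (path is illustrative).
df = pd.read_csv("train.csv")
X = pd.get_dummies(df[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]])
X = X.fillna(X.median())
y = df["Survived"]

# Proxy for the entropy-based weighting: mutual information between each
# feature and the survival label, normalized to sum to one.
mi = mutual_info_classif(X, y, random_state=0)
feature_weights = mi / mi.sum()
print(dict(zip(X.columns, np.round(feature_weights, 3))))

# SMOTE sits inside the pipeline so oversampling is fit only on training folds.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Bayesian optimization of the forest's main hyperparameters.
search = BayesSearchCV(
    pipe,
    {
        "rf__n_estimators": Integer(100, 600),
        "rf__max_depth": Integer(3, 12),
        "rf__min_samples_leaf": Integer(1, 10),
        "rf__max_features": Real(0.3, 1.0),
    },
    n_iter=32,
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
print("best F1:", search.best_score_, "params:", search.best_params_)
```

Placing SMOTE inside the cross-validation pipeline (rather than resampling the full dataset up front) keeps synthetic samples out of the validation folds, so the reported F1 reflects performance on the original class distribution.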
License
Copyright (c) 2026 Academic Journal of Science and Technology

This work is licensed under a Creative Commons Attribution 4.0 International License.