MBTI Personality Prediction Based on BERT Classification

Authors

  • Hanwen Zhang

DOI:

https://doi.org/10.54097/hset.v34i.5497

Keywords:

BERT Classification, Logistic Regression, TF-IDF Matrix, NLP, MBTI.

Abstract

Young people today tend to express their feelings and socialize online rather than in person, which makes social media a practical source for inferring personality, since what people write often reflects who they are. Predicting personality from posts is a challenging task that requires large quantities of data processing and modeling. This paper compares two word-embedding methods, BERT and a TF-IDF vectorizer, together with three classifiers, Logistic Regression, K-Nearest Neighbors, and Random Forest, to identify the best-performing approach for this task. With BERT, the state-of-the-art method for most Natural Language Processing (NLP) tasks, Logistic Regression performs best, with an average accuracy of 87 percent.
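The TF-IDF baseline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the four toy posts and their introvert/extravert labels are hypothetical, whereas the paper trains on the Kaggle MBTI dataset cited in the references.

```python
# Sketch of one baseline from the abstract: TF-IDF features fed into
# Logistic Regression. Toy corpus and labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "I love quiet evenings reading alone",
    "Big parties give me so much energy",
    "I recharge by spending time by myself",
    "Meeting new people every day is the best",
]
labels = ["I", "E", "I", "E"]  # introvert/extravert axis only, for illustration

# TfidfVectorizer builds the TF-IDF matrix; LogisticRegression classifies it.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, labels)

pred = model.predict(["a calm night alone with a book"])
print(pred[0])
```

In the paper's full setting, each of the four MBTI axes (I/E, N/S, T/F, J/P) would be predicted, and the TF-IDF features could be swapped for BERT embeddings, the configuration the abstract reports as best.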


References

The Myers & Briggs Foundation. 2019. MBTI® Basics. https://www.myersbriggs.org/my-mbti-personality-type/mbti-basics/

NERIS Analytics Limited. 2013. Our Framework. https://www.16personalities.com/article/our-theory

M. Farouk Radwan, MSc. 2022. How the words people say reflect their personalities. https://www.2knowmyself.com/How_the_words_people_say_reflect_their_personalities

Rajaraman, A.; Ullman, J.D. (2011). "Data Mining". Mining of Massive Datasets. pp. 1–17. doi:10.1017/CBO9781139058452.002. ISBN 978-1-139-05845-2.

Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2.

Cramer, J. S. (2002). The origins of logistic regression (Technical report). Vol. 119. Tinbergen Institute. pp. 167–178. doi:10.2139/ssrn.360300

Altman, Naomi S. (1992). "An introduction to kernel and nearest-neighbor nonparametric regression" (PDF). The American Statistician. 46 (3): 175–185. doi:10.1080/00031305.1992.10475879. hdl:1813/31637

Ho, Tin Kam (1995). Random Decision Forests (PDF). Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. pp. 278–282. Archived from the original (PDF) on 17 April 2016. Retrieved 5 June 2016.

Tobias Bornheim, Niklas Grieger, and Stephan Bialonski. FHAC at GermEval 2021: Identifying German toxic, engaging, and fact-claiming comments with ensemble learning. In Proc. GermEval 2021 Workshop on Identification of Toxic, Engaging, and Fact-Claiming Comments: 17th KONVENS 2021, pages 105–111, Online (2021).

Branden Chan, Stefan Schweter, and Timo Möller. 2020. German’s Next Language Model. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6788–6796, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Mikolov, T., Chen, K., Corrado, G.S., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR.

Kaggle.com. 2022. (MBTI) Myers-Briggs Personality Type Dataset. https://www.kaggle.com/datasets/datasnaek/mbti-type

Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze (2008). Introduction to Information Retrieval. Cambridge University Press. p. 27.

Mannor, Shie; Peleg, Dori; Rubinstein, Reuven (2005). "The cross entropy method for classification". In Proceedings of the 22nd International Conference on Machine Learning. pp. 561–568. doi:10.1145/1102351.1102422.


Published

28-02-2023

How to Cite

Zhang, H. (2023). MBTI Personality Prediction Based on BERT Classification. Highlights in Science, Engineering and Technology, 34, 368–374. https://doi.org/10.54097/hset.v34i.5497