Automatic Classification for Unlabeled Email Messages into Folders

Fucheng Zhu

doi:10.54097/hset.v34i.5432

Keywords:

Email Classification, Unsupervised Learning, Natural Language Processing, Vector Space Model.

Abstract

Imagine returning from an excused absence caused by Covid-19 or a similar force majeure event and immediately facing 300+ unread emails; being overwhelmed by email has become part of office workers’ daily routine. Numerous studies have demonstrated effective methods to categorize email messages, detect potential harassment, and even send replies automatically. Still, email remains an interesting type of text to analyze and gives rise to many challenges. After first discussing the challenges in the problem, this paper proposes a method that addresses one specific challenge: creating folders from incoming email messages and then classifying emails into those folders automatically. By combining basic methods, techniques, and algorithms, an intuitive program is developed that performs the task on a given public email dataset. The method is expected to open prospects for future investigation and for improvements in performance and robustness.
