A survey of text classification: problem statement, latest methods and popular datasets

Authors

  • Siyu Tian
  • Xinyao Huang

DOI:

https://doi.org/10.54097/hset.v7i.1094

Keywords:

Text Classification, Document Classification, Text Categorization

Abstract

Text classification plays a central role in natural language processing, so improving its accuracy and efficiency has been a priority in recent work. In this paper, we survey the latest text classification methods and sort them into three categories: embedding methods, language models, and neural network architectures. We summarize the state of current research and identify shortcomings that may serve as directions for future study.

Published

03-08-2022

How to Cite

Tian, S., & Huang, X. (2022). A survey of text classification: problem statement, latest methods and popular datasets. Highlights in Science, Engineering and Technology, 7, 357-367. https://doi.org/10.54097/hset.v7i.1094