An Empirical Study on the Effectiveness of DistilBERT Fine-Tuning for IMDb Sentiment Classification: Outperforming CNN/LSTM Baselines
DOI: https://doi.org/10.54097/yn47ye37
Keywords: IMDb Sentiment Classification, fine-tuning, deep learning
Abstract
Sentiment analysis of user reviews is a core Natural Language Processing (NLP) task with practical applications in real-world scenarios such as recommendation systems. However, traditional approaches that train neural network models such as CNNs, RNNs, and LSTMs have faced challenges in improving recognition accuracy. With the development of language models (LMs), whose self-attention mechanism enables contextual understanding, this study investigates whether language models surpass traditional neural networks on this task. This study conducts a controlled comparison on the IMDb Large Movie Review Dataset (50K reviews) for binary sentiment classification of long movie reviews. Four models are evaluated on a cleaned version of the dataset with a maximum sequence length of 256 tokens and a stratified 8:1:1 train/validation/test split, with results averaged over multiple random seeds. On the IMDb test set, TextCNN reaches an accuracy of 0.873 and LSTM reaches 0.862, while DistilBERT achieves 0.911, consistently outperforming these strong CNN/LSTM baselines by about 4 percentage points. In addition, an accuracy-latency trade-off is observed: DistilBERT offers the best quality at moderate runtime, while TextCNN and LSTM deliver lower latency. Overall, the results confirm that a compact pretrained model with fine-tuning provides clear quality gains on long, nuanced reviews, while traditional approaches remain attractive when speed and simplicity are the priority.
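To make the experimental setup concrete, the following is a minimal Python sketch (not the authors' released code) of the pipeline the abstract describes: pooling the 50K IMDb reviews, drawing a stratified 8:1:1 split, tokenizing with truncation to 256 tokens, and fine-tuning distilbert-base-uncased with the Hugging Face Trainer. The checkpoint name, the hyperparameters (epochs, batch size, learning rate), and the single seed shown are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np
from datasets import Dataset, load_dataset
from sklearn.model_selection import train_test_split
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    set_seed,
)

SEED = 42  # one of the multiple seeds; re-run with others and average
set_seed(SEED)

# Pool the official IMDb train/test halves (50K labeled reviews total),
# then re-split 8:1:1 with label stratification, as the abstract describes.
imdb = load_dataset("imdb")
texts = imdb["train"]["text"] + imdb["test"]["text"]
labels = imdb["train"]["label"] + imdb["test"]["label"]

train_x, rest_x, train_y, rest_y = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=SEED)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=SEED)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate long reviews to the paper's maximum sequence length of 256.
    return tokenizer(batch["text"], truncation=True, max_length=256)

def make_split(x, y):
    return Dataset.from_dict({"text": x, "label": y}).map(tokenize, batched=True)

train_ds = make_split(train_x, train_y)
val_ds = make_split(val_x, val_y)
test_ds = make_split(test_x, test_y)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def accuracy(eval_pred):
    logits, gold = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == gold).mean()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="distilbert-imdb",    # checkpoint directory (assumed name)
        num_train_epochs=3,              # illustrative hyperparameters, not
        per_device_train_batch_size=16,  # the paper's reported settings
        learning_rate=2e-5,
    ),
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate(test_ds))  # reports held-out test accuracy
```

Re-running this script with several different SEED values and averaging the test accuracy mirrors the multi-seed protocol the abstract mentions.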
License
Copyright (c) 2026 Academic Journal of Science and Technology

This work is licensed under a Creative Commons Attribution 4.0 International License.