Study on Small Sample Text Classification Based on Multi-Level Self-Attention and Multi-Feature Residual Fusion under Data Enhancement
DOI: https://doi.org/10.54097/ee4gsk44

Keywords: MacBERT, Data Enhancement, Self-attention Mechanism, Text Categorization, Small Samples

Abstract
In commercial applications, while traditional models can achieve comparable performance to mainstream large language models, they generally necessitate extensive training data. This requirement presents a significant challenge when processing complex, lengthy Chinese texts and multi-label classification tasks with limited data availability. Furthermore, conventional data augmentation techniques frequently disrupt the original word order, thereby diminishing their efficacy for pre-trained language model applications. To overcome these limitations, we introduce the MacBERT-CNN-BiLSTM model, which incorporates a multi-level self-attention mechanism to dynamically weight the integrated features extracted from MacBERT, CNN, and BiLSTM components. Our methodology preserves the integrity of original features during the final fusion phase by combining weighted features with original features through residual connections, thus generating a comprehensive final representation. This approach culminates in our MacBERT-BiLSTM-CNN-ResAttNet model (MBCResAttNet), specifically designed for multi-label classification of small-sample Chinese literature abstracts. We conducted extensive evaluations of our model across three datasets: AEDA-augmented, EDA-augmented, and original samples, benchmarking against six alternative models. The empirical results demonstrate that incorporating pre-trained language models substantially enhances classification performance. Moreover, the multi-level self-attention mechanism combined with residual feature fusion effectively captures global textual patterns, resulting in significant performance improvements. In the context of pre-trained language models, AEDA demonstrates superior efficacy compared to EDA in maintaining original semantic integrity. Additionally, the residual feature fusion methodology preserves critical original information while markedly improving model performance. With the implementation of AEDA augmentation, all evaluated models exhibited performance gains exceeding 10%, with our MBCResAttNet model attaining 96.17% accuracy—representing a substantial 13.41% improvement over baseline methods.
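The fusion step described in the abstract lends itself to a brief illustration. The following PyTorch sketch shows one way an attention-weighted, residual fusion of the MacBERT, CNN, and BiLSTM branch features could look; the hidden size, the use of nn.MultiheadAttention, and the mean-pooling step are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ResidualAttentionFusion(nn.Module):
    """Illustrative sketch of attention-weighted feature fusion with a
    residual connection, in the spirit of the MBCResAttNet description.
    Layer sizes, head count, and pooling are assumptions, not the
    authors' reported settings."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, mac_feat, cnn_feat, lstm_feat):
        # Treat the three branch vectors as a length-3 sequence so that
        # self-attention can weight them against one another.
        feats = torch.stack([mac_feat, cnn_feat, lstm_feat], dim=1)   # (B, 3, D)
        weighted, _ = self.attn(feats, feats, feats)
        # The residual connection keeps the original branch features alongside
        # the attention-weighted ones before pooling into a single representation.
        fused = self.norm(weighted + feats).mean(dim=1)               # (B, D)
        return fused

# Example: three 768-dimensional branch features for a batch of 4 abstracts.
fusion = ResidualAttentionFusion(dim=768)
out = fusion(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 768))
print(out.shape)  # torch.Size([4, 768])
```

Pooling the weighted-plus-original features keeps the dimensionality of the final representation independent of the number of branches, which is one plausible reading of the "comprehensive final representation" mentioned above.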
License
Copyright (c) 2025 Frontiers in Computing and Intelligent Systems

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

