Parameter Optimization in DeBERTa for Text Classification Using PCA on Limited GPU Resources

Authors

  • Yvheng Lin

DOI:

https://doi.org/10.54097/cq798v58

Keywords:

DeBERTa, Principal Component Analysis (PCA), Text Classification.

Abstract

This study focuses on optimizing Decoding-enhanced Bidirectional Encoder Representations from Transformers with disentangled Attention (DeBERTa) for text classification tasks, particularly on smaller datasets with limited GPU resources, by using Principal Component Analysis (PCA) to guide and explain parameter choices. The goal is to reduce the model's parameters without significantly compromising accuracy. Specifically, the research applies DeBERTaV3 to the Internet Movie Database (IMDb) dataset, without pre-training, and uses PCA to analyse the embedding and attention components of the model. PCA is applied to identify parameters that can be reduced by observing the Number of smallest factors that have a given Neglected cumulative contribution Rate of principal components (NNR), iteratively refining the model until optimal efficiency is achieved. The study demonstrates that PCA can effectively guide parameter adjustments, allowing a smaller model to maintain high accuracy. The IMDb dataset is used to validate the approach, showing that a reduced DeBERTa model can still achieve competitive performance, making it particularly useful for natural language processing tasks involving minority languages or constrained computational environments. The findings have broader implications for optimizing large language models, suggesting that future work could explore combining PCA with techniques such as factorized embedding parameterization to further enhance efficiency.
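
As an illustration of the NNR-style analysis described above, the following minimal sketch (not taken from the paper's code) applies scikit-learn's PCA to token embeddings produced by the Hugging Face microsoft/deberta-v3-base checkpoint and counts the trailing principal components whose combined contribution stays below a chosen neglected rate. The checkpoint name, the layer analysed, the nnr helper, and the 1% threshold are illustrative assumptions, not the paper's exact settings.

import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

# Illustrative sketch only: checkpoint, layer, and threshold are assumptions,
# not the settings used in the paper.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModel.from_pretrained("microsoft/deberta-v3-base")
model.eval()

# A handful of example reviews; a real analysis would use the IMDb training set
# so that the number of token vectors far exceeds the hidden size.
texts = [
    "A gripping film with a disappointing ending.",
    "One of the best movies I have seen this year.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # Last hidden state of the encoder: (batch, seq_len, hidden_size).
    hidden = model(**batch).last_hidden_state

# Keep only real tokens (drop padding) and flatten to (num_tokens, hidden_size).
mask = batch["attention_mask"].reshape(-1).bool()
token_vectors = hidden.reshape(-1, hidden.shape[-1])[mask].numpy()

pca = PCA()
pca.fit(token_vectors)

def nnr(explained_variance_ratio, neglected_rate=0.01):
    # Count the smallest components whose cumulative contribution stays
    # below the neglected rate (an illustrative reading of the NNR metric).
    tail = np.cumsum(np.sort(explained_variance_ratio))
    return int(np.sum(tail <= neglected_rate))

print("hidden size:", token_vectors.shape[-1])
print("components negligible at a 1% neglected rate:", nnr(pca.explained_variance_ratio_))

In the iterative refinement the abstract describes, the corresponding dimension would be shrunk and the measurement repeated until further reduction starts to cost accuracy.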

Published

26-12-2024

How to Cite

Lin, Y. (2024). Parameter Optimization in DeBERTa for Text Classification Using PCA on Limited GPU Resources. Highlights in Science, Engineering and Technology, 120, 88-94. https://doi.org/10.54097/cq798v58