Dynamic Layer Skipping for Large Language Models on Natural Language Understanding Tasks and Machine Translation Using Reinforcement Learning

Authors

  • Wei Xu
  • Xiaodong Jin

DOI:

https://doi.org/10.54097/wy0g8m89

Keywords:

Large Language Model, Reinforcement Learning, SOTA Performance

Abstract

Large Language Models (LLMs) demonstrate remarkable proficiency across a wide range of natural language processing (NLP) tasks. However, their extensive size, stemming from billions of parameters spread across many layers, poses significant challenges for storage, training, and inference. Traditional techniques such as model pruning and distillation reduce model size, but they often compromise performance. In this work, we propose a novel framework that dynamically skips layers on a per-sample basis to accelerate LLM inference. First, we add an adapter at each transformer layer that predicts whether to skip the next layer, and we introduce layer-skip pretraining to recover the model's performance. Second, we optimize the model with reinforcement learning (RL) and design several strategies to stabilize training. Extensive experiments on four natural language understanding (NLU) datasets and three machine translation datasets, together with ablation studies, show that our method achieves state-of-the-art (SOTA) performance among layer-skipping methods for LLMs.
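
To make the two ingredients of the abstract concrete, below is a minimal sketch of how a per-layer skip gate and a REINFORCE-style objective could be wired together, assuming a PyTorch implementation; the module and function names (SkipGate, SkippableEncoder, reinforce_loss), the bottleneck size, and the reward weighting alpha are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of per-sample dynamic layer skipping trained with
# REINFORCE. Names and hyperparameters are illustrative, not from the paper.
import torch
import torch.nn as nn


class SkipGate(nn.Module):
    """Lightweight adapter that predicts the probability of skipping the next layer."""

    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Pool over the sequence dimension so one skip decision is made per sample.
        pooled = hidden_states.mean(dim=1)                      # (batch, hidden)
        return torch.sigmoid(self.net(pooled)).squeeze(-1)      # (batch,)


class SkippableEncoder(nn.Module):
    """Transformer stack in which each layer may be skipped on a per-sample basis."""

    def __init__(self, num_layers: int, hidden_size: int, num_heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden_size, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.gates = nn.ModuleList(SkipGate(hidden_size) for _ in range(num_layers))

    def forward(self, hidden_states: torch.Tensor, sample: bool = True):
        log_probs, skips = [], []
        for layer, gate in zip(self.layers, self.gates):
            p_skip = gate(hidden_states)
            dist = torch.distributions.Bernoulli(probs=p_skip)
            skip = dist.sample() if sample else (p_skip > 0.5).float()
            log_probs.append(dist.log_prob(skip))
            skips.append(skip)
            keep = (1.0 - skip).view(-1, 1, 1)
            # Training-time sketch: the layer is still evaluated and masked out;
            # an actual inference path would bypass the computation entirely.
            hidden_states = keep * layer(hidden_states) + (1.0 - keep) * hidden_states
        return hidden_states, torch.stack(log_probs, dim=1), torch.stack(skips, dim=1)


def reinforce_loss(task_reward, skips, log_probs, alpha=0.1, baseline=0.0):
    """REINFORCE objective: per-sample task reward plus a bonus proportional to the
    fraction of layers skipped, with a subtracted baseline as one simple
    variance-reduction (training-stabilization) strategy."""
    reward = task_reward + alpha * skips.mean(dim=1)            # (batch,)
    advantage = (reward - baseline).detach()
    return -(advantage.unsqueeze(1) * log_probs).mean()
```

In this sketch, task_reward would be a per-sample quality signal (for example, accuracy on an NLU example or negative loss on a translation), and at inference time the gate decision would be thresholded so that skipped layers are not computed at all, which is where the speedup comes from.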

Published

26-09-2024

Issue

Vol. 9 No. 3 (2024)

Section

Articles

How to Cite

Xu, W., & Jin, X. (2024). Dynamic Layer Skipping for Large Language Models on Natural Language Understanding Tasks and Machine Translation Using Reinforcement Learning. Frontiers in Computing and Intelligent Systems, 9(3), 1-10. https://doi.org/10.54097/wy0g8m89