Research on Code Generation Technology based on LLM Pre-training

Authors

  • Ling Chen

DOI:

https://doi.org/10.54097/scrwpt34

Keywords:

LLM, Pre-training, Code Generation, NL-PL

Abstract

In recent years, large language model (LLM) technology has advanced rapidly, and the pre-trained code generation techniques built on it have attracted extensive attention in industry. With LLMs, descriptions in natural language (NL) can be converted into the programming languages (PL) written by professional developers, which greatly lowers the barrier to programming; pre-trained models have demonstrated significant performance advantages on code generation tasks. This paper systematically organizes, studies, and summarizes pre-trained code generation techniques from recent years. First, a development timeline of pre-trained models related to code generation is distilled from the relevant research results. Second, the characteristics of the different pre-trained code generation models are organized and summarized, the evaluation mechanisms and datasets used for these models are presented, and the reported research data are compared and analyzed. Finally, in light of the current state of development, future directions for code generation technology are discussed.
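To make the NL-to-PL conversion described above concrete, the following minimal sketch prompts an open pre-trained code model with a natural-language intent and decodes the program it generates. The checkpoint name, prompt, and decoding settings are illustrative assumptions chosen for brevity, not details drawn from this paper.

```python
# Minimal NL-to-PL sketch (illustrative assumptions, not this paper's setup):
# prompt a pre-trained code model with a natural-language intent and decode
# the generated program using Hugging Face Transformers.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Salesforce/codegen-350M-mono"  # assumed open checkpoint for Python code
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Natural-language side (NL): the intent is stated in a docstring-style prompt.
prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Programming-language side (PL): the decoded continuation is the proposed code.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

In the surveyed work, outputs of this kind are then scored against benchmark datasets, typically by executing functional tests or by computing match-based metrics against reference solutions.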


Published

28-10-2024

Issue

Section

Articles

How to Cite

Chen, L. (2024). Research on Code Generation Technology based on LLM Pre-training. Frontiers in Computing and Intelligent Systems, 10(1), 69-75. https://doi.org/10.54097/scrwpt34