Enhancing the Mathematical Reasoning Ability of Small Language Models through Thought Chain Distillation
DOI:
https://doi.org/10.54097/7cehgc40

Keywords:
Thought chain distillation; Small language model; Mathematical reasoning; QLoRA; Model fine-tuning

Abstract
Large language models (LLMs) have demonstrated strong capabilities on reasoning tasks through Chain of Thought (CoT) prompting, but their scale makes them difficult to deploy in resource-constrained environments. This paper explores transferring the reasoning capabilities of large models to small models through CoT distillation in order to improve the small models' performance on complex reasoning. The study uses the OpenR1-Math-220k dataset, generated by DeepSeek R1, as the source of teacher reasoning processes and applies the QLoRA parameter-efficient fine-tuning technique to the Qwen3-8B small model. Experimental results show that the fine-tuned model improves accuracy on the AIME-level mathematics test set by 10% over the baseline model and can generate structured reasoning steps. The study verifies the effectiveness of CoT distillation and provides a reproducible baseline framework for deploying small models on reasoning tasks. The paper concludes by outlining directions for future research.
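As a rough sketch of the training setup described in the abstract, the following Python snippet shows how a Qwen3-8B checkpoint could be prepared for QLoRA fine-tuning on CoT traces from OpenR1-Math-220k using the Hugging Face transformers, peft, and datasets libraries. The Hub identifiers, dataset field names ("problem", "solution"), LoRA rank, target modules, and prompt format are illustrative assumptions, not the paper's reported configuration.

# Minimal QLoRA fine-tuning sketch for CoT distillation (illustrative only;
# hyperparameters and dataset field names are assumptions, not the paper's settings).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "Qwen/Qwen3-8B"

# 4-bit NF4 quantization of the frozen base model: the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Trainable low-rank adapters on the attention projections (rank and targets are assumptions).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Teacher reasoning traces: each example pairs a math problem with DeepSeek R1's chain of thought.
dataset = load_dataset("open-r1/OpenR1-Math-220k", split="train")

def format_example(example):
    # "problem" and "solution" are assumed column names in the distillation corpus.
    prompt = f"Problem: {example['problem']}\nStep-by-step reasoning:"
    return {"text": prompt + " " + example["solution"] + tokenizer.eos_token}

dataset = dataset.map(format_example)
# ... tokenize `text` and train with a standard causal-LM objective (e.g. trl's SFTTrainer).

In a full run, the formatted text would be tokenized and trained with a standard causal-language-modeling loss, after which only the LoRA adapter weights need to be saved and, optionally, merged into the base model for deployment.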
License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.







