A Review of Applications and Research Advances of Large Models in OCR Technology
DOI:
https://doi.org/10.54097/tg9jdp05
Keywords:
OCR. Large Language Models. Natural Scene Text Recognition. Systematic Review. Paradigm Shift.
Abstract
Optical Character Recognition (OCR) is a key technology for document digitization and textual information extraction. However, traditional methods are often limited by practical challenges such as font diversity, complex layouts, and image degradation. In response, this paper provides a systematic review of recent advances in OCR driven by large models. First, a review framework of “bottleneck analysis – technical response – performance evaluation” is constructed, and the paradigm shift is examined along three dimensions: model architecture, training paradigms, and application scenarios. Within this framework, we systematically categorize and compare three mainstream technical approaches: cross-modal pre-trained models, end-to-end sequence generation methods, and few-shot/zero-shot adaptation techniques. Then, through a comparative analysis of representative models in terms of accuracy, robustness, and efficiency, we highlight the significant advantages of large models in complex document understanding, natural scene text recognition, and multilingual generalization, while also noting their persistent limitations in computational cost, hallucination suppression, and domain-specific adaptation. Finally, considering emerging trends such as lightweight deployment, multimodal fusion, and privacy-preserving computation, the paper envisions OCR evolving toward an efficient, adaptive, and cognitively enabled paradigm, thereby providing a roadmap for future research and practice.
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.







