A Survey on Transformer for Optical Character Recognition

Authors

  • Sicheng Zhou

DOI:

https://doi.org/10.54097/fdgc6p75

Keywords:

Optical Character Recognition, Document Analysis, Template Matching, Motion Analysis, Hidden Markov Models, Convolutional Recurrent Neural Network.

Abstract

Optical Character Recognition (OCR) transforms visual text into machine-readable form, supporting the large-scale digitization of printed, handwritten, and scene-based documents. Early approaches, such as template matching and motion analysis, relied on handcrafted patterns and were constrained to limited fonts and simple layouts. The introduction of statistical models, including Hidden Markov Models and Conditional Random Fields, expanded OCR capabilities through probabilistic sequence modeling. With the rise of deep learning, Convolutional and Recurrent Neural Networks enabled end-to-end recognition, reducing dependence on manual feature engineering and improving performance on noisy or cursive text. More recently, transformer-based models like TrOCR have redefined OCR by leveraging self-attention and large-scale pretraining, achieving state-of-the-art results across multilingual and domain-specific applications. These models excel in cross-lingual transfer, low-resource adaptation, and specialized domains such as biomedical and historical text recognition, while integrating pretrained vision–language components for greater robustness against degraded inputs. Despite these advances, challenges persist in adversarial robustness, complex document layout understanding, and fairness across underrepresented languages and scripts. Emerging research directions include zero-shot and few-shot learning, modular adapters for scalable multilingual OCR, post-OCR correction pipelines, efficiency improvements, and privacy-preserving inference. This survey outlines OCR’s historical progression, highlights deep learning and transformer-based breakthroughs, and points to future work needed to address enduring challenges in this critical field of document analysis.
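The early template-matching paradigm mentioned in the abstract can be illustrated with a minimal sketch: a handcrafted binary glyph pattern is slid over a binarized page, and the offset with the best pixel agreement is taken as the match. The `match_template` helper and the toy glyph below are hypothetical, for illustration only; real systems of that era used larger template banks and more robust similarity measures.

```python
import numpy as np

def match_template(image, template):
    """Slide a binary glyph template over a binary image and return the
    (row, col) offset with the highest agreement score - the core idea
    behind template-matching OCR."""
    ih, iw = image.shape
    th, tw = template.shape
    best_score, best_pos = -1.0, (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            window = image[r:r + th, c:c + tw]
            # Score: fraction of pixels that agree with the template.
            score = np.mean(window == template)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

# Toy example: a 3x3 "glyph" embedded in an otherwise blank 6x6 page.
glyph = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]])
page = np.zeros((6, 6), dtype=int)
page[2:5, 1:4] = glyph
pos, score = match_template(page, glyph)
print(pos, score)  # glyph located at offset (2, 1) with a perfect score of 1.0
```

The exhaustive sliding-window search also makes the paradigm's limits concrete: it is brittle to font variation, scale, and noise, which is precisely what motivated the statistical and neural approaches the abstract surveys next.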


References

[1] Ahonen H, Heinonen O, Klemettinen M, et al. Applying data mining techniques for descriptive phrase extraction in digital document collections [C]//Proceedings of the IEEE International Forum on Research and Technology Advances in Digital Libraries (ADL’98). IEEE, 1998: 2-11. doi:10.1109/ADL.1998.670375.

[2] Singh A, Bacchuwar K, Bhasin A. A survey of OCR applications [J]. International Journal of Machine Learning and Computing, 2012, 2 (4): 314-318. doi:10.7763/IJMLC.2012.V2.137.

[3] Nguyen T T H, Jatowt A, Coustaty M, et al. Survey of post-OCR processing approaches [J]. ACM Computing Surveys, 2022, 54 (6): 124:1-124:37. doi:10.1145/3453476.

[4] Mir Asif A, Hannan S A, Perwej Y, et al. An overview and applications of optical character recognition [J]. International Journal of Computer Science and Information Technologies, 2014, 5 (4): 4587-4590.

[5] Tappert C C, Suen C Y, Wakahara T. The state of the art in online handwriting recognition [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1990, 12 (8): 787-808. doi:10.1109/34.57669.

[6] Drobac S, Lindén K. Optical character recognition with neural networks and post-correction with finite state methods [J]. International Journal on Document Analysis and Recognition (IJDAR), 2020, 23 (3): 279-295. doi:10.1007/s10032-020-00359-9.

[7] Song C, Shmatikov V. Fooling OCR systems with adversarial text images [EB/OL]. arXiv:1802.05385, 2018. doi:10.48550/arXiv.1802.05385.

[8] Hartley R T, Crumpton K. Quality of OCR for degraded text images [C]//Proceedings of the Fourth ACM Conference on Digital Libraries (DL’99). ACM, 1999: 228-229. doi:10.1145/313238.313387.

[9] Graves A, Liwicki M, Fernández S, et al. A novel connectionist system for unconstrained handwriting recognition [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31 (5): 855-868. doi:10.1109/TPAMI.2008.137.

[10] Shi B, Bai X, Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition [EB/OL]. arXiv:1507.05717, 2015. https://arxiv.org/abs/1507.05717.

[11] Wang J. A study of the OCR development history and directions of development [J]. Highlights in Science, Engineering and Technology, 2023, 72: 409-415. doi:10.54097/bm665j77.

[12] Li M, Lv T, Chen J, et al. TrOCR: Transformer-based optical character recognition with pre-trained models [EB/OL]. arXiv:2109.10282, 2021. doi:10.48550/arXiv.2109.10282.

[13] Lauar F, Laurent V. Spanish TrOCR: Leveraging transfer learning for language adaptation [EB/OL]. arXiv:2407.06950, 2024. doi:10.48550/arXiv.2407.06950.

[14] Cheema M D A, Shaiq M D, Mirza F, et al. Adapting multilingual vision–language transformers for low-resource Urdu optical character recognition (OCR) [J]. PeerJ Computer Science, 2024, 10: e1964. doi:10.7717/peerj-cs.1964.


Published

29-01-2026

Section

Articles

How to Cite

Zhou, S. (2026). A Survey on Transformer for Optical Character Recognition. Academic Journal of Science and Technology, 19(2), 395-400. https://doi.org/10.54097/fdgc6p75