Research and Analysis of Artificial Intelligence in Abnormal Image Recognition
DOI: https://doi.org/10.54097/nv6bgz46

Keywords: Computer vision, Artificial Intelligence model, Image recognition

Abstract
Image recognition technologies are increasingly integrated into AI systems, and these systems are gradually being deployed in production scenarios. This article examines how two mainstream AI models, ChatGPT5 and Qwen VL-Max, perform at recognizing abnormal images. The research divides abnormal images into four categories: Anatomical Anomalies, Physical Law Violations, Functional and Contextual Incongruities, and Scale and Proportion Paradoxes. The aim is to discover whether different AIs perceive abnormal images generated by divergent models differently, and thereby to determine how differences in the training dataset affect model performance. Using three progressively in-depth questions for each image, the researchers found that on most questions the difference between the two models was not significant, and both could identify the issues in the images. However, on anatomical anomalies, which most mainstream models struggled to detect, ChatGPT, although giving incorrect answers, reflected on them and proactively requested a comparison with normal structural models. The researchers hope these results will provide insights for developers.
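The evaluation protocol described above can be sketched in code: each image belongs to one of the four anomaly categories, each model answers three progressively deeper questions per image, and per-category accuracy is then compared across models. This is a minimal illustrative sketch only; the `Trial` record, the scoring function, and the sample data are all hypothetical and do not reflect the paper's actual results.

```python
from dataclasses import dataclass

# The four abnormal-image categories used in the study.
CATEGORIES = [
    "Anatomical Anomalies",
    "Physical Law Violations",
    "Functional and Contextual Incongruities",
    "Scale and Proportion Paradoxes",
]

@dataclass
class Trial:
    """One image shown to one model, scored on three progressively deeper questions."""
    category: str
    model: str
    answers: tuple  # (q1_correct, q2_correct, q3_correct) as booleans

def score_by_category(trials):
    """Fraction of correct answers per (model, category) pair."""
    totals = {}
    for t in trials:
        key = (t.model, t.category)
        correct, asked = totals.get(key, (0, 0))
        totals[key] = (correct + sum(t.answers), asked + len(t.answers))
    return {key: correct / asked for key, (correct, asked) in totals.items()}

# Illustrative data only -- not the paper's measured outcomes.
trials = [
    Trial("Anatomical Anomalies", "ChatGPT5", (True, True, False)),
    Trial("Anatomical Anomalies", "Qwen VL-Max", (True, False, False)),
]
scores = score_by_category(trials)
```

A harness of this shape makes the comparison reproducible: adding a new model or a new anomaly category only means appending more `Trial` records.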
License
Copyright (c) 2026 Frontiers in Computing and Intelligent Systems

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

