Research and Analysis of Artificial Intelligence in Abnormal Image Recognition
DOI: https://doi.org/10.54097/nv6bgz46

Keywords: Computer vision, Artificial Intelligence model, Image recognition

Abstract
Image recognition technologies are increasingly integrated into AI systems, and these systems are gradually being deployed in production scenarios. This article examines how two mainstream AI models, ChatGPT5 and Qwen VL-Max, perform at recognizing abnormal images. The research divides abnormal images into four categories: Anatomical Anomalies, Physical Law Violations, Functional and Contextual Incongruities, and Scale and Proportion Paradoxes. The aim is to discover whether different AIs perceive abnormal images generated by divergent models differently, and thereby to determine how differences in the training dataset affect model performance. Using three progressively in-depth questions for each image, the researchers found that on most questions the difference between the two models was not significant, and both could identify the issues in the images. However, on anatomical anomalies, which most mainstream models struggled to detect, ChatGPT, although giving incorrect answers, reflected on them and proactively requested a comparison with normal structural models. The researchers hope these results will provide insights for developers.
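The evaluation protocol described above can be sketched in code: each image belongs to one of the four anomaly categories, each model answers three progressively deeper questions per image, and per-category accuracy is then compared across models. This is a minimal illustrative sketch only; the `Trial` record, the scoring function, and the sample data are all hypothetical and do not reflect the paper's actual results.

```python
from dataclasses import dataclass

# The four abnormal-image categories used in the study.
CATEGORIES = [
    "Anatomical Anomalies",
    "Physical Law Violations",
    "Functional and Contextual Incongruities",
    "Scale and Proportion Paradoxes",
]

@dataclass
class Trial:
    """One image shown to one model, scored on three progressively deeper questions."""
    category: str
    model: str
    answers: tuple  # (q1_correct, q2_correct, q3_correct) as booleans

def score_by_category(trials):
    """Fraction of correct answers per (model, category) pair."""
    totals = {}
    for t in trials:
        key = (t.model, t.category)
        correct, asked = totals.get(key, (0, 0))
        totals[key] = (correct + sum(t.answers), asked + len(t.answers))
    return {key: correct / asked for key, (correct, asked) in totals.items()}

# Illustrative data only -- not the paper's measured outcomes.
trials = [
    Trial("Anatomical Anomalies", "ChatGPT5", (True, True, False)),
    Trial("Anatomical Anomalies", "Qwen VL-Max", (True, False, False)),
]
scores = score_by_category(trials)
```

A harness of this shape makes the comparison reproducible: adding a new model or a new anomaly category only means appending more `Trial` records.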
License
Copyright (c) 2026 Frontiers in Computing and Intelligent Systems

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

