The Investigation of Performance Comparison for VGG, YOLO, and DINO in Image Classification

Authors

  • Yanqi Chen

DOI:

https://doi.org/10.54097/9bgem219

Keywords:

YOLO, VGG, DINO, deep learning.

Abstract

The rise of artificial intelligence has led to a proliferation of deep learning models, yet comparative analyses remain scarce, particularly among computer vision models rooted in different design philosophies. This study therefore examines the strengths of various models through their structural attributes, with the aim of offering insights that can inform the development of higher-performing models in the future. It first selects representative models embodying three distinct design philosophies and outlines their structural differences; it then measures their performance on a common dataset and analyzes the reasons behind the results. In the experiment, four models, VGG16, YOLOv5, YOLOv8, and DINOv2, were deployed and tested on the Fruits 360 dataset, achieving final accuracies of 0.955, 0.997, 0.998, and 0.986, respectively. The YOLO and DINO models were markedly more accurate than the VGG model. This result may stem from the anchor boxes introduced in the YOLO models and the attention mechanism in the DINO model, both of which indirectly enlarge the receptive field available for feature extraction. YOLOv8 shows a slight accuracy improvement over YOLOv5, possibly owing to its decoupled head, which reduces the influence of localization information on the classification task.
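The comparison above rests on top-1 accuracy over a held-out test split. A minimal sketch of that metric, using the scores reported in the abstract (the `top1_accuracy` helper is illustrative, not code from the study):

```python
def top1_accuracy(predictions, labels):
    """Fraction of samples whose predicted class matches the ground-truth label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Final accuracies reported in the study (Fruits 360 dataset)
reported = {"VGG16": 0.955, "YOLOv5": 0.997, "YOLOv8": 0.998, "DINOv2": 0.986}

# Rank the models by reported accuracy, best first
ranking = sorted(reported, key=reported.get, reverse=True)
print(ranking)  # ['YOLOv8', 'YOLOv5', 'DINOv2', 'VGG16']
```

The roughly four-point gap between VGG16 and the YOLO models is large for a dataset as visually clean as Fruits 360, which is what motivates the structural explanation (receptive field and decoupled head) offered in the abstract.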



Published

13-03-2024

How to Cite

Chen, Y. (2024). The Investigation of Performance Comparison for VGG, YOLO, and DINO in Image Classification. Highlights in Science, Engineering and Technology, 85, 984-990. https://doi.org/10.54097/9bgem219