A Survey of Monocular Depth Estimation based on Deep Learning
DOI: https://doi.org/10.54097/7IuosCDg
Keywords: Depth Estimation, Monocular Depth Estimation, Supervised, Unsupervised
Abstract
Depth information is essential for machines to perceive their environment and estimate their own state. Advances in robotics and self-driving cars in recent decades have increased the demand for accurate depth measurements. Traditional depth estimation methods include structure from motion and stereo vision matching, but these rely on feature correspondences across multiple viewpoints, and the depth maps they predict are sparse. Depth estimation is a long-standing computer vision task that a variety of methods can solve well, whereas inferring depth from a single image is an ill-posed problem. The main objective of this paper is to provide a brief overview of the development of monocular depth estimation techniques based on deep learning. It covers supervised and unsupervised methods, as well as datasets and evaluation metrics. We conclude with a brief analysis of future developments.
License
Copyright (c) 2024 Frontiers in Computing and Intelligent Systems

This work is licensed under a Creative Commons Attribution 4.0 International License.

