A Survey of Language Priors for Visual Question Answering
DOI: https://doi.org/10.54097/fcis.v4i2.9941
Keywords: Visual Question Answering, Multimodality, Language Priors
Abstract
In recent years, with the development of deep learning, visual question answering (VQA) has attracted growing attention from researchers. As large-scale benchmark datasets have continued to improve, a large body of VQA research has been published, and the accuracy of deep-learning-based VQA models on these datasets has risen steadily. Recent studies, however, have found that previously proposed VQA models suffer to varying degrees from the dataset language prior problem: during training, the model relies too heavily on the strong correlation between questions and answers. This article briefly describes the main research methods and, based on existing research, looks ahead to future directions for alleviating the language prior problem in visual question answering.


