Feature-Fusion Parallel Decoding Transformer for Image Captioning

Authors

  • Chenhao Zhu
  • Xia Ye
  • Qiduo Lu

DOI:

https://doi.org/10.54097/ajst.v2i1.905

Keywords:

Feature fusion, Grid features, Image captioning, Parallel decoding, Region features, Transformer

Abstract

Image captioning is an important research direction at the intersection of computer vision and natural language processing. Building on object detection, it enables machines to describe image content in grammatically correct natural-language sentences. Most existing methods employ a Transformer-based architecture and achieve cutting-edge performance, but they focus mainly on improving visual feature extraction, for example by refining grid features, region features, or the interplay between them. In this paper, we instead improve the model from the perspective of both its structure and its visual feature extraction. We propose the Feature-Fusion Parallel Decoding Transformer (FPDT), which adopts a parallel decoding mode and exploits both grid features and region features. Extensive experiments on the MSCOCO dataset show that FPDT's performance is competitive with the state of the art.
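
The abstract names two ideas, fusing grid and region features and decoding with a Transformer, without giving implementation details. As a rough illustration only, the following minimal PyTorch sketch fuses the two feature types by projecting them into a shared space and concatenating them as the decoder's cross-attention memory. The module name, dimensions, and concatenation-based fusion are assumptions for illustration, and a standard autoregressive decoder layer stands in here; FPDT's actual fusion and parallel-decoding mechanisms may differ.

import torch
import torch.nn as nn

class FeatureFusionDecoderSketch(nn.Module):
    """Illustrative stand-in for grid/region feature fusion with a
    Transformer decoder; not the paper's actual FPDT implementation."""

    def __init__(self, d_model=512, n_heads=8, vocab_size=10000):
        super().__init__()
        # Project both visual feature types into a shared embedding space.
        self.grid_proj = nn.Linear(2048, d_model)    # e.g. CNN grid features
        self.region_proj = nn.Linear(2048, d_model)  # e.g. Faster R-CNN regions
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, grid_feats, region_feats, tokens):
        # Fuse by concatenating along the sequence axis so cross-attention
        # can attend over grid cells and region proposals jointly.
        memory = torch.cat(
            [self.grid_proj(grid_feats), self.region_proj(region_feats)], dim=1
        )
        tgt = self.embed(tokens)
        return self.out(self.decoder(tgt, memory))

# Shape check with dummy inputs: 49 grid cells, 36 region proposals.
model = FeatureFusionDecoderSketch()
grid = torch.randn(2, 49, 2048)
regions = torch.randn(2, 36, 2048)
tokens = torch.randint(0, 10000, (2, 15))
print(model(grid, regions, tokens).shape)  # torch.Size([2, 15, 10000])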

Published

14-07-2022

Issue

Vol. 2 No. 1 (2022)

Section

Articles

How to Cite

Zhu, C., Ye, X., & Lu, Q. (2022). Feature-Fusion Parallel Decoding Transformer for Image Captioning. Academic Journal of Science and Technology, 2(1), 114-120. https://doi.org/10.54097/ajst.v2i1.905