Rethinking Semantic Contrastive Learning and Content Fusion in Multimodal Retrieval
DOI: https://doi.org/10.54097/
Keywords: Image-Text matching, Cross-Modal retrieval, Image-Text contrastive learning, Zero-Shot retrieval
Abstract
In the domain of image-text retrieval (ITR), recent advances have enabled fine-grained (FG) instance-level retrieval through large-scale vision-language pre-training (VLP). While these methods achieve high accuracy, they also incur substantially higher computational complexity. The primary challenges in cross-modal retrieval are the induction of homogeneous knowledge and the association of heterogeneous knowledge: homogeneous knowledge comprises elements of identical dimensionality, whereas associating heterogeneous knowledge requires its internal elements to be unified first. Traditional cross-modal methods typically extract features from each modality and train them jointly; however, experimental results indicate that performance discrepancies among the modality-specific networks can degrade overall generalization. Current state-of-the-art visual systems aim to minimize constrained supervisory signals in order to improve generalization. Although end-to-end models simplify training, they often demand an exponential increase in data volume that is unmanageable for the average user. Our research demonstrates that pre-training with a singular focus can learn semantic features efficiently and scalably. The proposed model is conceptually straightforward and can be implemented with existing, mature modules. Because each module retains a single responsibility, the model achieves a markedly smaller parameter count and faster training.
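For readers unfamiliar with the image-text contrastive learning referenced in the keywords, the snippet below is a minimal PyTorch sketch of the symmetric InfoNCE objective commonly used in CLIP-style pre-training. The encoder outputs, embedding dimension, and temperature value are illustrative assumptions, not this paper's implementation.

```python
# Minimal sketch of a symmetric image-text contrastive (InfoNCE) loss.
# Embedding dimension, batch size, and temperature below are assumptions
# for illustration only.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    # L2-normalize so dot products equal cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix, shape (batch, batch), scaled by temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Matched image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    # Toy usage: random tensors stand in for image/text encoder outputs.
    imgs = torch.randn(8, 512)
    txts = torch.randn(8, 512)
    print(contrastive_loss(imgs, txts).item())
```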
License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.