Target Research Based on BLIP Model


  • Haisheng Song
  • Yingdong Song



Object retrieval; BLIP model; Feature extraction; Similarity measure.


Vision-language pre-training (VLP) has significantly improved performance on many vision-language tasks. However, most existing pre-trained models excel at either understanding tasks or generation tasks, but not both. Furthermore, performance gains have largely come from scaling up datasets of noisy image-text pairs collected from the web, which are a suboptimal source of supervision. BLIP is a VLP framework that can be flexibly applied to both vision-language understanding and generation tasks. It makes effective use of noisy web data by bootstrapping the captions: a captioner generates synthetic captions, and a filter removes the noisy ones. To meet the practical need of existing search engines for faster and more accurate retrieval, this paper proposes an improved method based on the BLIP algorithm. We change BLIP's image-text retrieval strategy from image-text contrastive (ITC) comparison to image-text matching (ITM) comparison, and use a hard-sample strategy to improve the model's ability to discriminate between positive and negative samples, which further improves retrieval accuracy.
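The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It assumes hard negatives are drawn in proportion to the softmax of the ITC similarity (a common choice for ITM training), and that retrieval shortlists candidates by ITC similarity before reordering them by ITM score. The names `sample_hard_negatives`, `rerank_with_itm`, and the `itm_score` callable are hypothetical stand-ins, not the authors' code.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_hard_negatives(sim, rng):
    """For each image i (row of sim), sample one text index j != i.

    sim[i][j] is the ITC similarity between image i and text j; the
    diagonal holds the positive pairs. Texts that look more similar to
    the image are more likely to be picked (assumed sampling rule).
    """
    negatives = []
    for i, row in enumerate(sim):
        weights = softmax(row)
        weights[i] = 0.0  # never sample the positive pair as a negative
        negatives.append(rng.choices(range(len(row)), weights=weights, k=1)[0])
    return negatives


def rerank_with_itm(query_sims, itm_score, k):
    """Shortlist the top-k candidates by ITC similarity, then sort by ITM score.

    itm_score is a hypothetical callable mapping a candidate index to the
    matching score an ITM head would produce for the (query, candidate) pair.
    """
    by_itc = sorted(range(len(query_sims)), key=lambda j: query_sims[j], reverse=True)
    return sorted(by_itc[:k], key=itm_score, reverse=True)


# Toy data: ITC similarities of one query against four candidates,
# and made-up ITM scores for each candidate.
toy_itc = [0.9, 0.1, 0.8, 0.7]
toy_itm = {0: 0.2, 1: 1.0, 2: 0.9, 3: 0.5}
print(rerank_with_itm(toy_itc, toy_itm.get, k=3))
```

In this toy run, candidate 1 has the highest ITM score but is never scored, because it falls outside the ITC shortlist; the shortlist size `k` therefore trades retrieval speed against the chance of missing a true match.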





How to Cite

Target Research Based on BLIP Model. (2024). Academic Journal of Science and Technology, 9(1), 80-86.
