The Development and Application of Multimodal Large Models
DOI: https://doi.org/10.54097/894fjr44

Keywords: Multimodal Large Language Model, Modality Fusion, Image-Text Generation, Embodied Intelligence

Abstract
This paper systematically reviews recent research on mainstream multimodal large model architectures, training strategies, multimodal fusion designs, and practical applications, focusing on the current state and future trends of multimodal large language models (MLLMs). By analyzing the architectures of representative models such as CLIP, BLIP-2, PaLM-E, and GPT-4V, it summarizes their practical applications, common cross-modal alignment methods, pre-training strategies, and standard task evaluation methods. At the application level, we examine in detail the typical use cases and performance of MLLMs in key areas such as image-based content creation, cross-modal retrieval, visual question answering, and multimodal dialogue. The analysis finds that MLLMs still face significant challenges in mitigating modal hallucinations, maintaining semantic consistency, improving reasoning capabilities, and lowering training costs. The paper also summarizes open research issues and argues that future multimodal models should aim for greater generalization, modularity, controllability, and low-resource adaptability. Finally, building on existing research, it suggests several promising directions for further exploration, including multimodal in-context learning (M-ICL), vision-language chain-of-thought reasoning, cross-domain knowledge transfer, and the miniaturization and deployment optimization of multimodal large models.
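The cross-modal alignment approach popularized by CLIP-style models can be sketched as a symmetric contrastive (InfoNCE) objective over paired image and text embeddings. The following is a minimal NumPy illustration, not the implementation of any specific model; the function name, batch shapes, and the temperature value are illustrative assumptions.

```python
import numpy as np

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Matched image-text pairs sit on the diagonal of the similarity matrix;
    the loss pulls matched pairs together and pushes mismatched pairs apart.
    """
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (batch, batch)
    labels = np.arange(len(logits))                # i-th image matches i-th text

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Under this objective, a batch of well-aligned pairs yields a lower loss than a batch of unrelated pairs, which is what drives the joint embedding space that downstream retrieval and generation tasks build on.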
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.