A Review on Vision-Language-Based Approaches: Challenges and Applications  

Authors: Huu-Tuong Ho, Luong Vuong Nguyen, Minh-Tien Pham, Quang-Huy Pham, Quang-Duong Tran, Duong Nguyen Minh Huy, Tri-Hai Nguyen

Affiliations: [1] Department of Artificial Intelligence, FPT University, Danang 550000, Vietnam; [2] Department of Business, FPT University, Danang 550000, Vietnam; [3] Faculty of Information Technology, School of Technology, Van Lang University, Ho Chi Minh City 70000, Vietnam

Source: Computers, Materials & Continua, 2025, No. 2, pp. 1733–1756 (24 pages)

Abstract: In multimodal learning, Vision-Language Models (VLMs) have become a critical research focus, enabling the integration of textual and visual data. These models have shown significant promise across various natural language processing tasks, such as visual question answering, and computer vision applications, including image captioning and image-text retrieval, highlighting their adaptability to complex, multimodal datasets. In this work, we review the landscape of Bootstrapping Language-Image Pre-training (BLIP) and other VLM techniques. A comparative analysis is conducted to assess VLMs' strengths, limitations, and applicability across tasks, while examining challenges such as scalability, data quality, and fine-tuning complexity. The work concludes by outlining potential future directions in VLM research, focusing on enhancing model interpretability, addressing ethical implications, and advancing multimodal integration in real-world applications.

Keywords: Bootstrapping Language-Image Pre-training (BLIP); multimodal learning; Vision-Language Model (VLM); Vision-Language Pre-training (VLP)

Classification: TP3 [Automation and Computer Technology — Computer Science and Technology]
