MMInstruct: a high-quality multi-modal instruction tuning dataset with extensive diversity

Authors: Yangzhou LIU, Yue CAO, Zhangwei GAO, Weiyun WANG, Zhe CHEN, Wenhai WANG, Hao TIAN, Lewei LU, Xizhou ZHU, Tong LU, Yu QIAO, Jifeng DAI

Affiliations: [1] School of Computer Science, Nanjing University, Nanjing 210023, China; [2] Shanghai AI Laboratory, Shanghai 200232, China; [3] SenseTime Research, Shanghai 200233, China; [4] Department of Electronic Engineering, Tsinghua University, Beijing 100084, China; [5] School of Computer Science, Fudan University, Shanghai 200433, China; [6] Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong 999077, China; [7] School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

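Source: Science China (Information Sciences), 2024, Issue 12, pp. 32-47, 16 pages in total. Chinese journal title: 中国科学(信息科学)(英文版) [Science China (Information Sciences), English edition]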

Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 62372223, 62376134); National Key R&D Program of China (Grant No. 2022ZD0161300); China Mobile Zijin Innovation Institute (Grant No. NR2310J7M); and the Youth PhD Student Research Project under the National Natural Science Foundation of China (Grant No. 623B2050).

Abstract: Despite the effectiveness of vision-language supervised fine-tuning in enhancing the performance of vision large language models (VLLMs), existing visual instruction tuning datasets have the following limitations. (1) Instruction annotation quality: although existing VLLMs exhibit strong performance, instructions generated by those advanced VLLMs may still suffer from inaccuracies, such as hallucinations. (2) Instruction and image diversity: the limited range of instruction types and the lack of diversity in image data may impair the model's ability to generate diverse outputs that are closer to real-world scenarios. To address these challenges, we construct a high-quality, diverse visual instruction tuning dataset, MMInstruct, which consists of 973k instructions from 24 domains. There are four instruction types: judgment, multiple-choice, long visual question answering, and short visual question answering. To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at 1/6 the cost of manual construction. Through extensive experimental validation and ablation experiments, we demonstrate that MMInstruct can significantly improve the performance of VLLMs; e.g., the model fine-tuned on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks. The code and data will be available at https://github.com/yuecao0119/MMInstruct.

Keywords: instruction tuning; multi-modal; multi-domain; dataset; vision large language model

Classification code: H31 [Language and Linguistics: English]
