MMInstruct: a high-quality multi-modal instruction tuning dataset with extensive diversity


Authors: Yangzhou LIU, Yue CAO, Zhangwei GAO, Weiyun WANG, Zhe CHEN, Wenhai WANG, Hao TIAN, Lewei LU, Xizhou ZHU, Tong LU, Yu QIAO, Jifeng DAI

Affiliations: [1] School of Computer Science, Nanjing University, Nanjing 210023, China; [2] Shanghai AI Laboratory, Shanghai 200232, China; [3] SenseTime Research, Shanghai 200233, China; [4] Department of Electronic Engineering, Tsinghua University, Beijing 100084, China; [5] School of Computer Science, Fudan University, Shanghai 200433, China; [6] Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong 999077, China; [7] School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Source: Science China (Information Sciences), 2024, Issue 12, pp. 32-47 (16 pages); Chinese journal title: 中国科学(信息科学)(英文版)

Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 62372223, 62376134); the National Key R&D Program of China (Grant No. 2022ZD0161300); the China Mobile Zijin Innovation Institute (Grant No. NR2310J7M); and the Youth PhD Student Research Project under the National Natural Science Foundation of China (Grant No. 623B2050).

Abstract: Despite the effectiveness of vision-language supervised fine-tuning in enhancing the performance of vision large language models (VLLMs), existing visual instruction tuning datasets have the following limitations. (1) Instruction annotation quality: although existing VLLMs exhibit strong performance, instructions generated by those advanced VLLMs may still suffer from inaccuracies, such as hallucinations. (2) Instruction and image diversity: the limited range of instruction types and the lack of diversity in image data may impair the model's ability to generate diverse outputs that are close to real-world scenarios. To address these challenges, we construct a high-quality, diverse visual instruction tuning dataset, MMInstruct, which consists of 973k instructions from 24 domains. There are four instruction types: judgment, multiple-choice, long visual question answering, and short visual question answering. To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at 1/6 the cost of manual construction. Through extensive experimental validation and ablation experiments, we demonstrate that MMInstruct can significantly improve the performance of VLLMs; e.g., a model fine-tuned on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks. The code and data will be available at https://github.com/yuecao0119/MMInstruct.

Keywords: instruction tuning; multi-modal; multi-domain; dataset; vision large language model

Classification number: H31 [Language and Writing: English]

 
