Authors: Yangzhou LIU, Yue CAO, Zhangwei GAO, Weiyun WANG, Zhe CHEN, Wenhai WANG, Hao TIAN, Lewei LU, Xizhou ZHU, Tong LU, Yu QIAO, Jifeng DAI
Affiliations: [1] School of Computer Science, Nanjing University, Nanjing 210023, China; [2] Shanghai AI Laboratory, Shanghai 200232, China; [3] SenseTime Research, Shanghai 200233, China; [4] Department of Electronic Engineering, Tsinghua University, Beijing 100084, China; [5] School of Computer Science, Fudan University, Shanghai 200433, China; [6] Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong 999077, China; [7] School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Source: Science China (Information Sciences), English edition, 2024, No. 12, pp. 32-47 (16 pages)
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 62372223, 62376134); the National Key R&D Program of China (Grant No. 2022ZD0161300); the China Mobile Zijin Innovation Institute (Grant No. NR2310J7M); and the Youth PhD Student Research Project under the National Natural Science Foundation of China (Grant No. 623B2050).
Abstract: Despite the effectiveness of vision-language supervised fine-tuning in enhancing the performance of vision large language models (VLLMs), existing visual instruction tuning datasets have the following limitations. (1) Instruction annotation quality: although existing VLLMs exhibit strong performance, instructions generated by these advanced VLLMs may still suffer from inaccuracies, such as hallucinations. (2) Instruction and image diversity: the limited range of instruction types and the lack of diversity in image data may limit the model's ability to generate diverse outputs that are closer to real-world scenarios. To address these challenges, we construct a high-quality, diverse visual instruction tuning dataset, MMInstruct, which consists of 973k instructions from 24 domains. There are four instruction types: judgment, multiple-choice, long visual question answering, and short visual question answering. To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at 1/6 the cost of manual construction. Through extensive experimental validation and ablation experiments, we demonstrate that MMInstruct can significantly improve the performance of VLLMs; e.g., a model fine-tuned on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks. The code and data will be available at https://github.com/yuecao0119/MMInstruct.
Keywords: instruction tuning; multi-modal; multi-domain; dataset; vision large language model