Study on the Construction of a Question-Answer Corpus Dataset for Chinese Medical Knowledge Large Language Models (中文医学知识大模型问答语料数据集构建研究)

Authors: LYU Tingyu; LI Xiaoying[1]; ZHANG Ying; LIU Yuyang; DU Jinhua; LI Xinyi; LUO Yan; TANG Xiaoli[1]; REN Huiling[1]; LIU Hui; YIN Hao[2]

Affiliations: [1] Institute of Medical Information & Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100005, China; [2] Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China

Source: Journal of Medical Informatics (《医学信息学杂志》), 2024, No. 5, pp. 20-25 (6 pages)

Funding: National Social Science Fund of China (Project No. 20BTQ062); Fundamental Research Funds for the Central Universities (Project No. 3332023163).

Abstract: Purpose/Significance: To construct a Chinese medical knowledge question-answer (Q&A) corpus dataset that provides a standardized evaluation benchmark for domain-specific medical large language models (LLMs), and thereby to improve the accuracy and efficiency of LLMs on Chinese medical Q&A tasks. Method/Process: Three Q&A datasets are constructed: one from knowledge in Chinese medical papers, one from explanations of medical terms, and one based on past questions from the Chinese medical licensing examination. Relevant open-source Chinese medical Q&A datasets are also collated. Result/Conclusion: The independently constructed Chinese medical knowledge Q&A corpus dataset enriches the sources of Chinese medical Q&A corpora and can serve as a standardized evaluation benchmark, promoting objective and comprehensive quantitative evaluation of LLMs in the medical field. Future work will draw on additional data, such as electronic medical records and online health communities, to provide stronger artificial intelligence support for implementing the Healthy China strategy.

Keywords: large language model; corpus dataset; model evaluation; medicine

Classification No.: R-058 [Medicine and Health]

 
