PeMeBench: A Chinese Pediatric Medical Question-Answering Benchmark


Authors: ZHANG Qian; CHEN Panfeng; FENG Linkun; LIU Shuyu; MA Dan; CHEN Mei [1,2]; LI Hui

Affiliations: [1] State Key Laboratory of Public Big Data, Guiyang 550000, Guizhou, China; [2] College of Computer Science and Technology, Guizhou University, Guiyang 550000, Guizhou, China

Source: Big Data Research (《大数据》), 2024, No. 5, pp. 28-44 (17 pages)

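Funding: National Natural Science Foundation of China (No. 61462012); 2023 Guizhou Provincial Science and Technology Program (Qiankehe Support [2023] General 276); 2023 Guizhou Provincial Science and Technology Achievement Application and Industrialization Program (Qiankehe Achievement [2023] General 010).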

Abstract: Large language models (LLMs) have shown significant application potential in the medical field, but evaluating their performance in medical scenarios remains a challenge. Existing medical benchmarks are predominantly multiple-choice, which makes it difficult to assess LLM performance in pediatric settings comprehensively and accurately. To address this, PeMeBench, the first Chinese pediatric medical question-answering benchmark, is proposed. Built on a dual-perspective evaluation dimension and drawing on diagnosis and treatment guideline books covering 10 pediatric disease systems, PeMeBench divides pediatric medical question answering into five subtasks: disease knowledge, treatment plans, medication dosages, disease prevention, and pharmacological effects. It comprises more than 10,000 open-ended questions and introduces a multi-granularity automated evaluation scheme that combines entity recall with sentence-level hallucination detection. The benchmark aims to assess LLM performance in basic pediatric care comprehensively and accurately, to analyze the models' potential limitations in depth, and to lay a solid foundation for making medical services more intelligent.
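The abstract names two evaluation signals, entity recall and sentence-level hallucination detection, but does not give their formulas. The following Python sketch shows one plausible reading and is not the authors' implementation. It assumes that entity recall is the share of reference entities found verbatim in the model answer, and that hallucination is measured as the share of answer sentences rejected by a checker. The function names, the `is_supported` callback, and the toy data are all hypothetical.

```python
from typing import Callable, Iterable, List


def entity_recall(reference_entities: Iterable[str], answer: str) -> float:
    """Share of reference entities that appear verbatim in the answer.

    Assumption: case-insensitive substring matching; the paper may use
    a different matching rule.
    """
    entities = {e.strip().lower() for e in reference_entities if e.strip()}
    if not entities:
        return 0.0
    text = answer.lower()
    hits = sum(1 for e in entities if e in text)
    return hits / len(entities)


def hallucination_rate(
    sentences: List[str], is_supported: Callable[[str], bool]
) -> float:
    """Share of answer sentences that the checker does not support.

    Assumption: `is_supported` is a hypothetical callback standing in for
    whatever sentence-level hallucination detector is actually used.
    """
    if not sentences:
        return 0.0
    flagged = sum(1 for s in sentences if not is_supported(s))
    return flagged / len(sentences)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from PeMeBench.
    reference = ["amoxicillin", "otitis media", "fever"]
    sentences = [
        "Amoxicillin is commonly used for otitis media.",
        "It cures all infections.",
    ]
    answer = " ".join(sentences)

    # Stand-in checker: a sentence counts as supported if it mentions
    # at least one reference entity.
    def supported(sentence: str) -> bool:
        return any(e in sentence.lower() for e in reference)

    print(entity_recall(reference, answer))
    print(hallucination_rate(sentences, supported))
```

Keeping the two scores separate reflects the multi-granularity idea: the first works at the entity level and the second at the sentence level. How the benchmark combines or weights them is not stated in the abstract.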

Keywords: pediatric medicine; benchmark; large language model; question answering

Classification: TP399 [Automation and Computer Technology: Computer Application Technology]

 
