PeMeBench: A Chinese Pediatric Medical Question-Answering Benchmark

Original title: PeMeBench:中文儿科医疗问答基准测试方法

Authors: ZHANG Qian; CHEN Panfeng; FENG Linkun; LIU Shuyu; MA Dan; CHEN Mei [1,2]; LI Hui

Affiliations: [1] State Key Laboratory of Public Big Data, Guiyang 550000, Guizhou, China; [2] College of Computer Science and Technology, Guizhou University, Guiyang 550000, Guizhou, China

Source: Big Data Research (《大数据》), 2024, No. 5, pp. 28-44 (17 pages)

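Funding: National Natural Science Foundation of China (No. 61462012); 2023 Guizhou Provincial Science and Technology Plan Project (黔科合支撑[2023]一般276); 2023 Guizhou Provincial Science and Technology Achievement Application and Industrialization Plan Project (黔科合成果[2023]一般010).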

Abstract: Large language models (LLMs) show great application potential in the medical field, but evaluating their performance in medical scenarios remains a challenge. Existing medical benchmarks are mostly multiple-choice, which makes it difficult to assess comprehensively and accurately how a model performs in pediatric settings. To address this, the authors propose PeMeBench, the first Chinese pediatric medical question-answering benchmark. PeMeBench is built on a dual-perspective evaluation dimension and draws on diagnosis and treatment guideline books covering 10 pediatric disease systems. It divides pediatric medical question answering into five subtasks: disease knowledge, treatment plans, medication dosages, disease prevention, and pharmacological effects. The benchmark contains more than 10,000 open-ended questions and introduces a multi-granularity automated evaluation scheme that combines entity recall with sentence-level hallucination detection. The aim is a comprehensive and accurate assessment of LLM performance in basic pediatric medicine, a closer analysis of the models' potential limitations, and a foundation for making medical services more intelligent.

Keywords: pediatric medicine; benchmark; large language model; question answering

Classification: TP399 [Automation and Computer Technology: Computer Application Technology]
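
Illustrative sketch: the abstract describes an evaluation scheme that combines entity recall with sentence-level hallucination detection but gives no implementation details. The Python below is only a minimal sketch of how those two granularities might be computed, not the authors' method. It assumes a reference entity list is available for each question, uses plain substring matching, and substitutes a crude proxy for hallucination detection (flagging sentences that mention no reference entity). The function names `entity_recall` and `flag_unsupported_sentences` are hypothetical.

```python
import re


def entity_recall(reference_entities, answer):
    """Fraction of distinct reference entities found verbatim in the answer."""
    unique = set(reference_entities)
    if not unique:
        return 0.0
    hits = sum(1 for entity in unique if entity in answer)
    return hits / len(unique)


def split_sentences(text):
    """Split on Chinese and Western sentence-ending punctuation."""
    parts = re.split(r"[。！？!?]+", text)
    return [part.strip() for part in parts if part.strip()]


def flag_unsupported_sentences(reference_entities, answer):
    """Crude stand-in for hallucination detection (an assumption, not the
    paper's detector): return sentences that mention no reference entity."""
    return [
        sentence
        for sentence in split_sentences(answer)
        if not any(entity in sentence for entity in reference_entities)
    ]


# Hypothetical reference entities and model answer, for illustration only.
entities = ["布洛芬", "对乙酰氨基酚"]
model_answer = "发热时可使用布洛芬。多喝热水即可痊愈。"

recall = entity_recall(entities, model_answer)
flagged = flag_unsupported_sentences(entities, model_answer)
```

Here `recall` is the share of reference entities the answer covers, and `flagged` lists the sentences a reviewer or a stronger detector would need to check. A real pipeline would likely need entity normalization (synonyms, dosage units) rather than exact substring matching.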
