Construction of a Question Taxonomy and an Annotated Chinese Corpus for Diabetes Question Classification

Authors: QIAN Xiaobo; XIE Wenxiu; LONG Shaopei; LAN Murong; MU Yuanyuan; HAO Tianyong (School of Computer Science, South China Normal University, Guangzhou, Guangdong 510631, China; Department of Computer Science, City University of Hong Kong, Hong Kong 999077, China; School of Foreign Languages, Chaohu University, Hefei, Anhui 238024, China)

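Affiliations: [1] School of Computer Science, South China Normal University, Guangzhou, Guangdong 510631, China [2] Department of Computer Science, City University of Hong Kong, Hong Kong 999077, China [3] School of Foreign Languages, Chaohu University, Hefei, Anhui 238024, China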

Source: Journal of Chinese Information Processing, 2024, No. 12, pp. 54-63 (10 pages)

Funding: National Social Science Fund of China (19BYY125).

Abstract: As a typical chronic disease, diabetes has become one of the major global public health challenges. With the rapid development of the Internet, the large population of type 2 diabetes patients and high-risk individuals increasingly needs access to professional diabetes information, and automated diabetes question answering (QA) services play an increasingly important role in their daily health care. However, these services suffer from prominent problems such as the lack of fine-grained question classification. This paper designs a new diabetes question taxonomy that represents user intent, comprising 6 coarse-grained categories and 23 fine-grained categories. Based on this taxonomy, the paper crawls two professional medical QA websites to construct DaCorp, a Chinese diabetes QA corpus containing 122,732 question-answer pairs, and manually annotates 8,000 of its diabetes questions to form a fine-grained annotated diabetes dataset. To evaluate the quality of the annotated dataset, the paper implements 8 mainstream baseline classification models. Experimental results show that the best-performing model achieves an accuracy of 88.7%, demonstrating the validity of the annotated dataset and the effectiveness of the proposed taxonomy. DaCorp, the annotated diabetes dataset, and the annotation guidelines have been released online and are freely available for academic research.

Keywords: diabetes; question classification; taxonomy; corpus construction

Classification code: TP391 [Automation and Computer Technology - Computer Application Technology]

 
