Approach of Structured Information Extraction for Medical Text Data (cited by: 16)

Authors: YANG Bing (杨兵); NIE Tie-zheng (聂铁铮); SHEN De-rong (申德荣); KOU Yue (寇月); YU Ge (于戈) (School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China)

Affiliation: [1] School of Computer Science and Engineering, Northeastern University

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2019, No. 7, pp. 1479-1485 (7 pages)

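Funding: National Key Research and Development Program of China (2018YFB1003404); National Natural Science Foundation of China (61672142, 61402213, U1435216); Fundamental Research Funds for the Central Universities (N150408001-3, N150404013)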

Abstract: Medical texts are an important information carrier in the medical field and provide key data support for clinical diagnosis and pathological research. However, text written in natural language is usually unstructured, which makes it hard for machines to understand and process automatically. Chinese medical text is particularly challenging for structured information extraction: it is highly specialized, requires rich domain knowledge, and is written largely in short sentences. This paper therefore proposes a method for extracting structured information from medical text data. The method first applies text clustering and keyword extraction to obtain the expression terms commonly used in medical descriptions, then uses the resulting medical term lexicon to assist Chinese word segmentation and improve segmentation quality on Chinese medical text. Next, it analyzes the semantic dependencies between words and builds a dependency syntax tree. Finally, it identifies and extracts the key indicators and their corresponding values from the tree, yielding structured key-value data. Experiments on real medical imaging report texts show that the method effectively improves segmentation quality on Chinese medical text, with accuracy of up to 98.24%, and performs well on structured information extraction, reaching at best 83.76% accuracy and 88.09% recall. The method covers a variety of dependency grammar patterns and is broadly applicable.
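The pipeline summarized in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the segmenter, the parser, or the extraction rules, so everything below is an assumption made for illustration. The hypothetical `segment` function uses plain forward maximum matching against a small hand-made term set, the dependency annotations are written by hand rather than produced by a parser, the labels `ATT`, `SBV`, and `HED` follow an LTP-style convention, and the hypothetical `extract_pairs` rule (subject plus its attributive modifiers as the key, the governing predicate as the value) is only one simple pattern.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


def segment(text: str, terms: Set[str]) -> List[str]:
    """Forward maximum matching: take the longest dictionary term at each position,
    falling back to a single character when nothing matches."""
    max_len = max((len(t) for t in terms), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in terms:
                tokens.append(piece)
                i += size
                break
    return tokens


@dataclass
class Token:
    word: str
    head: int  # index of the head token, -1 for the root
    rel: str   # dependency label (assumed LTP-style: "ATT", "SBV", "HED")


def extract_pairs(tokens: List[Token]) -> Dict[str, str]:
    """Pair each subject (prefixed by its attributive modifiers) with its predicate."""
    pairs: Dict[str, str] = {}
    for i, tok in enumerate(tokens):
        if tok.rel != "SBV":
            continue
        # Attributive modifiers attached directly to the subject, in text order
        modifiers = [t.word for t in tokens[:i] if t.rel == "ATT" and t.head == i]
        key = "".join(modifiers + [tok.word])
        pairs[key] = tokens[tok.head].word
    return pairs


# A domain term in the lexicon keeps "胆囊壁" (gallbladder wall) as one token.
print(segment("胆囊壁毛糙", {"胆囊壁", "毛糙"}))  # ['胆囊壁', '毛糙']
print(segment("胆囊壁毛糙", {"胆囊", "毛糙"}))    # ['胆囊', '壁', '毛糙']

# "肝脏大小正常" (liver size is normal), with hand-written dependency annotations.
words = segment("肝脏大小正常", {"肝脏", "大小", "正常"})
heads = [1, 2, -1]
rels = ["ATT", "SBV", "HED"]
tokens = [Token(w, h, r) for w, h, r in zip(words, heads, rels)]
print(extract_pairs(tokens))  # {'肝脏大小': '正常'}
```

The second `segment` call shows why a domain lexicon matters: without the term "胆囊壁", the indicator name is split in two, and any later key-value rule would have to reassemble it.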

Keywords: structured information extraction; text clustering; keyword extraction; semantic dependency

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
