Matching Strategies for Institution Names in Literature Database (科技文献数据库中机构名称匹配策略研究)

Cited by: 12


Authors: Sun Haixia (孙海霞) [1,2]; Wang Lei (王蕾); Wu Yingjie (吴英杰) [2]; Hua Weina (华薇娜) [1]; Li Junlian (李军莲) [2]

Affiliations: [1] School of Information Management, Nanjing University, Nanjing 210093, China; [2] Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China

Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2018, No. 8, pp. 88-97 (10 pages)

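Funding: This work is an outcome of the Fundamental Research Funds for Central Public Welfare Research Institutes project "Research on an Author Affiliation Name Normalization Mechanism Based on Co-occurrence Analysis" (Grant No. 2016RC330006) and of the National Science and Technology Library preliminary R&D task "Research on Key Technologies for Automatic Construction and Maintenance of STKOS" (Grant No. XQYF0102), part of the "Next-Generation Open Knowledge Service System for National Science and Technology Innovation".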

Abstract: [Objective] To standardize how institution names are stored and managed in scientific and technical literature databases, this paper designs and implements matching strategies for institution names. [Methods] We introduce region, institution type, and naming features to build seven groups of match-decision rules in three categories, and design four hybrid strategies that combine these rules with edit (Levenshtein) distance. The strategies are implemented and evaluated on "author affiliation" data from the Chinese Biomedical Literature (CBM) database for 2006-2011. [Results] On a dataset of more than six million author-affiliation strings, we matched three types of institutions: higher education institutions, hospitals, and research institutes. The hybrid strategy that combines region and naming-feature rules performed best, with precision above 80% in all cases, recall of 64.82%, and an F-value of 71.66%. [Limitations] The auxiliary dictionaries and rules were built mainly from human experience and their coverage is incomplete; errors in institution name recognition affect the matching results; and the proposed strategy cannot effectively normalize institution names whose surface forms differ substantially. [Conclusions] The proposed matching strategy, based on rules and edit distance, can improve the standardization of scientific literature database construction.
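The hybrid idea in the abstract (rules combined with edit distance) can be illustrated with a minimal sketch. It is not the authors' implementation: the region dictionary `REGIONS`, the function names, the length-normalized similarity, and the 0.8 threshold are all assumptions made here for illustration, and only a single region rule is shown rather than the paper's seven rule groups.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1] by the longer length (an assumption)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# Hypothetical toy region dictionary; a real one would be far larger.
REGIONS = ("北京", "南京", "上海")


def extract_region(name: str) -> Optional[str]:
    """Return the first dictionary region found in the name, if any."""
    for region in REGIONS:
        if region in name:
            return region
    return None


def is_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Region rule first, then an edit-distance similarity threshold."""
    ra, rb = extract_region(a), extract_region(b)
    if ra and rb and ra != rb:
        return False  # conflicting regions veto the match
    return similarity(a, b) >= threshold
```

In this sketch the region rule acts as a veto: two names that differ by a single character, such as 北京大学医学部 and 南京大学医学部, score highly on edit distance alone but are rejected because their regions conflict.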

Keywords: Information retrieval; Institution name normalization; Similarity computation; Hybrid strategy; Literature database

Classification code: TP393 [Automation and Computer Technology: Computer Application Technology]

 
