Authors: Sun Haixia [1,2], Wang Lei, Wu Yingjie [2], Hua Weina [1], Li Junlian [2]
Affiliations: [1] School of Information Management, Nanjing University, Nanjing 210093, China; [2] Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China
Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2018, No. 8, pp. 88-97 (10 pages)
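Funding: One of the research outcomes of the Fundamental Research Funds for Central Public Welfare Research Institutes project "Research on a Normalization Mechanism for Author Affiliation Names Based on Co-occurrence Analysis" (Grant No. 2016RC330006) and of the National Science and Technology Library (NSTL) "Next-Generation National Open Knowledge Service System for Science and Technology Innovation" preliminary R&D task "Key Technologies for the Automatic Construction and Maintenance of STKOS" (Grant No. XQYF0102).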
Abstract: [Objective] This paper designs and implements matching strategies for institution names, aiming to standardize how they are stored and managed in scientific literature databases. [Methods] We introduced region, type, and naming characteristics to build seven groups of match-decision rules in three categories, and designed four hybrid matching strategies that combine these rules with Levenshtein (edit) distance. The strategies were implemented and evaluated on the "author affiliation" data of the Chinese Biomedical Literature (CBM) database from 2006 to 2011. [Results] Matching was carried out on more than six million affiliation strings covering three types of institution: higher education institutions, hospitals, and research institutes. The hybrid strategy that combines the region and naming-characteristic rules with edit distance performed best, with precision above 80% in all cases, recall of 64.82%, and an F-value of 71.66%. [Limitations] The auxiliary dictionaries and rules were built mainly from human experience, so their coverage is incomplete. Errors in institution name recognition affect the matching results. The proposed strategies cannot effectively normalize institution names whose surface forms differ substantially. [Conclusions] The proposed rule-based and edit-distance-based matching strategy can improve the standardization of scientific research literature databases.
Keywords: information retrieval; institution name normalization; similarity computation; hybrid strategy; literature database
Classification No.: TP393 [Automation and Computer Technology: Computer Application Technology]
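
Illustrative sketch: the abstract describes combining match-decision rules with edit distance but does not give the rules themselves. The Python below is a minimal sketch of that general idea only, not the authors' implementation. The region list REGIONS, the single "conflicting regions never match" rule, the helper names, and the 0.8 threshold are all assumptions made for illustration.

```python
# Hypothetical toy region dictionary; the paper's auxiliary dictionaries are not given.
REGIONS = ("南京", "北京", "上海")


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))


def region_of(name: str):
    """Return the first known region mentioned in the name, or None."""
    for region in REGIONS:
        if region in name:
            return region
    return None


def hybrid_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Assumed rule first (regions must not conflict), then edit-distance similarity."""
    ra, rb = region_of(a), region_of(b)
    if ra and rb and ra != rb:
        return False
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    # One-character typo, same region: accepted by the similarity check.
    print(hybrid_match("南京大学信息管理学院", "南京大学信息管里学院"))
    # One-character difference, but the regions conflict: rejected by the rule.
    print(hybrid_match("北京大学医学部", "南京大学医学部"))
```

The second call shows why a rule layer can help: the two names differ by a single character, so edit distance alone would score them as near-identical, while the region rule keeps them apart.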