Research on an Author Name Disambiguation Model Based on a Precision-First, Recall-Second Strategy  (Cited by: 2)

A Model of Author Name Disambiguation Based on the Strategy of Targeting Precision before Recall


Authors: Shen Zhe (沈喆); Wang Yi (王毅) [1]; Ju Xiufang (鞠秀芳); Cheng Ying (成颖) [1]

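Affiliations: [1] School of Information Management, Nanjing University, Nanjing 210023; [2] Institute for Chinese Social Sciences Research and Assessment, Nanjing University, Nanjing 210093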

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2022, No. 4, pp. 350-363 (14 pages)

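Funding: National Social Science Fund of China project "Theoretical and Empirical Research on the Evaluation of Disruptive Innovation in Academic Literature" (20BTQ086).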

Abstract: A complete and accurate record of each scholar's academic output provides an essential data foundation for research in scientometrics and the evaluation of research talent. Because existing author name disambiguation (AND) methods based on machine learning models do not yet meet practical requirements, this study targets high-level research talent and exploits the high precision of rule-based methods, proposing a two-step AND model that targets precision first and recall second. Information about such researchers, including their resumes, research interests, and representative works, is easy to collect from the web, so the model can draw on a richer feature set, which supports its strong performance. The model was validated on recipients of the National Science Fund for Distinguished Young Scholars. The results show that it performs well in both precision and recall, reaching F1 scores of 0.93 and 0.95 on two different feature sets, a considerable improvement over the baseline model.

Keywords: author name disambiguation; rule-based disambiguation; high-level research talent; two-step method

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
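
The abstract describes the two-step strategy only at a high level, so the following Python sketch is an illustration of the general idea rather than the authors' model. It assumes a first pass that accepts papers only under strict rules (here, a match against the scholar's resume affiliations or representative works) and a second pass that widens recall using evidence from the confirmed papers (shared coauthors or overlapping research interests). Every class, field, rule, threshold, and data value below is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    # Hypothetical candidate record retrieved for an ambiguous author name.
    paper_id: str
    affiliation: str
    coauthors: frozenset
    keywords: frozenset


@dataclass(frozen=True)
class Profile:
    # Hypothetical external data about one scholar (resume, works, interests).
    affiliations: frozenset
    representative_ids: frozenset
    interests: frozenset


def precision_step(papers, profile):
    """Step 1 (assumed rules): accept only papers with strong evidence."""
    return {
        p.paper_id
        for p in papers
        if p.paper_id in profile.representative_ids
        or p.affiliation in profile.affiliations
    }


def recall_step(papers, confirmed_ids, profile, min_keyword_overlap=1):
    """Step 2 (assumed rules): expand from the confirmed set to raise recall."""
    confirmed = [p for p in papers if p.paper_id in confirmed_ids]
    known_coauthors = set().union(*(p.coauthors for p in confirmed))
    accepted = set(confirmed_ids)
    for p in papers:
        if p.paper_id in accepted:
            continue
        shares_coauthor = bool(p.coauthors & known_coauthors)
        shares_topic = len(p.keywords & profile.interests) >= min_keyword_overlap
        if shares_coauthor or shares_topic:
            accepted.add(p.paper_id)
    return accepted


def f1_score(predicted, truth):
    """F1 = 2PR / (P + R), computed over sets of paper identifiers."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented purely for illustration.
papers = [
    Paper("p1", "Nanjing University", frozenset({"Li"}), frozenset({"bibliometrics"})),
    Paper("p2", "Nanjing University", frozenset({"Zhao"}), frozenset({"name disambiguation"})),
    Paper("p3", "Other University", frozenset({"Li"}), frozenset({"citation analysis"})),
    Paper("p4", "Other University", frozenset({"Xu"}), frozenset({"organic chemistry"})),
]
profile = Profile(
    affiliations=frozenset({"Nanjing University"}),
    representative_ids=frozenset({"p1"}),
    interests=frozenset({"bibliometrics", "scientometrics"}),
)

step1 = precision_step(papers, profile)
final = recall_step(papers, step1, profile)
print(sorted(step1), sorted(final))
```

In this toy run the strict first pass keeps only the two papers tied to the resume affiliation, and the second pass recovers a third paper through a shared coauthor while still rejecting the namesake's unrelated paper. Ordering the steps this way means the recall pass builds on a set that is already highly reliable, which is the intuition behind targeting precision before recall.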

 
