Selection of Papers on the Origins of COVID-19 and Entity Annotation Based on Full Texts


Authors: XU Shuo; ZHANG Mengmeng; LIU Liyuan; WANG Congcong; SUN Rui; LI Yilin; XU Jinnan; AN Xin [2]

Affiliations: [1] School of Economics and Management, Beijing University of Technology, Beijing 100124; [2] School of Economics and Management, Beijing Forestry University, Beijing 100083

Source: Journal of Library and Information Science in Agriculture, 2023, Issue 1, pp. 87-98 (12 pages)

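Funding: National Natural Science Foundation of China project "Research on the Diffusion Mechanism of Micro-Entities Based on Full Text" (72004012); Beijing University of Technology 2022 "Graduate Ideological and Political Education in Research Teams: Anti-Epidemic Special Exploration Project".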

Abstract: [Purpose/Significance] Since the outbreak of COVID-19, the number of COVID-19-related papers published in China and abroad has grown rapidly. Organizing this literature provides data resources for research on the origin and transmission mechanism of SARS-CoV-2. However, existing COVID-19 datasets are undifferentiated collections of literature that are not classified by subfield, and coarse-grained information such as titles and authors does not support an in-depth understanding of research progress. This paper therefore creates a dataset for one COVID-19 subfield, the origins of SARS-CoV-2, together with a fine-grained entity dataset.

[Method/Process] First, the paper proposes a literature screening method based on an active learning model, which obtains more informative labeled samples at lower labeling cost so that the classifier generalizes better. Three base classifiers are considered: Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF). Four query strategies are considered: uncertainty sampling, expected error reduction, query by committee, and random sampling. Taking the origin of SARS-CoV-2 as an example subfield, the method efficiently and accurately locates relevant articles within a large body of literature. Second, the paper designs an annotation scheme covering 18 entity types, including not only entities common across the biological domain, such as genes, proteins and compounds, but also entities specific to the COVID-19 domain, such as coronaviruses and wild animals. Entities were annotated with the visual annotation tool BRAT. The annotation team consisted of one administrator and six annotators, and annotation was carried out in two rounds. A multi-k consistency index was used to score the consistency of the annotation results.

[Results/Conclusions] A dataset of papers on the origins of SARS-CoV-2 was constructed, containing 885 articles. Using the proposed annotation scheme, 99 full-text papers were annotated to build a fine-grained entity dataset containing 39,118 entities, which the authors describe as the largest and most comprehensive entity annotation dataset in the COVID-19 domain to date. The results of the active learning model show that the uncertainty sampling query strategy performs best. SVM, LR and RF ba
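The pool-based active learning loop with uncertainty sampling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the hand-written logistic regression, the hyperparameters, and the 2-D toy features are all assumptions made for illustration. In a real screening task the features would presumably be derived from paper text, and the `oracle` would be a human judging whether a paper concerns the origins of SARS-CoV-2. Only one of the three base classifiers (LR) and one of the four query strategies is shown.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    # Probability of the positive class; the last weight is the bias.
    return sigmoid(sum(wj * v for wj, v in zip(w, x)) + w[-1])


def train_lr(X, y, epochs=200, lr=0.5):
    # Fit logistic regression by batch gradient descent (hypothetical settings).
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, v in enumerate(xi):
                grad[j] += err * v
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def uncertainty_sampling(w, X, pool_idx):
    # Query the unlabeled item whose predicted probability is closest to 0.5.
    return min(pool_idx, key=lambda i: abs(predict_proba(w, X[i]) - 0.5))


def active_learning(X, oracle, seed_idx, n_queries):
    # seed_idx must contain at least one example of each class.
    labeled = list(seed_idx)
    labels = {i: oracle(i) for i in labeled}
    pool = [i for i in range(len(X)) if i not in labels]
    for _ in range(n_queries):
        w = train_lr([X[i] for i in labeled], [labels[i] for i in labeled])
        q = uncertainty_sampling(w, X, pool)
        labels[q] = oracle(q)  # ask the annotator for this one label
        labeled.append(q)
        pool.remove(q)
    # Retrain once more so the returned model uses every acquired label.
    w = train_lr([X[i] for i in labeled], [labels[i] for i in labeled])
    return w, labeled


if __name__ == "__main__":
    random.seed(0)
    # Toy data: class 1 clustered near (2, 2), class 0 near (-2, -2).
    y_true = [i % 2 for i in range(40)]
    X = [[random.gauss(2 if y else -2, 1.0), random.gauss(2 if y else -2, 1.0)]
         for y in y_true]
    w, labeled = active_learning(X, lambda i: y_true[i], [0, 1], n_queries=8)
    acc = sum((predict_proba(w, x) >= 0.5) == bool(y)
              for x, y in zip(X, y_true)) / len(X)
    print(f"labeled {len(labeled)} of {len(X)} items, accuracy {acc:.2f}")
```

Swapping `uncertainty_sampling` for a function that returns `random.choice(pool_idx)` gives the random sampling baseline, which is one way to compare query strategies under the same labeling budget.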

Keywords: COVID-19; data collection; origin of SARS-CoV-2; document screening; entity annotation

Classification number: G255.51 [Cultural Science: Library Science]

 
