Identification of Problem and Method in Scientific Papers Based on Formulaic Expression Desensitization and Enhanced Boundary Recognition  Cited by: 2


Authors: Zhang Yingyi [1]; Zhang Chengzhi [2]

Affiliations: [1] Department of Archives and E-government, School of Social Science, Soochow University, Suzhou 215123 [2] Department of Information Management, School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094

Source: Journal of the China Society for Scientific and Technical Information (情报学报), 2024, No. 6, pp. 712-732 (21 pages)

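Funding: National Natural Science Foundation of China project "Fine-Grained Algorithm Entity Extraction and Evaluation Based on the Full-Text Content of Academic Literature" (72074113).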

Abstract: Research problems and methods are key components of scientific papers and play a significant role in organizing, managing, and retrieving papers and in evaluating research output. To alleviate the dependence on formulaic expressions and the word boundary recognition errors that affect problem and method identification, this paper proposes a model that combines formulaic expression desensitization with enhanced boundary recognition. Specifically, formulaic expression desensitization is implemented through data augmentation, and enhanced boundary recognition is implemented with a pointer network and a sequence labeling model. With the growth of open access, researchers increasingly use the full text of scientific papers for entity recognition tasks. To demonstrate the necessity of using full text, this paper manually constructs annotated abstract and full-text datasets in the field of natural language processing and designs numerical and content-based metrics to analyze how the two datasets differ in problem and method identification results and in problem-method relation pair extraction results. Ten-fold cross-validation results show that the macro-averaged F1 score of the proposed model exceeds that of the SciBERT-BiLSTM-CRF baseline by 3.69 percentage points, a statistically significant difference. A comparison of entity recognition and relation pair extraction results between abstracts and full texts shows that problem and method entities in abstracts are semantically broader, whereas full texts contain more entities and relation pairs that describe model design and training details.
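
The abstract reports a macro-averaged F1 score and names word boundary errors as a key failure mode. As a rough illustration of how the two interact, the sketch below computes a span-level macro F1 over BIO tag sequences. The tag names `PROBLEM` and `METHOD`, the BIO scheme, and the exact-match criterion are assumptions made for this example; the abstract does not state how the paper scores predictions.

```python
from typing import List, Set, Tuple

# (start token, end token exclusive, entity type); a hypothetical span format
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one BIO tag sequence into a set of typed spans."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def macro_f1(gold: List[List[str]], pred: List[List[str]], types: List[str]) -> float:
    """Span-level F1 per entity type (exact boundary match), averaged over types."""
    tp = {t: 0 for t in types}
    fp = {t: 0 for t in types}
    fn = {t: 0 for t in types}
    for g_tags, p_tags in zip(gold, pred):
        g_spans, p_spans = bio_to_spans(g_tags), bio_to_spans(p_tags)
        for t in types:
            g_t = {s for s in g_spans if s[2] == t}
            p_t = {s for s in p_spans if s[2] == t}
            tp[t] += len(g_t & p_t)
            fp[t] += len(p_t - g_t)  # predicted spans with no exact gold match
            fn[t] += len(g_t - p_t)  # gold spans that were missed or mis-bounded
    scores = []
    for t in types:
        denom = 2 * tp[t] + fp[t] + fn[t]
        scores.append(2 * tp[t] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Assumed toy data: the METHOD span matches exactly, the PROBLEM span is
# predicted one token too short, so it counts as one false positive and one miss.
gold = [["B-METHOD", "I-METHOD", "O", "B-PROBLEM", "I-PROBLEM"]]
pred = [["B-METHOD", "I-METHOD", "O", "O", "B-PROBLEM"]]
score = macro_f1(gold, pred, ["PROBLEM", "METHOD"])
```

Under exact matching, a single boundary slip drives the F1 for that type to zero on this toy input, which suggests why boundary recognition can weigh heavily on a macro-averaged score.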

Keywords: knowledge entity recognition; research problem and method identification; pointer network; data augmentation

Classification number: G353.1 [Cultural Science: Information Science]

 
