Automatic Noise Reduction of Scientific Domain Document Sets Using Positive-Unlabeled Learning (original title: 基于PU学习的科技领域文献集自动降噪方法研究)


Authors: Chen Guo[1], Yang Zeyu, Chen Jing, Shao Yu (Department of Information Management, School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094)

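Affiliation: [1] Department of Information Management, School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094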

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2025, No. 4, pp. 414-424 (11 pages)

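Funding: Jiangsu Provincial Social Science Fund project "Construction of a Methodological System for Scientific and Technical Intelligence Analysis on Incomplete Literature Resources" (24TQB001).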

Abstract: In domain literature analysis, document sets constructed by conventional means commonly contain a considerable proportion of documents unrelated to the domain (impurities), which reduces the reliability of the final results; noise reduction is therefore needed to remove them. Automatic noise reduction without manual annotation is a prerequisite for a scheme that generalizes across domains and is feasible in practice at low cost. Making full use of the characteristics of the original document set, this study recasts domain document set noise reduction as a classification problem, rather than a clustering problem, built on automatically constructed positive and negative sample sets. The idea is to take a group of "absolutely positive samples" that occur naturally in the document set and are easy to identify, and to apply positive-unlabeled (PU) learning to locate a group of reliable negative samples, which are then used to train the final classifier. Using Microsoft Academic Graph (MAG) journal document sets in artificial intelligence, economics, and immunology, comparative experiments examine how the choice of semantic representation method affects final noise-reduction performance. The study further constructs benchmark values for comparison and introduces normalized discounted cumulative gain (NDCG) as an evaluation metric, demonstrating the effectiveness of the scheme from three aspects: noise-reduction gain, usability of the final result, and effectiveness of document noise reduction across multiple task scenarios in scientific and technical intelligence analysis.
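The two-step idea summarized in the abstract (absolutely positive samples, then reliable negatives, then a final classifier) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the classifier or the semantic representation, so this sketch assumes documents are already embedded as numeric vectors, uses a Rocchio-style centroid ranking to pick reliable negatives, and uses a nearest-centroid rule as the final classifier. The function names and the `neg_fraction` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def pu_denoise(positives, unlabeled, neg_fraction=0.3):
    """Return one boolean per unlabeled vector: True = keep, False = noise.

    Assumption: `neg_fraction` (hypothetical) is the share of unlabeled
    documents treated as reliable negatives in step 1.
    """
    # Step 1: the unlabeled documents least similar to the positive
    # centroid are taken as reliable negatives.
    pos_center = centroid(positives)
    ranked = sorted(unlabeled, key=lambda v: cosine(v, pos_center))
    n_neg = max(1, int(len(unlabeled) * neg_fraction))
    reliable_negatives = ranked[:n_neg]

    # Step 2: a nearest-centroid classifier fitted on positives versus
    # reliable negatives labels every unlabeled document.
    neg_center = centroid(reliable_negatives)
    return [cosine(v, pos_center) >= cosine(v, neg_center) for v in unlabeled]


# Toy usage: two "absolutely positive" vectors and four unlabeled ones.
positives = [[1.0, 0.1], [0.9, 0.2]]
unlabeled = [[1.0, 0.0], [0.8, 0.3], [0.1, 1.0], [0.0, 0.9]]
keep_flags = pu_denoise(positives, unlabeled)
```

In this toy input the first two unlabeled vectors point in the same direction as the positives and are kept, while the last two are flagged as noise. In practice the vectors would come from whichever semantic representation is being compared, and the final classifier could be any supervised model.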

Keywords: domain analysis; domain document set; bibliometrics; dataset noise reduction; PU learning

Classification number: G250.2 [Cultural Sciences: Library Science]

 
