Knowledge Entity Extraction Method Combining Semantic Enhancement and Knowledge Distillation for Academic Literature


Authors: Wang Yulong; Qin Chunxiu[1,2]; Ma Xubu; Lyu Shuyue[1,2]; Li Fan (School of Economics and Management, Xidian University, Xi’an 710126; Shaanxi Information Resources Research Center, Xi’an 710126)

Affiliations: [1] School of Economics and Management, Xidian University, Xi’an 710126; [2] Shaanxi Information Resources Research Center, Xi’an 710126

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2025, No. 4, pp. 438-451 (14 pages)

Funding: Key Project of the National Social Science Fund of China, "Research on Scenario-Driven Fine-Grained Organization and Precision Service Models for Literature Resources in China's Key Core Fields" (22ATQ002).

Abstract: The accurate identification and extraction of the diverse knowledge entities contained in large volumes of academic literature is crucial for meeting the knowledge needs of researchers and advancing fine-grained knowledge discovery. To address data sparsity and imbalance in domain-specific entities within academic literature, this study proposes an improved knowledge entity extraction method that combines semantic enhancement and knowledge distillation. First, the method introduces a semantically enhanced teacher model. On the one hand, it constructs an embedding representation that integrates SciBERT (a BERT-based pretrained language model for scientific text; BERT: bidirectional encoder representations from transformers) and ELMo (embeddings from language models), combining global semantics with dynamic word-sense information to generate more comprehensive semantic representations and thereby improving the teacher model's ability to capture complex contextual information in domain-specific academic literature. On the other hand, a domain-specific pretrained word embedding model is used to select the top n words or phrases most semantically related to the knowledge entities; attention and gating mechanisms then dynamically weight the enhanced entity semantics, effectively mitigating data sparsity and the difficulty of modeling long-tail entity categories. Next, a set of heterogeneous single-entity teacher models generates probability distributions over the aggregated dataset, and these distributions guide the training of a student model. Finally, the study validates the proposed method on three publicly available datasets from the field of materials science. Experimental results show that the method achieves the highest micro F1 and macro F1 scores on all three datasets and exhibits strong robustness and generalization, particularly under entity data sparsity and imbalance.
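The semantic-enhancement step in the abstract (fusing two contextual embeddings, then attention-weighting the top-n related words and gating how much of that signal enters the token representation) can be sketched roughly as below. This is a minimal NumPy illustration with toy dimensions and random vectors standing in for SciBERT/ELMo outputs; the function names (`fuse_embeddings`, `enhance_with_neighbors`) and the concatenate-project fusion are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_embeddings(e_scibert, e_elmo, W):
    """Concatenate the two contextual embeddings and project them back to
    one vector, so global (SciBERT-style) and dynamic word-sense
    (ELMo-style) information share a single representation."""
    return np.tanh(np.concatenate([e_scibert, e_elmo]) @ W)

def enhance_with_neighbors(token_vec, neighbor_vecs, w_gate):
    """Attention-weight the top-n neighbor embeddings against the token,
    then use a sigmoid gate to decide how much of the aggregated neighbor
    signal is mixed into the token representation."""
    scores = neighbor_vecs @ token_vec            # similarity to each neighbor, (n,)
    attn = softmax(scores)                        # attention weights, sum to 1
    context = attn @ neighbor_vecs                # weighted neighbor summary
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ np.concatenate([token_vec, context]))))
    return gate * context + (1.0 - gate) * token_vec

# Toy demo: stand-in embeddings (real SciBERT/ELMo dims would be 768/1024).
d_s, d_e, d = 8, 6, 4
e_s, e_e = rng.normal(size=d_s), rng.normal(size=d_e)
W = rng.normal(size=(d_s + d_e, d)) * 0.1
token = fuse_embeddings(e_s, e_e, W)

neighbors = rng.normal(size=(5, d))               # top-5 related words (toy)
w_gate = rng.normal(size=2 * d) * 0.1
enhanced = enhance_with_neighbors(token, neighbors, w_gate)
```

The gate lets sparse or long-tail entity mentions borrow evidence from semantically related words without overwriting the token's own context, which is the intuition the abstract describes.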
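The distillation stage, in which probability distributions from heterogeneous single-entity teachers guide the student, can likewise be sketched with temperature-softened soft targets and a Hinton-style distillation loss. Averaging the teacher distributions and the names `aggregate_teachers` / `distillation_loss` are assumptions for illustration; the paper's exact aggregation and loss may differ.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def aggregate_teachers(teacher_logits, T=2.0):
    """Average the softened label distributions of several single-entity
    teacher models into one soft target per token."""
    return np.mean([softmax(l, T) for l in teacher_logits], axis=0)

def distillation_loss(student_logits, soft_targets, T=2.0):
    """Cross-entropy between the teachers' soft targets and the student's
    softened prediction, scaled by T^2 (standard distillation scaling)."""
    log_p = np.log(softmax(student_logits, T) + 1e-12)
    return -float(np.sum(soft_targets * log_p)) * T * T

# Toy example: two teachers scoring one token over three entity labels.
teachers = [np.array([2.0, 0.0, 0.0]), np.array([1.5, 0.5, 0.0])]
targets = aggregate_teachers(teachers)                        # soft labels, sum to 1
matched = distillation_loss(np.log(targets) * 2.0, targets)   # student mimics targets
uniform = distillation_loss(np.zeros(3), targets)             # indifferent student
```

A student whose softened distribution matches the aggregated targets incurs a strictly lower loss than an indifferent one, so minimizing this loss pulls the student toward the teachers' consensus.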

Keywords: semantic enhancement; knowledge distillation; knowledge entity extraction; academic literature

Classification Code: TP391 [Automation and Computer Technology / Computer Application Technology]

 
