Long document retrieval model based on the joint enhancement of BERT and topic model (基于BERT与主题模型联合增强的长文档检索模型)  Cited by: 3

Authors: QIN Jun (覃俊)[1]; LIU Lu (刘璐); LIU Jing (刘晶)[1]; YE Zheng (叶正); ZHANG Zejin (张泽谨)

Affiliation: [1] College of Computer Science & Hubei Provincial Engineering Research Center for Intelligent Management of Manufacturing Enterprises & Hubei Provincial Engineering Research Center of Agricultural Blockchain and Intelligent Management, South-Central Minzu University, Wuhan 430074, China

Source: Journal of South-Central University for Nationalities: Natural Science Edition (《中南民族大学学报(自然科学版)》), 2023, No. 4, pp. 469-476 (8 pages)

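Funding: Young and Middle-aged Talents Training Program of the National Ethnic Affairs Commission (MZR20007); Hubei Provincial Science and Technology Major Project (2020AEA011); Wuhan Science and Technology Program, Applied Basic Frontier Project (2020020601012267).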

Abstract: Applying BERT to ad-hoc document retrieval improves retrieval accuracy, but it has two notable shortcomings. First, BERT limits the length of its input, so long documents must be truncated and document information is lost. Second, ad-hoc document retrieval datasets contain a considerable number of domain-specific words, and BERT does not learn the features of these words well. An LDA topic model, by contrast, has no input-length restriction and can represent the semantics of a complete document. This paper therefore introduces LDA into a joint enhancement model that learns representations of the domain-specific words and semantic content of documents, compensating for these shortcomings of BERT. The proposed RWT-BERT joint enhancement model builds an interaction network over the BERT and LDA topic-model representations to mine deeper features of queries and long documents. Experimental results show that the model improves the main metrics on three datasets to varying degrees. In particular, on the Core17 dataset, nDCG@20 is 4.01% higher than that of Birch, currently the best-performing sentence-level ad-hoc document retrieval model.
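For readers unfamiliar with the nDCG@20 metric cited in the abstract, the following is a minimal sketch of how nDCG@k is commonly computed. It assumes the linear-gain formulation with a log2 rank discount. The abstract does not state which variant or evaluation tool the authors used, so the function names and the relevance labels below are illustrative assumptions, not the paper's evaluation code.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain).

    relevances[i] is the graded relevance of the document at rank i + 1.
    """
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """nDCG@k: DCG of the given ranking divided by the DCG of the ideal one.

    Simplifying assumption: `relevances` lists ALL judged documents for the
    query in ranked order, so the ideal ranking is just the sorted list.
    """
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(relevances, k) / ideal_dcg


# Hypothetical relevance labels for one query, in ranked order.
ranking = [2, 0, 1, 0, 0]
score = ndcg_at_k(ranking, k=20)
```

A ranking that already places documents in decreasing order of relevance scores 1.0, and any other ordering of the same labels scores lower. A system-level nDCG@20 is typically the mean of this per-query value over all queries.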

Keywords: document retrieval; pre-trained model; long document; topic model; information retrieval

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
