Authors: QIN Jun[1]; LIU Lu; LIU Jing[1]; YE Zheng; ZHANG Zejin (College of Computer Science & Hubei Provincial Engineering Research Center for Intelligent Management of Manufacturing Enterprises & Hubei Provincial Engineering Research Center of Agricultural Blockchain and Intelligent Management, South-Central Minzu University, Wuhan 430074, China)
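Affiliation: [1] College of Computer Science & Hubei Provincial Engineering Research Center for Intelligent Management of Manufacturing Enterprises & Hubei Provincial Engineering Research Center of Agricultural Blockchain and Intelligent Management, South-Central Minzu University, Wuhan 430074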
Source: Journal of South-Central University for Nationalities: Natural Science Edition (《中南民族大学学报(自然科学版)》), 2023, No. 4, pp. 469-476 (8 pages)
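Funding: Young and Middle-aged Talents Training Program of the National Ethnic Affairs Commission (MZR20007); Hubei Provincial Science and Technology Major Project (2020AEA011); Wuhan Science and Technology Program, Applied Basic Frontier Project (2020020601012267).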
Abstract: Although applying BERT to Ad-hoc document retrieval improves task accuracy, it has two notable shortcomings. First, because BERT limits input length, long documents must be truncated, which loses document information. Second, Ad-hoc document retrieval datasets contain a considerable number of domain-specific words, and BERT does not learn the features of these words well. The LDA topic model has no input limit and can represent the complete semantic information of a document, so it is introduced into a joint enhancement model to learn representations of the domain-specific words and semantic content of documents, compensating for these shortcomings of BERT. The proposed RWT-BERT joint enhancement model builds an interaction network over the BERT and LDA topic model representations to mine deeper features from queries and long documents. Experimental results show that the model improves the main metrics on three datasets to varying degrees. In particular, on the Core17 dataset, nDCG@20 improves by 4.01% over Birch, currently the best-performing sentence-level Ad-hoc document retrieval model.
Keywords: document retrieval; pre-trained model; long documents; topic model; information retrieval
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]