Unsupervised phrase extraction method based on pre-trained model with soft prompt tuning

Authors: LONG Biao (龙彪); XIAN Yantuan (线岩团)[2]; GUO Junjun (郭军军); HUANG Yuxin (黄于欣) (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China)

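Affiliations: [1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China [2] Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, Yunnan, China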

Source: Microelectronics & Computer (《微电子学与计算机》), 2025, No. 1, pp. 17-25 (9 pages)

Funding: National Natural Science Foundation of China (62266028); Major Science and Technology Special Project of Yunnan Province (202202AD080003).

Abstract: Key phrases are words or phrases that carry the essential information in a document and summarize its topic and main content. Key phrase extraction is an important task in information retrieval and text search. Current mainstream extraction methods are multi-stage, and the candidate phrase selection in the first stage strongly influences the final result. Moreover, because pre-trained language models are not designed specifically for phrase extraction, a simple embedding comparison cannot accurately measure the relevance between a phrase and a document. To address these problems, an unsupervised key phrase extraction method based on soft prompt tuning is proposed. First, prefix vectors are introduced to model the noise information and the semantic information of each word, and a linear transformation is applied to the output of the pre-trained model for further feature extraction. Second, KL divergence is used to enlarge the difference between the two kinds of information for each word, and a variance loss is used to prevent model collapse. Finally, the importance of each word is determined in a single step from the degree of difference between the two kinds of information, which yields the key phrases. Experiments on the Inspec and SemEval2017 datasets show that, compared with existing methods, the F1 score improves by 1% on average.
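The scoring idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not give the exact loss formulas or the rule for assembling phrases, so everything below is an assumption made for illustration. In particular, the per-word "semantic" and "noise" feature vectors are taken as plain lists of logits (standing in for the linearly transformed model outputs under two prefixes), the variance loss is written as a hinge on the standard deviation of the scores, and phrases are formed by merging adjacent words whose score exceeds a hypothetical threshold.

```python
import math


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def word_scores(semantic_logits, noise_logits):
    """Assumed importance score: KL between each word's two distributions."""
    return [
        kl_divergence(softmax(s), softmax(n))
        for s, n in zip(semantic_logits, noise_logits)
    ]


def variance_penalty(scores, target_std=1.0):
    """Assumed anti-collapse term: positive when the scores are nearly constant."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return max(0.0, target_std - math.sqrt(var))


def training_loss(semantic_logits, noise_logits, var_weight=1.0):
    """Assumed objective: enlarge the mean KL while keeping the scores spread out."""
    scores = word_scores(semantic_logits, noise_logits)
    mean_kl = sum(scores) / len(scores)
    return -mean_kl + var_weight * variance_penalty(scores)


def extract_phrases(words, scores, threshold):
    """Merge runs of adjacent words scoring above a (hypothetical) threshold."""
    phrases, current = [], []
    for word, score in zip(words, scores):
        if score > threshold:
            current.append(word)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Toy input: the first two words differ strongly between the two views.
words = ["prompt", "tuning", "is", "useful"]
semantic = [[2.0, 0.0], [2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
noise = [[0.0, 2.0], [0.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
scores = word_scores(semantic, noise)
phrases = extract_phrases(words, scores, threshold=0.5)
```

In this toy input, words whose two distributions coincide receive a score of zero, while words whose distributions diverge score higher and are merged into a single candidate phrase. In the setting the abstract describes, a loss of this kind would presumably be minimized with respect to the prefix vectors and the linear layer, which this sketch does not model.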

Keywords: phrase extraction; soft prompt tuning; one-step; information score difference

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
