A Chinese question-matching approach based on gradual machine learning

Authors: HE Xuejian; CHEN Anqi; GUO Zhiqiang; WANG Zhiru; CHEN Qun (Henan Forestry Vocational College, Luoyang 471002, China; School of Software, Northwestern Polytechnical University, Xi'an 710072, China; School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China)

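Affiliations: [1] Henan Forestry Vocational College, Luoyang 471002; [2] School of Software, Northwestern Polytechnical University, Xi'an 710072; [3] School of Computer Science, Northwestern Polytechnical University, Xi'an 710072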

Source: Chinese Journal of Engineering (《工程科学学报》), 2025, Issue 1, pp. 79-90 (12 pages)

Funding: General Program of the National Natural Science Foundation of China (62172335).

Abstract: Question matching aims to determine whether the intents of different questions are similar. In recent years, large pretrained language models have been used to mine the matching information implicit in the semantics of question pairs, achieving the best performance to date. However, because these deep learning models rely on the independent and identically distributed (i.i.d.) assumption, their performance in real-world scenarios is still constrained by the sufficiency of the training data and by distribution drift between the target data and the training data. This paper proposes a Chinese question-matching method based on gradual machine learning. Built on the gradual machine learning framework, the method extracts question features from different perspectives, constructs a factor graph that fuses the various kinds of feature information, and then learns gradually from easy to hard through iterative factor inference. For feature modeling, two types of features are designed and implemented: (1) keyword features based on TF-IDF (term frequency-inverse document frequency); (2) deep semantic features based on a DNN (deep neural network). Finally, the effectiveness of the proposed method is verified on the widely used Chinese benchmark datasets LCQMC and BQ corpus. Experiments show that, compared with purely deep learning models, the gradual machine learning approach effectively improves question-matching accuracy, and that its performance advantage grows as the amount of labeled training data decreases.

Question matching attempts to determine whether the intentions of two different questions are similar. Recently, with the development of large-scale pretrained DNN (deep neural network) language models, state-of-the-art question-matching performance has been achieved. However, due to the independent and identically distributed assumption, the performance of these DNN models in real-world scenarios is limited by the adequacy of the training data and the distribution drift between the target and training data. In this study, we propose a novel gradual machine learning (GML)-based approach for Chinese question matching. Beginning with initially labeled instances, this approach gradually labels target instances in order of increasing hardness via iterative factor inference on a factor graph. The proposed solution first extracts diverse semantic features from different perspectives and then constructs a factor graph by fusing the extracted features to facilitate gradual learning from easy to hard. In feature modeling, we extract and model two complementary types of features: 1) TF-IDF-based keyword features, which can capture the shallow semantic similarity between two questions; 2) DNN-based deep semantic features, which can capture the latent semantic similarity between two questions. We model keyword features as unary factors in a factor graph, which define their influence on the matching status of the two questions. The DNN-based features contain global and local features, where the global features correspond to a question pair's matching probability as estimated by a DNN model, and the local features correspond to the semantic similarity between two neighboring question pairs estimated by their vector representations in a DNN's embedding space. To facilitate gradual inference, we model the DNN-based global and local features as unary and binary factors, respectively, in a factor graph. Finally, we implement a GML solution for question matching based on an open-sourced GML inference engine. We validated the efficacy of the proposed approach thro
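The easy-to-hard labeling idea described in the abstract can be illustrated with a small sketch. This is not the authors' factor-graph inference, nor the open-source GML engine they mention. The function name `gradual_label`, the `pair_weight` parameter, the toy data, and the way evidence is combined (a similarity-weighted vote from already-labeled neighbors added to the log-odds of a per-pair prior) are all assumptions made for illustration. The prior stands in for a unary signal such as a DNN matching probability, and the neighbor similarities stand in for binary factors between question pairs.

```python
import math


def entropy(p):
    """Binary entropy of p in bits; used here as a proxy for instance hardness."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 for numerical safety."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gradual_label(unary_prob, neighbors, seed_labels, pair_weight=1.0):
    """Label instances one at a time, least uncertain first (hypothetical helper).

    unary_prob:  dict id -> prior matching probability for that question pair
    neighbors:   dict id -> list of (other_id, similarity in [0, 1])
    seed_labels: dict id -> 0/1 for the initially labeled (easy) instances
    Returns (labels, order), where order lists ids in the order they were labeled.
    """
    labels = dict(seed_labels)
    order = []
    unlabeled = set(unary_prob) - set(labels)
    while unlabeled:
        best = None  # (entropy, id, probability) of the easiest instance so far
        for i in sorted(unlabeled):
            score = logit(unary_prob[i])
            # Labeled neighbors push the score toward their own label.
            for j, sim in neighbors.get(i, []):
                if j in labels:
                    score += pair_weight * sim * (1 if labels[j] == 1 else -1)
            p = sigmoid(score)
            h = entropy(p)
            if best is None or h < best[0]:
                best = (h, i, p)
        _, chosen, p = best
        labels[chosen] = 1 if p >= 0.5 else 0
        order.append(chosen)
        unlabeled.remove(chosen)
    return labels, order


# Toy data (made up): "a" and "c" are seeds; "b", "d", "e" are target instances.
unary_prob = {"a": 0.95, "b": 0.55, "c": 0.10, "d": 0.50, "e": 0.50}
neighbors = {"b": [("a", 0.9)], "d": [("c", 0.5)], "e": [("b", 0.9)]}
seed_labels = {"a": 1, "c": 0}

labels, order = gradual_label(unary_prob, neighbors, seed_labels)
print(order, labels)
```

In this toy run, "e" has an uninformative prior and only becomes easier once its neighbor "b" has been labeled, which is the kind of evidence propagation that gradual inference relies on. A real implementation would re-estimate factor weights during inference rather than use a fixed `pair_weight`.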

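Keywords: natural language understanding; Chinese question matching; gradual machine learning; pretrained natural language models; factor graph inference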

Classification code: TG319 [Metallurgy and metal technology: metal pressure working]
