A Synonym Mining Algorithm Based on Pair-wise Character Embedding and Noisy Robust Learning (cited by: 1)

Authors: ZHANG Hao-Yu; WANG Ji [1]

Affiliations: [1] State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha 410072; [2] Artificial Intelligence Research Center, Defense Innovation Institute, Beijing 100071

Source: Acta Automatica Sinica, 2023, No. 6, pp. 1181-1194 (14 pages)

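Funding: Supported by the National Key Research and Development Program of China (2017YFB1001802) and the National Natural Science Foundation of China (91948303, 62032024).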

Abstract: Synonym mining is an important task in natural language processing. To build large-scale training corpora, existing studies extract synonym seeds through distant supervision, click-graph filtering, and similar methods, all of which inevitably introduce noisy labels and thereby hinder the training of high-quality synonym mining models. In addition, because many entity words are few-shot, because their distribution varies across domains, and because the training objective of pre-trained word embeddings is inconsistent with the synonym mining task, word-level pre-trained embeddings rarely yield high-quality semantic representations of entities in this task. To address these two problems, the paper proposes a synonym mining model that combines pair-wise character embeddings with a noise-robust learning framework. The model uses pre-trained pair-wise character embeddings to enhance entity semantic representations and, starting from automatically annotated noisy labels, estimates the true label distribution and generates pseudo-labels through alternating optimization; these changes are intended to improve the model's representation ability and robustness. Finally, WordNet is used to analyze and filter the noisy datasets, and experiments are conducted on synonym datasets of different sizes and domains. The experimental results and analysis show that, under various data distributions and noise ratios, the proposed model improves both synonym discrimination (set-instance classification) and synonym set generation compared with competitive baseline methods.
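The alternating optimization mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so this is not the authors' model: it is a generic alternating scheme on a one-feature logistic classifier, where step A fits the classifier to the current soft targets and step B re-estimates the targets by blending the noisy labels with the classifier's predictions. The function names, the blend weight `alpha`, the single numeric feature (standing in for some pair-similarity score), and the toy data are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, b, x):
    # Probability that a candidate pair is a synonym pair.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fit_step(w, b, xs, targets, lr=0.5, epochs=50):
    """Step A: gradient descent on cross-entropy with soft targets held fixed."""
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, t in zip(xs, targets):
            err = predict(w, b, x) - t  # d(loss)/d(logit)
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def relabel_step(w, b, xs, noisy, alpha=0.3):
    """Step B: re-estimate soft labels with the model held fixed.

    alpha (hypothetical) is the weight kept on the original noisy label.
    """
    return [alpha * y + (1 - alpha) * predict(w, b, x)
            for x, y in zip(xs, noisy)]

def alternate(xs, noisy, rounds=5):
    """Alternate steps A and B, then threshold soft labels into pseudo-labels."""
    w, b = [0.0] * len(xs[0]), 0.0
    targets = [float(y) for y in noisy]
    for _ in range(rounds):
        w, b = fit_step(w, b, xs, targets)
        targets = relabel_step(w, b, xs, noisy)
    pseudo = [1 if t >= 0.5 else 0 for t in targets]
    return w, b, pseudo

# Toy data (assumed): two tight clusters, with 2 of 10 labels flipped in each.
XS = [[-2.0]] * 10 + [[2.0]] * 10
CLEAN = [0] * 10 + [1] * 10
NOISY = list(CLEAN)
for i in (0, 1):
    NOISY[i] = 1
for i in (10, 11):
    NOISY[i] = 0

w, b, pseudo = alternate(XS, NOISY)
print("labels changed by relabeling:",
      sum(p != y for p, y in zip(pseudo, NOISY)))
```

On this toy set the fitted probability in each cluster settles near the cluster's noisy-label mean (about 0.2 or 0.8), so with `alpha = 0.3` the blended target of a flipped example crosses 0.5 and its pseudo-label reverts to the clean value. With a larger `alpha` the noisy label dominates and no correction happens, which is why the weight has to be chosen with the expected noise ratio in mind.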

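Keywords: synonym mining; noisy label learning; natural language processing; pair-wise character embedding; information extraction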

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]
