Authors: ZHANG Hao-Yu (张浩宇); WANG Ji (王戟) [1]
Affiliations: [1] State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha 410072; [2] Artificial Intelligence Research Center, Defense Innovation Institute, Academy of Military Sciences, Beijing 100071
Source: Acta Automatica Sinica (《自动化学报》), 2023, No. 6, pp. 1181-1194 (14 pages)
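Funding: Supported by the National Key Research and Development Program of China (2017YFB1001802) and the National Natural Science Foundation of China (91948303, 62032024).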
Abstract: Synonym mining is an important task in natural language processing. To construct large-scale training corpora, existing studies extract synonym seeds with methods such as distant supervision and click graph filtering. These methods inevitably introduce noisy labels, which hinders the training of high-quality synonym mining models. In addition, most entity words are few-shot, their distribution differs across domains, and the training objective of pre-trained word embeddings is inconsistent with the synonym mining task, so word-level pre-trained embeddings rarely yield high-quality entity semantic representations for this task. To address these two issues, this paper proposes a synonym mining model that combines pair-wise character embeddings with a noise-robust learning framework. The model uses pre-trained pair-wise character embeddings to enhance entity semantic representations and, starting from the automatically annotated noisy labels, uses alternating optimization to estimate the true label distribution and generate pseudo-labels. These changes are intended to improve the representation ability and robustness of the model. Finally, WordNet is used to analyze and filter the noisy datasets, and experiments are conducted on synonym datasets of different sizes and domains. The experimental results and analysis show that, under various data distributions and noise ratios, the proposed model improves both synonym set-instance classification and synonym set generation compared with competitive baseline methods.
Keywords: synonym mining; noisy label learning; natural language processing; pair-wise character embeddings; information extraction
Classification No.: TP391.1 [Automation and Computer Technology - Computer Application Technology]