Distributed Word Embedding via Multi-Source Information Fusion (基于多源信息融合的分布式词表示学习)

Cited by: 4

Authors: YE Zhonglin (冶忠林); ZHAO Haixing (赵海兴) [1,2,3,4]; ZHANG Ke (张科); ZHU Yu (朱宇)

Affiliations: [1] College of Computer, Qinghai Normal University, Xining, Qinghai 810008, China; [2] College of Computer Science, Shaanxi Normal University, Xi'an, Shaanxi 710062, China; [3] Provincial Key Laboratory of Tibetan Information Processing and Machine Translation, Xining, Qinghai 810008, China; [4] Key Laboratory of Tibetan Information Processing, Ministry of Education, Xining, Qinghai 810008, China

Source: Journal of Chinese Information Processing (《中文信息学报》), 2019, No. 10, pp. 18-30 (13 pages)

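Funding: National Natural Science Foundation of China (11661069, 61763041, 61663041); Program for Changjiang Scholars and Innovative Research Team (IRT_15R40); Fundamental Research Funds for the Central Universities (2017TS045); Project of the Qinghai Provincial Key Laboratory of Tibetan Information Processing and Machine Translation (2013-Z-Y17)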

Abstract: Distributed word representation learning uses a neural network framework to train low-dimensional, compressed, dense representation vectors for words. Neural word representation models of this kind have three shortcomings: (1) rare words lack sufficient context training data, so their learned vectors do not adequately reflect their semantics in the corpus; (2) when an antonym of the center word appears in its context, words with opposite meanings are assigned nearby vectors; (3) when two synonyms never appear in each other's contexts, their learned representations end up far apart in the vector space. To address these three problems, this paper proposes MSWE, a distributed word representation learning algorithm based on multi-source information fusion, which makes four improvements: (1) it explicitly constructs a context feature matrix for words, which retains the co-occurrence information of rare words and their context words in the language model and therefore reflects the structural semantic associations between words more accurately; (2) it builds an attribute (property) semantic feature matrix from the description or explanation texts of words, which compensates for the insufficient training caused by sparse context features; (3) it builds synonym and antonym feature matrices from synonym and antonym information, so that synonyms lie closer together and antonyms lie farther apart in the embedding space; (4) it fuses the multi-source feature matrices with an inductive matrix completion algorithm and trains low-dimensional word embeddings. Experimental results show that MSWE learns effective feature factors from the multi-source word feature matrices and achieves excellent performance on six word similarity evaluation datasets.
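The four steps listed in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation: the toy vocabulary, the glosses, the helper names (`cooccurrence`, `gloss_features`, `relation_features`, `imc_fit`), and the choice of which matrix acts as the target and which act as side features are all assumptions made for illustration. The fusion step uses a generic inductive matrix completion objective, fitting `M ≈ X W Hᵀ Yᵀ` by plain gradient descent, which may differ from the objective and solver used in MSWE.

```python
import numpy as np

def cooccurrence(sentences, vocab, window=2):
    """Symmetric word-context co-occurrence counts within a fixed window."""
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for tokens in sentences:
        ids = [index[t] for t in tokens if t in index]
        for pos, i in enumerate(ids):
            for j in ids[max(0, pos - window):pos]:
                M[i, j] += 1.0
                M[j, i] += 1.0
    return M

def gloss_features(glosses, vocab):
    """Bag-of-words features built from each word's description text."""
    terms = sorted({t for w in vocab for t in glosses[w].split()})
    col = {t: j for j, t in enumerate(terms)}
    X = np.zeros((len(vocab), len(terms)))
    for i, w in enumerate(vocab):
        for t in glosses[w].split():
            X[i, col[t]] += 1.0
    return X

def relation_features(vocab, synonyms, antonyms):
    """Signed matrix: +1 for synonym pairs (and self), -1 for antonym pairs."""
    index = {w: i for i, w in enumerate(vocab)}
    S = np.eye(len(vocab))
    for a, b in synonyms:
        S[index[a], index[b]] = S[index[b], index[a]] = 1.0
    for a, b in antonyms:
        S[index[a], index[b]] = S[index[b], index[a]] = -1.0
    return S

def unit_spectral(A):
    """Scale a matrix so its largest singular value is 1 (keeps step sizes tame)."""
    return A / np.linalg.norm(A, 2)

def imc_fit(M, X, Y, rank=2, lr=0.05, reg=0.01, steps=500, seed=0):
    """Fit M ~ X W H^T Y^T by gradient descent on a regularized squared loss."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((X.shape[1], rank))
    H = 0.1 * rng.standard_normal((Y.shape[1], rank))
    losses = []
    for _ in range(steps):
        R = X @ W @ H.T @ Y.T - M                      # residual
        losses.append(0.5 * np.sum(R ** 2)
                      + 0.5 * reg * (np.sum(W ** 2) + np.sum(H ** 2)))
        grad_W = X.T @ R @ Y @ H + reg * W
        grad_H = Y.T @ R.T @ X @ W + reg * H
        W -= lr * grad_W
        H -= lr * grad_H
    return W, H, losses

# Toy data (hypothetical, for illustration only).
vocab = ["good", "great", "bad", "hot", "cold", "warm"]
sentences = [["good", "great", "hot"], ["bad", "cold"],
             ["hot", "warm", "good"], ["cold", "bad", "great"]]
glosses = {"good": "of high quality", "great": "of very high quality",
           "bad": "of poor quality", "hot": "having high temperature",
           "cold": "having low temperature",
           "warm": "having fairly high temperature"}

M = cooccurrence(sentences, vocab)                      # step 1: context matrix
X = gloss_features(glosses, vocab)                      # step 2: attribute features
S = relation_features(vocab, [("good", "great"), ("hot", "warm")],
                      [("good", "bad"), ("hot", "cold")])  # step 3: syn/antonyms

Mn, Xn, Sn = unit_spectral(M), unit_spectral(X), unit_spectral(S)
W, H, losses = imc_fit(Mn, Xn, Sn)                      # step 4: fusion

# One possible embedding (an assumption): concatenate both projected views.
emb = np.hstack([Xn @ W, Sn @ H])
```

In this sketch each row of `emb` is a 4-dimensional vector for one vocabulary word. Using the co-occurrence matrix as the completion target and the gloss and relation matrices as side information is only one plausible arrangement; the full paper should be consulted for how MSWE actually combines the three sources.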

Keywords: word representation learning; word representation; word embedding; word vector; word feature learning

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
