An Algorithm of Semantic Similarity Between Words Based on Word Single-meaning Embedding Model (一种基于词义向量模型的词语语义相似度算法)  Cited by: 20


Authors: LI Xiao-Tao (李小涛); YOU Shu-Juan (游树娟); CHEN Wai (陈维) (China Mobile Research Institute, Beijing 100053)

Affiliation: [1] China Mobile Research Institute, Beijing 100053

Source: Acta Automatica Sinica (《自动化学报》), 2020, No. 8, pp. 1654-1669 (16 pages)

Abstract: Word-embedding-based methods for computing semantic similarity between words have low accuracy in three cases: polysemous words, nonadjacent words, and synonyms. To address this, we propose a semantic similarity algorithm based on a word single-meaning embedding model. Unlike existing word embedding models, our model decomposes each polysemous word into a series of monosemous words, so that each vector corresponds one-to-one with a single word meaning. First, word sense disambiguation (WSD) of polysemous words in different contexts of the corpus is performed with the help of the prior sense classification information contained in Tongyici Cilin. Then, the word single-meaning embeddings are learned from the disambiguated corpus, giving a precise representation of each word meaning that, as far as we know, existing word embedding models cannot provide. Finally, the two words being compared are decomposed into their senses and expanded with synonyms, and their semantic similarity is computed by combining the word single-meaning embedding model with Tongyici Cilin. Experimental results show that our method significantly improves similarity computation accuracy in the three cases above.
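
Illustrative sketch (not from the paper): the final step described in the abstract, splitting each word into its senses and comparing sense vectors, might look roughly like the following. The sense tags, the toy vectors, and the rule of taking the maximum cosine similarity over sense pairs are all assumptions made for illustration. The abstract does not give the actual formula, and the synonym expansion and Tongyici Cilin components are omitted here.

```python
import math

# Hypothetical sense inventory: each word maps to sense-tagged monosemous
# tokens. The "word#sense" tagging scheme is invented for this example.
SENSES = {
    "apple": ["apple#fruit", "apple#company"],
    "pear": ["pear#fruit"],
    "phone": ["phone#device"],
}

# Hypothetical 3-dimensional sense vectors. In practice these would be
# learned from a sense-disambiguated corpus; the numbers here are made up.
SENSE_VECTORS = {
    "apple#fruit": [0.9, 0.1, 0.0],
    "apple#company": [0.0, 0.2, 0.9],
    "pear#fruit": [0.8, 0.2, 0.1],
    "phone#device": [0.1, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_similarity(w1, w2):
    """Return (score, sense1, sense2) for the most similar pair of senses.

    Taking the maximum over sense pairs is an assumed aggregation rule,
    not necessarily the one used in the paper.
    """
    scored = [
        (cosine(SENSE_VECTORS[s1], SENSE_VECTORS[s2]), s1, s2)
        for s1 in SENSES[w1]
        for s2 in SENSES[w2]
    ]
    return max(scored)


if __name__ == "__main__":
    print(word_similarity("apple", "pear"))
    print(word_similarity("apple", "phone"))
```

With a single vector per word, "apple" would have to sit somewhere between its fruit and company meanings. With one vector per sense, each comparison can select the sense that fits the other word.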

Keywords: word semantic similarity; Word2vec; Tongyici Cilin; word sense disambiguation; word single-meaning embedding

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
