Text similarity computing based on lexical semantic information (Cited by: 30)

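Affiliations: [1] School of Communication and Information Engineering, Shanghai University, Shanghai 200444; [2] Research Center for New Media and Wireless Technology, Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 200120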

Source: Application Research of Computers (《计算机应用研究》), 2018, No. 2, pp. 391-395 (5 pages)

Abstract: Traditional text similarity computation mostly relies on word matching, which ignores the semantic information of words, so the result depends largely on how many words the two texts share. Distributed word vectors can effectively express semantic relations between words, but most current word-vector-based text processing methods represent a text by concatenating its word vectors or by similar means, which cannot reflect how words are distributed in the corpus. To address these problems, this paper proposes a new method for computing text similarity. The method assumes that the elements of a statistics-based text vector are correlated and that these correlations can be expressed by the semantic similarity between words. Word similarity is therefore used to improve the cosine-based text similarity formula. The method is compared with three other methods on three popular datasets. The experimental results show that the proposed method outperforms the other methods in F1 score and accuracy.

Keywords: text similarity; word vector; term frequency-inverse document frequency (TF-IDF)

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
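
The abstract describes the improved cosine formula only in words. One common way to fold word-to-word similarity into cosine similarity is the soft cosine measure, sketched below. Whether the paper uses exactly this form is an assumption; the function name `soft_cosine`, the toy vocabulary, the term weights, and the similarity values are all hypothetical and chosen only for illustration.

```python
import math

def soft_cosine(a, b, sim):
    """Cosine similarity generalized with a term-term similarity matrix.

    a, b: equal-length lists of term weights (e.g. TF-IDF) over one vocabulary.
    sim:  square matrix with sim[i][j] = semantic similarity of terms i and j,
          assumed symmetric and positive semi-definite with sim[i][i] == 1.
          An identity matrix reduces this to the ordinary cosine.
    """
    def bilinear(x, y):
        # x^T * sim * y: every pair of terms contributes, weighted by similarity.
        return sum(x[i] * sim[i][j] * y[j]
                   for i in range(len(x)) for j in range(len(y)))

    norm_a = math.sqrt(max(bilinear(a, a), 0.0))
    norm_b = math.sqrt(max(bilinear(b, b), 0.0))
    denom = norm_a * norm_b
    return bilinear(a, b) / denom if denom > 0 else 0.0

# Hypothetical toy vocabulary: ["car", "automobile", "banana"]
doc_a = [1.0, 0.0, 0.0]  # mentions only "car"
doc_b = [0.0, 1.0, 0.0]  # mentions only "automobile"

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
semantic = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.0], [0.0, 0.0, 1.0]]  # made-up values

plain = soft_cosine(doc_a, doc_b, identity)  # 0.0: the texts share no words
soft = soft_cosine(doc_a, doc_b, semantic)   # 0.9: related words now count
```

With the identity matrix the two texts score 0 because they share no words, which is the word-matching limitation the abstract points out. Once the similarity matrix links "car" and "automobile", the score rises to 0.9. In practice the matrix entries would come from word-vector similarities rather than hand-picked numbers.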
