A Novel Approach for Text Similarity Computing Based on Random n-Grams  Cited by: 9


Authors: Wang Xianming (王贤明)[1], Hu Zhiwen (胡智文)[1], Gu Qiong (谷琼)[2]

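Affiliations: [1] Oujiang College, Wenzhou University, Wenzhou 325035; [2] School of Mathematics and Computer Science, Hubei University of Arts and Science, Xiangyang 441053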

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2013, No. 7, pp. 716-723 (8 pages)

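Funding: National Natural Science Foundation of China (61172084); Natural Science Foundation of Zhejiang Province (Y1100137); Yueqing Science and Technology Project (2011R003)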

Abstract: Text similarity computation is widely used in text applications such as plagiarism detection, automatic question answering, and text clustering. However, traditional methods are often language-dependent and spend considerable time analyzing documents and extracting their feature items. To address these shortcomings, this paper proposes a long-text similarity algorithm based on random n-grams (Random n-Gram, denoted R-Gram). The algorithm is language-independent and exploits both the fine-grained detection of short n-grams and the efficient detection of long n-grams. Experimental results show that the R-Gram-based algorithm is fast, simple to operate, and flexible in precision control, and that it is well suited to similarity computation for long texts.
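The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes character-level n-grams, uniformly random lengths and positions, and a simple containment ratio as the score. The names `random_ngrams` and `rgram_similarity` and the parameters `count`, `min_n`, `max_n`, and `seed` are hypothetical and chosen for illustration.

```python
import random


def random_ngrams(text, count, min_n, max_n, rng):
    """Sample `count` character n-grams with random lengths and positions.

    Assumption: lengths are uniform in [min_n, max_n] (capped at len(text))
    and start positions are uniform over all valid offsets.
    """
    grams = []
    for _ in range(count):
        n = rng.randint(min_n, min(max_n, len(text)))
        start = rng.randrange(len(text) - n + 1)
        grams.append(text[start:start + n])
    return grams


def rgram_similarity(a, b, count=200, min_n=2, max_n=8, seed=0):
    """Fraction of random n-grams drawn from `a` that also occur in `b`.

    Hypothetical scoring rule: the paper's evaluation function is not
    described in the abstract, so a plain hit ratio is used here.
    """
    if len(a) < min_n or not b:
        return 0.0
    rng = random.Random(seed)  # fixed seed keeps the score reproducible
    grams = random_ngrams(a, count, min_n, max_n, rng)
    hits = sum(1 for g in grams if g in b)
    return hits / len(grams)
```

Because the sketch works on raw characters, it needs no tokenizer or language-specific preprocessing, which is one way to read the abstract's claim of language independence. Under these assumptions, lowering `min_n` makes the check more fine-grained and raising `max_n` makes each hit stronger evidence of copied text. Note that the score is asymmetric: it measures how much of `a` appears in `b`.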

Keywords: text similarity; evaluation function; set; n-gram; R-Gram

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
