A Fast Spark-Based Method for Detecting Paper Similarity (cited by: 2)

An Approach for Scientific Paper Similarity Rapid Detection Based on Spark

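Affiliations: [1] School of Information Management, Nanjing University, Nanjing 210023; [2] Jiangsu Key Laboratory of Data Engineering and Knowledge Service (Nanjing University), Nanjing 210023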

Source: Library and Information Service (《图书情报工作》), 2015, No. 11, pp. 134-142 (9 pages)

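Funding: This work is one of the research outcomes of the National Social Science Fund of China Major Project "Research on a Rapid-Response Intelligence System for Emergency Decision-Making" (No. 13&ZD174); the National Social Science Fund of China project "Research on Library Semantic Cloud Services Based on Linked Data" (No. 12CTQ009); the Jiangsu Social Science Fund Youth Project "Research on Digital Reading Promotion Based on Semantic Cloud Services" (No. 14TQC003); the Fundamental Research Funds for the Central Universities project "Research on Social Tagging Knowledge Organization Based on User Pragmatic Analysis" (No. 1435003); and the Natural Science Research General Program of Jiangsu Higher Education Institutions project "Research on Social Tagging Knowledge Organization Based on Semantic Disambiguation" (No. 15KJB520013).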

Abstract: [Purpose/significance] This paper detects, within a large-scale collection of known texts, the texts that are similar to a paper under examination and computes their degree of similarity, in order to meet the second-level response time required by online paper similarity detection. [Method/process] Following a divide-and-conquer strategy, the set of known text sentences is preprocessed with soft clustering based on an orthogonal basis, and an inverted index is built for each resulting cluster. Similarity detection is then run on the fast data processing platform Spark, and the similarity between the paper under examination and the known texts is computed from characters combined with phrases. [Result/conclusion] In an experiment on a known text set of two million items, aggregated over four types of papers under examination, the proposed algorithm combining inverted indexes with soft clustering achieves a precision P of 100.0%, a recall R of 93.6%, and a harmonic mean F of 96.7%. The harmonic mean F is about 10% higher than that of the LCS similarity detection algorithm and about 23% higher than that of the Simhash algorithm. In terms of speed, a paper of about 5,000 characters is checked in about 6.5 seconds, nearly 300 times faster than the Simhash algorithm and about 4,000 times faster than the LCS algorithm. The experimental results also indicate that the Spark-based distributed parallel similarity detection algorithm scales well.
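The retrieval idea in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the abstract does not specify the orthogonal-basis soft clustering or the character-plus-phrase similarity measure, so this sketch assumes one inverted index over character bigrams and Jaccard overlap as a stand-in similarity. The function names (`bigrams`, `build_inverted_index`, `detect`, `f_measure`) and the `threshold` parameter are hypothetical. The last function only checks that the reported F value follows from the reported P and R.

```python
from collections import defaultdict


def bigrams(sentence):
    """Return the set of character bigrams in a sentence (whitespace removed)."""
    s = "".join(sentence.split())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def build_inverted_index(sentences):
    """Map each bigram to the ids of the known sentences that contain it."""
    index = defaultdict(set)
    for sid, sentence in enumerate(sentences):
        for gram in bigrams(sentence):
            index[gram].add(sid)
    return index


def detect(query, sentences, index, threshold=0.5):
    """Return (sentence id, Jaccard similarity) pairs at or above threshold.

    Assumption: Jaccard overlap of bigrams stands in for the paper's
    character-plus-phrase similarity, which the abstract does not define.
    """
    q = bigrams(query)
    # Only sentences sharing at least one bigram with the query are scored,
    # which is what lets an inverted index avoid a full scan.
    candidates = set()
    for gram in q:
        candidates |= index.get(gram, set())
    hits = []
    for sid in sorted(candidates):
        k = bigrams(sentences[sid])
        score = len(q & k) / len(q | k)
        if score >= threshold:
            hits.append((sid, score))
    return hits


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


known = [
    "spark is fast",
    "inverted index per cluster",
    "soft clustering of sentences",
]
index = build_inverted_index(known)
hits = detect("inverted index per cluster", known, index)
f_percent = round(f_measure(1.0, 0.936) * 100, 1)
```

In a distributed setting, one such index per cluster could be held in a separate partition and queried in parallel, but how the paper maps this onto Spark is not stated in the abstract.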

Keywords: paper similarity detection; Spark; fast data processing; orthogonal basis; soft clustering; inverted index

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
