A plagiarism detection approach for Chinese documents based on semantic textual similarity

Authors: HU Buhuan, ZHANG Jing [1], ZHANG Ling [1] (Guangdong Province Key Laboratory of Computer Network, College of Computer Science and Technology, South China University of Technology, Guangzhou 510006, Guangdong Province, P.R. China)

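Affiliation: [1] Guangdong Province Key Laboratory of Computer Network, School of Computer Science and Engineering, South China University of Technology, Guangzhou 510641, Guangdong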

Source: Journal of Shenzhen University (Science and Engineering), 2020, Issue S01, pp. 107-111 (5 pages)

Funding: China Education and Research Network (CERNET) funded project (NGII20190615).

Abstract: Plagiarists often evade detection by rewriting text, for example through synonym substitution or paraphrasing. To address this, we propose a plagiarism detection approach for Chinese documents based on semantic textual similarity. Each document is first split into sentences, and a word2vec model represents the words of each sentence as word vectors, which serve as the input to a convolutional neural network (CNN). The CNN extracts and filters sentence features, computes the difference between the two sentences of a pair, and outputs the similarity of the pair; sentence pairs with high similarity are treated as plagiarism candidates. Copy-and-paste documents and semantically similar documents are used as the dataset to verify the proposed method and compare it with the traditional fingerprint feature extraction method. The proposed method is tested on a large, publicly available Tencent Cloud text similarity dataset and applied to plagiarism detection in student homework. The results show that although the traditional sliding-window fingerprint feature extraction method can find identical fragments in two documents fairly accurately, it is susceptible to noise on semantically similar text, whereas the proposed semantic similarity approach can find the semantically similar parts of documents.
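The sentence-level pipeline and the fingerprint baseline mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names (`split_sentences`, `winnow`, `fingerprint_similarity`, `candidate_pairs`), the k-gram length, the window size, and the threshold are all assumptions chosen for the example. The baseline shown is a generic sliding-window (winnowing-style) fingerprint over character k-grams. The `similarity` argument is deliberately pluggable, so a trained word2vec plus CNN scorer of the kind the abstract describes could be passed in its place.

```python
import hashlib
import re
from itertools import product

# Sentence-ending punctuation, Chinese and Western (an assumed, simplified set).
_SENTENCE = re.compile(r"[^。！？!?；;]+[。！？!?；;]?")


def split_sentences(text):
    """Split a document into sentences, keeping the closing punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def kgram_hashes(sentence, k):
    """Hash every character k-gram of the sentence."""
    return [
        int(hashlib.md5(sentence[i:i + k].encode("utf-8")).hexdigest()[:16], 16)
        for i in range(len(sentence) - k + 1)
    ]


def winnow(sentence, k=3, w=4):
    """Sliding-window fingerprint: keep the minimum hash of each window."""
    hashes = kgram_hashes(sentence, k)
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def fingerprint_similarity(a, b, k=3, w=4):
    """Jaccard overlap of the two fingerprint sets (baseline scorer)."""
    fa, fb = winnow(a, k, w), winnow(b, k, w)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


def candidate_pairs(doc_a, doc_b, similarity, threshold=0.8):
    """Score every sentence pair and keep those at or above the threshold.

    `similarity` is any callable (sentence, sentence) -> float; a semantic
    model would be supplied here instead of the fingerprint baseline.
    """
    pairs = []
    for sa, sb in product(split_sentences(doc_a), split_sentences(doc_b)):
        score = similarity(sa, sb)
        if score >= threshold:
            pairs.append((sa, sb, score))
    return pairs
```

Calling `candidate_pairs(doc_a, doc_b, fingerprint_similarity)` flags sentence pairs that share literal character fragments. Because the fingerprints are built from exact k-grams, a pair that differs only by synonym substitution shares few of them, which is the weakness the abstract attributes to the fingerprint method.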

Keywords: computer science; natural language processing; plagiarism detection; semantic similarity; word vector representation

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
