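Authors: HU Buhuan (胡布焕), ZHANG Jing (张晶) [1], ZHANG Ling (张凌) [1] (Guangdong Province Key Laboratory of Computer Network, College of Computer Science and Technology, South China University of Technology, Guangzhou 510006, Guangdong Province, P.R. China)
Affiliation: [1] Guangdong Province Key Laboratory of Computer Network, School of Computer Science and Engineering, South China University of Technology, Guangzhou 510641, Guangdong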
Source: Journal of Shenzhen University (Science and Engineering) (《深圳大学学报(理工版)》), 2020, Issue S01, pp. 107-111 (5 pages)
Funding: China Education and Research Network (CERNET) funded project (NGII20190615).
Abstract: Plagiarized text is often disguised by operations such as synonym substitution and paraphrasing, which interfere with detection. To address this, we propose a plagiarism detection approach for Chinese documents based on semantic textual similarity. First, each document is split into sentences, and word2vec is used to represent each word of a sentence as a vector, which serves as the input to a convolutional neural network (CNN). The CNN then extracts and filters sentence features, computes the difference between the two sentences of a pair, and outputs the pair's similarity. Sentence pairs with high similarity are treated as candidates for plagiarism. Finally, copy-and-paste documents and semantically similar documents are used as the dataset to verify the proposed method and compare it with the traditional fingerprint feature extraction method. The proposed method is tested on a large, publicly available Tencent Cloud text similarity dataset and applied to plagiarism detection in student homework. The results show that although the traditional moving-window fingerprint feature extraction method finds identical fragments in two documents fairly accurately, it is sensitive to noise in semantically similar text, whereas the proposed approach can find the semantically similar parts of documents.
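The pipeline in the abstract (split into sentences, embed the words, score sentence pairs, flag the high-scoring pairs) can be illustrated with a minimal sketch. It is not the authors' implementation: the trained word2vec model is replaced by a small hypothetical table `TOY_VECTORS`, the CNN feature extractor is replaced by plain mean pooling with cosine similarity, whitespace tokenization stands in for Chinese word segmentation, and the function names and the `threshold` value are assumptions made for illustration.

```python
import math
import re

# Hypothetical 3-dimensional toy vectors standing in for a trained word2vec model.
TOY_VECTORS = {
    "fast": [0.9, 0.1, 0.0],
    "quick": [0.85, 0.15, 0.0],
    "car": [0.0, 0.9, 0.1],
    "automobile": [0.05, 0.88, 0.1],
    "rain": [0.0, 0.1, 0.9],
    "falls": [0.1, 0.0, 0.8],
}


def split_sentences(text):
    """Split a document on Chinese and Western sentence-ending punctuation."""
    parts = re.split(r"[。!?.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def sentence_vector(sentence, vectors=TOY_VECTORS):
    """Mean-pool the vectors of known words; return None if no word is known.

    Assumption: whitespace tokenization and mean pooling replace the word
    segmentation and CNN feature extraction described in the abstract.
    """
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def suspicious_pairs(doc_a, doc_b, threshold=0.9):
    """Return (sentence_a, sentence_b, score) for pairs scoring >= threshold."""
    pairs = []
    for sa in split_sentences(doc_a):
        va = sentence_vector(sa)
        if va is None:
            continue  # no known words, nothing to compare
        for sb in split_sentences(doc_b):
            vb = sentence_vector(sb)
            if vb is None:
                continue
            score = cosine(va, vb)
            if score >= threshold:
                pairs.append((sa, sb, score))
    return pairs


if __name__ == "__main__":
    found = suspicious_pairs("the fast car. rain falls.", "a quick automobile.")
    for a, b, score in found:
        print(a, "|", b, "|", round(score, 3))
```

In this toy run, "the fast car" and "a quick automobile" are flagged even though they share no in-vocabulary word, which is the kind of synonym-substitution case that exact-match fingerprinting would miss. A trained similarity model, as described in the abstract, would replace `sentence_vector` and `cosine`.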
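Keywords: computer science; natural language processing; plagiarism detection; semantic similarity; word vector representation
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]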