Method for Checking Duplicate Text of Network Piracy Based on Phoneme (基于音位的网络盗版文本查重方法)

Authors: JIN Zhe-fan[1], YU Ding-guo[1], LIN Sheng-you[1], ZHOU Zhong-cheng[1]

Affiliation: [1] Zhejiang University of Media and Communications, Hangzhou 310018, Zhejiang, China

Source: Journal of Shandong Agricultural University (Natural Science Edition), 2017, No. 3, pp. 467-471 (5 pages)

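Funding: Zhejiang Provincial Public Welfare Technology Application Research Project (2016C33196); Zhejiang Provincial Public Welfare Technology Application Research Project (2017C33105)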

Abstract: Traditional duplicate-checking algorithms segment a text into words in order to build keyword vectors. For certain special-purpose applications of online piracy detection, however, the cost of word segmentation is not necessarily reasonable or necessary. This paper therefore proposes a duplicate-checking method based on Chinese phonemic information. A text is represented as three vectors in the spaces of initials, finals, and tones, and cosine distance is used as the similarity measure. Two similarity decision formulas are proposed: one assumes that the three vectors are independently distributed; the other takes a linear combination of the three, with coefficients that can be computed from the information entropy of the phonemic elements. Estimates of the information entropy are obtained from statistics over a large body of text. A corpus is generated using the traditional keyword-vector/SimHash method as the reference, and the model parameters are obtained from statistics on that corpus. Experimental results show that the method achieves moderate precision and very good recall, and that its computational cost is lower than that of the traditional methods, which makes it suitable for settings that need to filter out large numbers of TN (true negative) documents.
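The representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact decision formulas, so the per-vector threshold rule, the entropy-weighted average, the threshold values, and all function names below are assumptions made for illustration. The sketch also assumes the text has already been converted to tone-numbered pinyin syllables (for example `zhong1`), and it treats `y` and `w` as initials for simplicity.

```python
import math
from collections import Counter

# Two-letter initials must be tried before one-letter ones.
# Treating "y" and "w" as initials is a simplification (assumption).
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable):
    """Split a syllable such as 'zhong1' into (initial, final, tone)."""
    # A missing tone digit is mapped to "5" (neutral tone, assumed convention).
    tone = syllable[-1] if syllable[-1].isdigit() else "5"
    body = syllable.rstrip("012345")
    for initial in INITIALS:
        if body.startswith(initial):
            return initial, body[len(initial):], tone
    return "", body, tone  # zero-initial syllable, e.g. 'an4'


def phoneme_vectors(syllables):
    """Return three count vectors: initials, finals, tones."""
    parts = [split_syllable(s) for s in syllables]
    if not parts:
        return Counter(), Counter(), Counter()
    initials, finals, tones = zip(*parts)
    return Counter(initials), Counter(finals), Counter(tones)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def similarities(x, y):
    """Cosine similarity in each of the three phoneme spaces."""
    return tuple(cosine(a, b)
                 for a, b in zip(phoneme_vectors(x), phoneme_vectors(y)))


def entropy(counter):
    """Shannon entropy (bits) of the empirical distribution in a Counter."""
    total = sum(counter.values())
    if not total:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def is_duplicate_independent(sims, thresholds=(0.9, 0.9, 0.9)):
    """Assumed rule 1: every space must pass its own (hypothetical) threshold."""
    return all(s >= t for s, t in zip(sims, thresholds))


def combined_score(sims, weights):
    """Assumed rule 2: weighted average, e.g. with entropy-derived weights."""
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


doc_a = ["zhong1", "guo2"]
doc_b = ["guo2", "zhong1"]   # same syllables, different order
doc_c = ["ni3", "hao3"]
print(similarities(doc_a, doc_b))
print(similarities(doc_a, doc_c))
```

Because the vectors are simple counts, the comparison ignores syllable order and needs no word segmentation, which is where the claimed saving in computational cost would come from. In the weighted variant, the weights could be the entropies of the initial, final, and tone distributions estimated from a large corpus with `entropy`; how the paper actually maps entropy to coefficients is not stated in the abstract.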

Keywords: phoneme; piracy; text duplicate checking

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
