Hierarchical Indexing of Text Duplicate Data in Network Confrontation under Information Similarity


Authors: GAO Jing (高晶) [1,3]; CAO Fu-kai (曹福凯); YAN Ming (闫明); Muhd Khaizer Omar (Jitang College, North China University of Science and Technology, Tangshan, Hebei 063210, China; North China University of Science and Technology Archives, Tangshan, Hebei 063210, China; Faculty of Educational Studies, Universiti Putra Malaysia, Putrajaya, UPM Serdang, Selangor 43400, Malaysia)

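Affiliations: [1] Jitang College, North China University of Science and Technology, Tangshan, Hebei 063210, China [2] North China University of Science and Technology, Tangshan, Hebei 063210, China [3] Faculty of Educational Studies, Universiti Putra Malaysia, Putrajaya, UPM Serdang, Selangor 43400, Malaysia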

Source: Computer Simulation (《计算机仿真》), 2021, No. 10, pp. 462-465, 470 (5 pages)

Abstract: Existing hierarchical indexing methods for duplicate data do not preprocess the data, and as a result suffer from low classification efficiency, low accuracy, and a low extraction rate for similar data. This paper proposes a hierarchical indexing method for duplicate text data in network confrontation, based on information similarity. The method first builds a vector space model, converting all texts into a specific form that can be recognized on the Internet, and computes the feature terms of the data and their weights, which are used for a coarse initial classification. The edit distance is then used to compute the similarity between feature terms in detail. Finally, a naive Bayes classifier, after repeated training, carries out the hierarchical indexing of the duplicate data. Experimental results show that the proposed method achieves higher classification efficiency, higher accuracy, and a higher extraction rate for similar data.
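The first two steps named in the abstract, weighting feature terms in a vector space model and comparing terms by edit distance, can be illustrated with a minimal Python sketch. The abstract does not give the paper's weighting formula or how edit distance is normalized, so the smoothed TF-IDF weighting, the whitespace tokenization, and the function names `tfidf_weights`, `edit_distance`, and `similarity` are all assumptions made for illustration, not the authors' implementation. The naive Bayes stage is not sketched here.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document.

    Assumption: whitespace tokenization and smoothed TF-IDF weights;
    the paper's actual weighting scheme is not stated in the abstract.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return weights


def edit_distance(a, b):
    """Levenshtein distance, computed with two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map edit distance into [0, 1]; 1.0 means the strings are identical.

    Assumption: normalization by the longer string's length.
    """
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

For example, `edit_distance("kitten", "sitting")` returns 3, so `similarity("kitten", "sitting")` is 1 - 3/7. In a pipeline like the one the abstract outlines, scores of this kind could serve as inputs to the classifier that assigns duplicate levels.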

Keywords: similarity; duplicate data; hierarchical indexing; dimensionality reduction; feature extraction

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
