Authors: GAO Jing (高晶) [1,3]; CAO Fu-kai (曹福凯); YAN Ming (闫明); Muhd Khaizer Omar (Jitang College, North China University of Science and Technology, Tangshan, Hebei 063210, China; North China University of Science and Technology Archives, Tangshan, Hebei 063210, China; Faculty of Educational Studies, Universiti Putra Malaysia, Putrajaya, UPM Serdang, Selangor, Malaysia, 43400)
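Affiliations: [1] Jitang College, North China University of Science and Technology, Tangshan, Hebei 063210; [2] North China University of Science and Technology, Tangshan, Hebei 063210; [3] Faculty of Educational Studies, Universiti Putra Malaysia, Putrajaya, UPM Serdang, Selangor, Malaysia, 43400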
Source: Computer Simulation (《计算机仿真》), 2021, No. 10, pp. 462-465, 470 (5 pages)
Abstract: Existing hierarchical indexing methods for duplicate data do not preprocess the data, which leads to low classification efficiency, low accuracy, and a low extraction rate for similar data. This paper proposes a hierarchical indexing method for duplicate data in network adversarial text, based on information similarity. The method first builds a vector space model, converts all texts into a specific pattern recognizable by the Internet, and computes the feature terms of the data and their weights, which are used for a coarse classification. It then applies the edit distance method to compute the similarity between feature terms in detail. Finally, a naive Bayes classifier, after repeated training, performs the hierarchical indexing of duplicate data. Experimental results show that the proposed method achieves relatively high classification efficiency and accuracy, and a high extraction rate for similar data.
Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
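
The abstract names two building blocks, feature-term weighting and edit-distance similarity, without giving formulas. The following is a minimal sketch of what those two steps might look like, not the authors' implementation: the function names, the whitespace tokenization, the relative-frequency weighting, and the normalization of edit distance to a similarity in [0, 1] are all assumptions made for illustration. The naive Bayes stage is omitted because the record does not describe its features or training data.

```python
from collections import Counter


def term_weights(text: str) -> dict:
    """Relative term frequency per whitespace token (an assumed weighting scheme)."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()} if total else {}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means the strings are identical (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

Under these assumptions, two feature terms could be treated as near-duplicates when `similarity` exceeds a chosen threshold; the threshold itself would have to be tuned on data, and the record gives no value for it.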