Improved K Nearest Neighbor Classifier with Instance Weighting for Web Spam Detection

Cited by: 2


Authors: WU Junhua; TAN Bojue; GAO Qie; CHEN Musheng (School of Software Engineering, Jiangxi University of Science and Technology, Nanchang 330013, China)

Affiliation: [1] School of Software Engineering, Jiangxi University of Science and Technology, Nanchang 330013, China

Source: Journal of Chongqing University of Technology: Natural Science, 2021, No. 7, pp. 283-290 (8 pages)

Funding: Science and Technology Research Fund Project of the Jiangxi Provincial Department of Education (GJJ180450).

Abstract: To address the "curse of dimensionality" and the imbalanced classification problem in web spam detection, a K nearest neighbor classifier with instance weighting, combined with optimal Fisher feature selection, is proposed. First, Fisher feature selection is performed on the training dataset and all features are sorted by Fisher score in descending order. Features are then tried one at a time in that order: each candidate is used to classify the training dataset with the instance-weighted K nearest neighbor classifier, and the feature is retained only if the AUC of the training classification results increases. Finally, the testing dataset is classified by the instance-weighted K nearest neighbor classifier using the retained optimal feature subset. Experimental results on the WEBSPAM UK-2006 dataset show that the proposed method clearly outperforms traditional classifiers such as decision tree, support vector machine, naive Bayes, and K nearest neighbor. Compared with other related methods, the proposed method is close to the best results on accuracy, F1 measure, and AUC.
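The procedure in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the weighting formula, so the sketch assumes each training instance is weighted by the inverse of its class size. It also assumes binary labels (0 = normal, 1 = spam), squared Euclidean distance, and a leave-one-out evaluation on the training set so that a point is never its own neighbor. All function names (`fisher_scores`, `auc`, `weighted_knn_scores`, `select_features`) are hypothetical.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher score: between-class scatter over within-class scatter."""
    mu = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)  # small constant guards against zero variance

def auc(y_true, scores):
    """AUC as the fraction of (spam, normal) pairs ranked correctly; ties count 0.5."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def weighted_knn_scores(X_train, y_train, X_query, k=5, leave_one_out=False):
    """Spam score per query: weighted share of spam among the k nearest neighbors."""
    # Assumption: instance weight = 1 / (size of the instance's class).
    counts = np.bincount(y_train, minlength=2)
    w = 1.0 / counts[y_train]
    d = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    if leave_one_out:
        # Query set is the training set itself: exclude each point from its own neighbors.
        np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    spam_weight = (w[nn] * (y_train[nn] == 1)).sum(axis=1)
    return spam_weight / w[nn].sum(axis=1)

def select_features(X, y, k=5):
    """Try features in descending Fisher-score order; keep one only if training AUC rises."""
    kept, best = [], 0.0
    for j in np.argsort(-fisher_scores(X, y)):
        trial = kept + [int(j)]
        scores = weighted_knn_scores(X[:, trial], y, X[:, trial], k, leave_one_out=True)
        trial_auc = auc(y, scores)
        if trial_auc > best:
            kept, best = trial, trial_auc
    return kept, best
```

With `kept, _ = select_features(X_train, y_train)`, test pages would then be scored by `weighted_knn_scores(X_train[:, kept], y_train, X_test[:, kept])`. Weighting by inverse class size lets the minority spam class carry as much total vote as the majority class, which is one common way to handle imbalance. The paper's actual weighting scheme may differ.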

Keywords: web spam detection; feature selection; K nearest neighbor; imbalanced data classification; cost-sensitive analysis

CLC number: TP391.6 [Automation and Computer Technology: Computer Application Technology]

 
