基于支持向量机的搜索引擎垃圾网页检测研究  (Cited by: 5)

Study of the Web Spam Detection Based on the Support Vector Machine


Authors: Jia Zhiyang (贾志洋)[1], Li Weiwei (李伟伟), Gao Wei (高炜)[3], Xia Youming (夏幼明)[3]

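Affiliations: [1] College of Tourism and Culture, Yunnan University, Lijiang, Yunnan 674100; [2] Department of Computer Science, Ningde Vocational and Technical College, Ningde, Fujian 352000; [3] School of Information, Yunnan Normal University, Kunming, Yunnan 650040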

Source: Journal of Yunnan Minzu University: Natural Sciences Edition (《云南民族大学学报(自然科学版)》), 2011, No. 3, pp. 173-176 (4 pages)

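Funding: National Natural Science Foundation of China (60903131); Scientific Research Fund of the Yunnan Provincial Department of Education (2010Y108)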

Abstract: With the widespread use of search engines, some web pages cheat the search engines in order to raise their rankings in the search results. Such pages are called web spam. Web spam detection is generally treated as a binary classification problem: a classifier built with a machine learning classification algorithm divides web pages into two categories, normal and spam. Existing content-based web spam detection models ignore the link relationships between web pages. The authors therefore construct a soft-margin support vector machine classifier that takes the content features of web pages as the support vectors and defines a penalty function based on the observation that linked web pages tend to be similar. Learning from a sample set yields a linear support vector machine web page classifier, whose classification performance is then tested. The experimental results show that the support vector machine-based classifier performs significantly better than a decision tree classifier built from content features.
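The abstract does not give the exact form of the link-based penalty function, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a linear soft-margin SVM trained by subgradient descent, with an added term that penalizes different scores for pages joined by a link. The function name `train_link_regularized_svm`, the squared-difference penalty, and the parameters `lam`, `gamma`, `lr`, and `epochs` are all hypothetical choices made for illustration.

```python
import numpy as np

def train_link_regularized_svm(X, y, edges, lam=0.01, gamma=0.1,
                               lr=0.05, epochs=500):
    """Linear soft-margin SVM with an illustrative link-smoothness penalty.

    X     : (n, d) array of content features, one row per web page
    y     : (n,) array of labels in {-1, +1} (+1 = spam, -1 = normal)
    edges : list of (i, j) index pairs, one per link between pages
    Objective (an assumption, not taken from the paper):
        mean hinge loss + lam * ||w||^2
        + gamma * mean over edges of (f(x_i) - f(x_j))^2
    where f(x) = w . x + b.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    src = np.array([i for i, _ in edges], dtype=int)
    dst = np.array([j for _, j in edges], dtype=int)

    for _ in range(epochs):
        scores = X @ w + b

        # Hinge-loss subgradient: only pages inside the margin contribute.
        viol = y * scores < 1
        grad_w = -(y[viol][:, None] * X[viol]).sum(axis=0) / n + 2 * lam * w
        grad_b = -y[viol].sum() / n

        # Link penalty gradient. The bias cancels in f(x_i) - f(x_j),
        # so this term affects w only.
        if len(edges) > 0:
            diff = scores[src] - scores[dst]
            grad_w += 2 * gamma * (diff[:, None] * (X[src] - X[dst])).mean(axis=0)

        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    """Return +1 (spam) or -1 (normal) for each row of X."""
    return np.where(X @ w + b >= 0, 1, -1)
```

Setting `gamma=0` reduces the sketch to an ordinary linear soft-margin SVM on content features alone, which makes it easy to compare the two settings on the same data.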

Keywords: web spam; web spam detection; machine learning; web page classification; support vector machine

Classification number: TP391.3 [Automation and Computer Technology: Computer Application Technology]

 
