Efficient Detection Method of Camouflage Garbage Pages Based on Binary Classification

Original title: 基于二元分类的伪装型垃圾网页高效检测方法


Author: WEI Huan (魏欢)[1], College of Computer and Arts, Anhui Technical College of Industry and Economy, Hefei 230051, China

Affiliation: [1] College of Computer and Arts, Anhui Technical College of Industry and Economy (安徽工业经济职业技术学院计算机与艺术学院)

Source: Journal of Lanzhou Institute of Technology (《兰州工业学院学报》), 2019, No. 4, pp. 76-80 (5 pages)

Funding: Anhui Provincial Quality Engineering Project (2015M00C144)

Abstract: To improve the detection of camouflaged spam web pages, a detection algorithm based on binary classification is proposed. Web page samples collected from various types of websites undergo hidden-link (dark-link) domain name feature analysis and web crawler analysis, and the text and image information features associated with the distribution of camouflaged spam pages are constructed. Spam information in the sample set is filtered using a vertical crawler and abnormal feature mining. With the weighted spam information of the pages as the test set, binary classification is used to perform path template analysis on the camouflaged spam pages, and all abnormal samples are retrieved by the vertical crawler. The font color of the relevant text and the background color of the page are then extracted, the feature extraction results are fed into a binary semantic classifier for data classification, and detection is completed in combination with a big data fusion clustering method. Simulation results show that the method detects camouflaged spam pages with high accuracy, resists interference from spam pages and hidden links well, and improves web page security monitoring.
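The abstract's final stage (comparing font color with background color, then passing the features to a binary classifier) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `TextBlock` container, the feature names, the distance threshold of 30, and the cutoff rule that stands in for the classifier. The abstract does not specify the paper's actual features, classifier, or parameters.

```python
def parse_hex_color(value):
    """Convert '#rrggbb' or '#rgb' into an (r, g, b) tuple of ints."""
    value = value.strip().lstrip("#")
    if len(value) == 3:
        value = "".join(ch * 2 for ch in value)
    if len(value) != 6:
        raise ValueError(f"unsupported color: {value!r}")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def color_distance(color_a, color_b):
    """Euclidean distance between two hex colors in RGB space."""
    a, b = parse_hex_color(color_a), parse_hex_color(color_b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class TextBlock:
    """Hypothetical container for one piece of page text and its colors."""

    def __init__(self, text, font_color, background_color, is_link=False):
        self.text = text
        self.font_color = font_color
        self.background_color = background_color
        self.is_link = is_link


def extract_features(blocks, distance_threshold=30.0):
    """Treat text whose font color nearly matches its background as hidden."""
    hidden = [
        b for b in blocks
        if color_distance(b.font_color, b.background_color) < distance_threshold
    ]
    total_chars = sum(len(b.text) for b in blocks) or 1  # avoid division by zero
    return {
        "hidden_text_ratio": sum(len(b.text) for b in hidden) / total_chars,
        "hidden_link_count": sum(1 for b in hidden if b.is_link),
    }


def classify(features, ratio_cutoff=0.3, link_cutoff=1):
    """Assumed stand-in rule: 1 = camouflaged spam page, 0 = normal page."""
    suspicious = (
        features["hidden_text_ratio"] >= ratio_cutoff
        or features["hidden_link_count"] >= link_cutoff
    )
    return int(suspicious)


page = [
    TextBlock("News", "#000", "#fff"),
    TextBlock("cheap pills", "#fefefe", "#ffffff", is_link=True),
]
print(classify(extract_features(page)))
```

In practice the cutoff rule would be replaced by a classifier trained on labeled pages, and the colors would come from rendered or parsed CSS rather than hand-built blocks.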

Keywords: binary classification; spam web pages; hidden links; detection

Classification code: TP393 [Automation and Computer Technology: Computer Application Technology]

 
