A Comparative Study of Typical Machine Learning Algorithms for Spam Filtering

The comparison of spam filter based on generative model and discriminative model


Authors: Ding Huafu (丁华福)[1], Wang Yingying (王莹莹)[1], Han Yong (韩咏)[2], Min Li (闵莉)[2], Zou Yu (邹钰)[2]

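Affiliations: [1] School of Computer Science and Technology, Harbin University of Science and Technology, Harbin, Heilongjiang 150080; [2] School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, Heilongjiang 150050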

Source: Journal of Heilongjiang Institute of Technology (《黑龙江工程学院学报》), 2012, No. 2, pp. 65-69 (5 pages)

Funding: Science and Technology Research (General) Project of the Heilongjiang Provincial Department of Education (12511444)

Abstract: Machine-learning-based filtering is currently the mainstream approach to spam filtering. Machine learning models fall into two main categories: generative models, represented by Naive Bayes (NB), and discriminative models, represented by Logistic Regression (LR) and Support Vector Machines (SVM). Previous studies of the two model types were each conducted on a single language, and little work has examined how language-independent or language-dependent the models are. This paper therefore compares the filtering performance of typical generative and discriminative models on both Chinese and English datasets. Specifically, it compares Bogo (a system based on the Bayesian algorithm, and a typical generative model) with Logistic Regression and Relaxed Online SVM (ROSVM), two typical discriminative models. The experiments use the public English datasets TREC05p-1 and TREC06p and the public Chinese datasets TREC06c and SEWM2011, with immediate feedback. The results show that spam filters based on discriminative models clearly outperform those based on the generative model, and that the same model performs better on the Chinese datasets. ROSVM gives the best performance for Chinese spam filtering.
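The contrast the abstract draws between a generative filter and a discriminative filter evaluated with immediate feedback can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the class names, the whitespace tokenizer, the learning rate, and the six-message toy stream are all hypothetical, the naive Bayes variant is a simplified one that scores only the tokens present in a message, and no SVM is included. The toy stream says nothing about the relative performance reported on the TREC and SEWM datasets.

```python
import math
from collections import defaultdict


def tokenize(text):
    """Hypothetical tokenizer: lowercase, split on whitespace, keep distinct tokens."""
    return set(text.lower().split())


class OnlineNaiveBayes:
    """Generative sketch: per-class token counts with Laplace smoothing."""

    def __init__(self):
        self.doc_counts = {0: 0, 1: 0}  # 0 = ham, 1 = spam
        self.token_counts = {0: defaultdict(int), 1: defaultdict(int)}

    def score(self, tokens):
        # Log-odds of spam versus ham; only tokens present in the message are used.
        log_odds = math.log((self.doc_counts[1] + 1) / (self.doc_counts[0] + 1))
        for t in tokens:
            p_spam = (self.token_counts[1][t] + 1) / (self.doc_counts[1] + 2)
            p_ham = (self.token_counts[0][t] + 1) / (self.doc_counts[0] + 2)
            log_odds += math.log(p_spam / p_ham)
        return log_odds

    def update(self, tokens, label):
        self.doc_counts[label] += 1
        for t in tokens:
            self.token_counts[label][t] += 1


class OnlineLogisticRegression:
    """Discriminative sketch: one stochastic gradient step per message."""

    def __init__(self, learning_rate=0.5):
        self.weights = defaultdict(float)
        self.bias = 0.0
        self.learning_rate = learning_rate

    def score(self, tokens):
        return self.bias + sum(self.weights[t] for t in tokens)

    def update(self, tokens, label):
        z = max(-30.0, min(30.0, self.score(tokens)))  # clamp to avoid overflow
        error = label - 1.0 / (1.0 + math.exp(-z))
        self.bias += self.learning_rate * error
        for t in tokens:
            self.weights[t] += self.learning_rate * error


def run_online(model, stream):
    """Classify each message first, then reveal its true label (immediate feedback)."""
    mistakes = 0
    for text, label in stream:
        tokens = tokenize(text)
        predicted = 1 if model.score(tokens) > 0 else 0
        if predicted != label:
            mistakes += 1
        model.update(tokens, label)
    return mistakes


# Hypothetical toy stream of (message, label) pairs; 1 = spam, 0 = ham.
STREAM = [
    ("win free prize now", 1),
    ("meeting agenda for tomorrow", 0),
    ("free prize claim now", 1),
    ("lunch meeting tomorrow", 0),
    ("claim your free money", 1),
    ("project agenda attached", 0),
]

nb_filter = OnlineNaiveBayes()
lr_filter = OnlineLogisticRegression()
nb_mistakes = run_online(nb_filter, STREAM)
lr_mistakes = run_online(lr_filter, STREAM)
print("NB mistakes:", nb_mistakes, "LR mistakes:", lr_mistakes)
```

Both filters share the same `score`/`update` interface, so the same evaluation loop can drive either one. A comparison in the spirit of the paper would replace the toy stream with messages read in order from a labeled corpus and would use a tokenizer suited to the language being filtered.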

Keywords: generative model; discriminative model; Chinese spam filtering

Classification code: TP393 [Automation and Computer Technology: Computer Application Technology]

 
