Mail Sorting Technology Based on Improved TFIDF

Cited by: 3


Authors: TAO Feng; TANG Kun; CHENG Guang [3]

Affiliations: [1] Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, Hubei, China; [2] FiberHome StarrySky Co., Ltd., Nanjing 210019, Jiangsu, China; [3] School of Computer Science and Engineering, Southeast University, Nanjing 210096, Jiangsu, China

Source: Computer Technology and Development (《计算机技术与发展》), 2018, No. 8, pp. 27-31 (5 pages)

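Funding: National High Technology Research and Development Program of China ("863" Program) (2015AA015603); National Natural Science Foundation of China (61602114)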

Abstract: With the spread of e-mail, the proliferation of spam has drawn increasing attention, and spam classification has become an active research topic in recent years. Feature selection directly affects the efficiency and accuracy of mail classification, and the TFIDF algorithm can effectively evaluate how important a feature term is for classifying mail. However, using TFIDF alone to judge whether a feature is discriminative has notable shortcomings: it does not consider how a feature word is distributed between classes and within a class, and it underestimates the role of high-frequency words while overestimating the role of low-frequency words. This paper modifies the TFIDF algorithm to reduce the influence of feature words that occur frequently in special-case messages, and introduces a frequency difference that increases the weight of terms occurring frequently within a class and decreases the weight of terms occurring rarely within a class. Finally, the improved TFIDF algorithm is compared with traditional feature extraction algorithms. The experimental results show that the improved algorithm selects a more suitable set of feature terms and thereby improves mail classification.
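The limitation named in the abstract can be illustrated with a minimal sketch. This record does not give the paper's modified formula, so the code below shows only the classical TFIDF weighting, plus a hypothetical helper `class_frequency_difference` (an assumption made for illustration, not the authors' frequency-difference term). The toy messages and labels are invented.

```python
import math
from collections import Counter


def tfidf(docs):
    """Classical TFIDF: tf(t, d) * log(N / df(t)) over tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def class_frequency_difference(docs, labels, term, target="spam"):
    """Hypothetical helper (not from the paper): share of target-class
    documents containing `term` minus the share of all other documents."""
    in_class = [term in doc for doc, lab in zip(docs, labels) if lab == target]
    out_class = [term in doc for doc, lab in zip(docs, labels) if lab != target]
    return sum(in_class) / len(in_class) - sum(out_class) / len(out_class)


# Invented toy corpus: "click" occurs only in spam, "hello" in both classes.
docs = [
    ["free", "click", "hello"],
    ["prize", "click", "now"],
    ["meeting", "hello", "notes"],
    ["project", "agenda", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]

weights = tfidf(docs)
click_weight = weights[0]["click"]
hello_weight = weights[0]["hello"]
click_diff = class_frequency_difference(docs, labels, "click")
hello_diff = class_frequency_difference(docs, labels, "hello")
```

In this toy corpus "click" and "hello" each appear in two of the four messages, so classical TFIDF gives them the same weight in the first message even though only "click" is concentrated in one class. A class-aware term such as the helper above separates the two, which is the kind of information the abstract says plain TFIDF leaves out.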

Keywords: mail classification; discriminative power; feature words; weight; feature extraction

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
