Authors: WU Xiaoqing; WAN Guojin[1]; LI Chengwen; LIN Mengsi; CAO Shuqiang (School of Information Engineering, Nanchang University, Nanchang 330031, China)
Source: Modern Electronics Technique (《现代电子技术》), 2020, No. 12, pp. 83-86 (4 pages)
Funding: National Natural Science Foundation of China (61661030).
Abstract: The traditional TF-IDF algorithm assigns term weights poorly: words that occur frequently and are representative of a mail category receive a relatively small IDF value, which implies weak discriminating power and contradicts their actual importance. To improve the accuracy of spam recognition, a Chinese spam recognition method based on an improved TF-IDF algorithm and class centre vectors is proposed. The chi-square statistic (CHI) and a position influence factor are added to the traditional TF-IDF calculation to correct the weights of important words. The improved weighting is combined with mail text segmentation by the reverse maximum matching algorithm and feature selection by the class centre vector algorithm to classify spam. Experimental results show that, compared with the traditional TF-IDF algorithm, the proposed method improves spam recognition accuracy by about 3.6%, which has a certain practical application value.
Keywords: TF-IDF algorithm; mail recognition; chi-square statistic; weight assignment; mail classification; simulation analysis
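Classification: TN911.23-34 [Electronics and Telecommunications: Communication and Information Systems]; TP181 [Electronics and Telecommunications: Information and Communication Engineering]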
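Illustrative sketch: the abstract names the ingredients of the method (TF-IDF, the chi-square statistic CHI, a position influence factor, class centre vectors) but this record does not give the formulas. The Python below is therefore only one plausible reading, not the authors' implementation. It assumes the weight is the product TF × IDF × CHI, that the position factor is a fixed boost for subject-line terms (`SUBJECT_BOOST`, a hypothetical value), that IDF uses the smoothed form log(N/df) + 1, and that mails are already segmented into tokens (the reverse maximum matching step is not shown). The toy English tokens and all function names are made up for the example.

```python
import math
from collections import Counter

SUBJECT_BOOST = 2.0  # hypothetical position factor: a subject-line term counts double


def chi_square(term, token_sets, labels, target):
    """Chi-square statistic of `term` against class `target` (2x2 contingency table)."""
    a = b = c = d = 0
    for tokens, label in zip(token_sets, labels):
        present = term in tokens
        if present and label == target:
            a += 1
        elif present:
            b += 1
        elif label == target:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def fit_weights(docs, labels):
    """Per-term weight idf * CHI, with CHI taken as the maximum over classes (assumption)."""
    token_sets = [set(subject) | set(body) for subject, body in docs]
    n = len(docs)
    weights = {}
    for term in set().union(*token_sets):
        df = sum(term in ts for ts in token_sets)
        idf = math.log(n / df) + 1.0  # smoothed IDF variant (assumption)
        chi = max(chi_square(term, token_sets, labels, c) for c in set(labels))
        weights[term] = idf * chi
    return weights


def vectorize(doc, weights):
    """Position-weighted term frequency multiplied by the fitted idf * CHI weight."""
    subject, body = doc
    counts = Counter(body)
    for term in subject:
        counts[term] += SUBJECT_BOOST
    total = sum(counts.values()) or 1.0
    return {t: (c / total) * weights[t] for t, c in counts.items() if t in weights}


def class_centres(docs, labels, weights):
    """Mean vector of each class."""
    sums, sizes = {}, Counter(labels)
    for doc, label in zip(docs, labels):
        centre = sums.setdefault(label, Counter())
        for term, w in vectorize(doc, weights).items():
            centre[term] += w
    return {c: {t: w / sizes[c] for t, w in v.items()} for c, v in sums.items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def classify(doc, centres, weights):
    """Assign the class whose centre is most similar (cosine) to the mail vector."""
    vec = vectorize(doc, weights)
    return max(centres, key=lambda c: cosine(vec, centres[c]))


# Toy training set: each mail is (subject_tokens, body_tokens).
train_docs = [
    (["free", "prize"], ["win", "free", "money", "now"]),
    (["cheap", "offer"], ["free", "money", "offer", "click"]),
    (["meeting"], ["project", "meeting", "tomorrow", "agenda"]),
    (["report"], ["project", "report", "attached", "review"]),
]
train_labels = ["spam", "spam", "ham", "ham"]

weights = fit_weights(train_docs, train_labels)
centres = class_centres(train_docs, train_labels, weights)
print(classify((["free"], ["money", "click", "now"]), centres, weights))
```

In this reading, a term such as "free" that appears in every spam mail and in no legitimate mail gets a large CHI value, which offsets the small IDF that plain TF-IDF would give it for being frequent. That is the weighting problem the abstract describes.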