Research on improved TF-IDF Chinese mail recognition algorithm

Cited by: 9

Authors: WU Xiaoqing (吴小晴); WAN Guojin (万国金) [1]; LI Chengwen (李程文); LIN Mengsi (林梦思); CAO Shuqiang (曹书强)

Affiliation: [1] School of Information Engineering, Nanchang University, Nanchang, Jiangxi 330031, China

Source: Modern Electronics Technique (《现代电子技术》), 2020, No. 12, pp. 83-86 (4 pages)

Funding: National Natural Science Foundation of China (61661030).

Abstract: The traditional TF-IDF algorithm does not weight segmented terms well. Terms that occur frequently and are representative of a mail category receive a relatively small IDF value, and a small IDF value implies weak discriminating power, which contradicts the actual situation. To improve the accuracy of spam recognition, a Chinese spam recognition method based on an improved TF-IDF algorithm and class center vectors is proposed. The traditional TF-IDF calculation is modified by adding the chi-square statistic (CHI) and a position influence factor, which improves the weights of some important terms. This weighting is combined with mail text segmentation by the reverse maximum matching algorithm and feature selection by the class center vector algorithm to classify spam. Experimental results show that, compared with the traditional TF-IDF algorithm, the proposed algorithm improves the accuracy of spam recognition by about 3.6%, which gives it some practical application value.
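The weighting problem described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the formula that combines TF-IDF, CHI, and the position influence factor, so the code only contrasts the textbook IDF with the textbook chi-square statistic. The corpus counts and the terms "invoice" and "meeting" are hypothetical.

```python
import math


def idf(n_docs, doc_freq):
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(n_docs / doc_freq)


def chi_square(a, b, c, d):
    """Textbook chi-square statistic of a term against a class (2x2 table).

    a: in-class documents containing the term
    b: out-of-class documents containing the term
    c: in-class documents without the term
    d: out-of-class documents without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Hypothetical corpus (assumption): 100 spam mails and 100 legitimate mails.
# "invoice" occurs in 90 spam and 5 legitimate mails: frequent and class-specific.
# "meeting" occurs in 10 mails of each class: rare and class-neutral.
N = 200
idf_invoice = idf(N, 90 + 5)
idf_meeting = idf(N, 10 + 10)
chi_invoice = chi_square(90, 5, 10, 95)
chi_meeting = chi_square(10, 10, 90, 90)

print(f"IDF  invoice={idf_invoice:.3f}  meeting={idf_meeting:.3f}")
print(f"CHI  invoice={chi_invoice:.3f}  meeting={chi_meeting:.3f}")
```

In this toy corpus the class-specific term gets the smaller IDF but the larger chi-square value, which is the kind of mismatch a CHI-based correction is meant to address.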

Keywords: TF-IDF algorithm; mail recognition; chi-square statistic; weight allocation; mail classification; simulation analysis

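Classification codes: TN911.23-34 [Electronics and Telecommunications: Communication and Information Systems]; TP181 [Electronics and Telecommunications: Information and Communication Engineering]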
