Research on Chinese Spam Classification Models Under Different Lengths  Cited by: 1

Performance and Selection of Chinese Spam Classification Model Under Different Lengths


Authors: Gu Mengjun; Feng Wenzhou; Chen Zhongbing (China Telecom Zhejiang Branch, Hangzhou, Zhejiang 310000; Public Security Bureau of Linhai City, Taizhou, Zhejiang 318000; Zhejiang Public Information Industry Co., Ltd., Hangzhou, Zhejiang 310000)

Affiliations: [1] China Telecom Co., Ltd. Zhejiang Branch, Hangzhou, Zhejiang 310000; [2] Public Security Bureau of Linhai City, Taizhou, Zhejiang 318000; [3] Zhejiang Public Information Industry Co., Ltd., Hangzhou, Zhejiang 310000

Source: Industry Information Security, 2022, No. 7, pp. 28-35 (8 pages)

Abstract: To address the increasingly widespread spam problem, this paper uses a variety of algorithms to compare Chinese spam classification models across different text lengths. First, a naive Bayes algorithm is trained and tested on an email dataset. Then, three subsets of different text lengths and two subsets of different sample sizes are drawn from the email dataset, forming five experimental sample sets. Finally, several traditional machine learning models, neural network models, and pre-trained models are built and compared on the five experimental sample sets. The results show that the pre-trained model ALBERT is best suited to classifying sentence-length Chinese spam, the traditional machine learning model SVM is best suited to paragraph-length Chinese spam, and the neural network model TextRCNN is best suited to document-length Chinese spam. The results also show that the neural network model TextRNN and the pre-trained model RoBERTa are not suitable for small-sample data.
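The paper's first step, training and testing a naive Bayes classifier on the email dataset, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy samples, labels, and character-level tokenization are all assumptions (the abstract does not specify a tokenizer or feature scheme).

```python
# Toy naive Bayes spam classifier with Laplace (add-one) smoothing.
# Character-level tokens avoid needing a Chinese word segmenter.
import math
from collections import Counter

def tokenize(text):
    return list(text)

def train(samples):
    # samples: list of (text, label) pairs, label in {"spam", "ham"}
    counts = {"spam": Counter(), "ham": Counter()}
    docs = Counter()
    for text, label in samples:
        docs[label] += 1
        counts[label].update(tokenize(text))
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, docs, vocab

def predict(model, text):
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    best, best_score = None, -math.inf
    for label in ("spam", "ham"):
        # Log prior plus smoothed log likelihood of each token.
        score = math.log(docs[label] / total_docs)
        denom = sum(counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            score += math.log((counts[label][tok] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

train_set = [
    ("限时优惠 点击领取大奖", "spam"),
    ("免费中奖 立即点击链接", "spam"),
    ("明天上午十点开会", "ham"),
    ("请查收本周工作报告", "ham"),
]
model = train(train_set)
print(predict(model, "点击领取免费大奖"))  # classified as spam on this toy data
```

In practice the paper compares this baseline against SVM, TextRNN/TextRCNN, and pre-trained models such as ALBERT and RoBERTa; a production setup would use a real segmenter and a held-out test split rather than hand-picked examples.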

Keywords: Chinese spam; text classification; machine learning; deep learning

CLC Numbers: TP391.1 [Automation and Computer Technology—Computer Application Technology]; TP18 [Automation and Computer Technology—Computer Science and Technology]

 
