Recognition of Spam Text Messages Based on Local Sensitive Hash K Nearest Neighbor Algorithm


Authors: FAN Jihui (樊继慧); TENG Shaohua (滕少华) [3]

Affiliations: [1] Department of Graduate School, Saint Paul University, Tuguegarao 3500, Philippines; [2] School of Computer Science and Engineering, Guangzhou Institute of Science and Technology, Guangzhou 510540, Guangdong, China; [3] School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, Guangdong, China

Source: Journal of University of Jinan (Science and Technology), 2023, No. 6, pp. 746-751 (6 pages)

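Funding: National Natural Science Foundation of China (61972102); Major Special Project of the Guangdong Provincial Department of Education (粤教2021ZDZX1070); Ministry of Education Collaborative Education Project (GZLGHT2021324); Guangdong Higher Education Association Research Project (22GQN37); Guangzhou Institute of Science and Technology School-based Research Project (2021XBZ03).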

Abstract: Current spam text message recognition algorithms rely on rigid keyword and frequency rules that are easily probed and evaded by offenders. To address this, a K-nearest-neighbor (KNN) algorithm based on locality-sensitive hashing (LSH) is applied to the classification and recognition of spam text messages. Features are defined first; the LSH algorithm is then used to compute vector distances, and the resulting distances are used to measure and quantify matrix similarity. The optimized model proposed in the paper is implemented and trained on this basis. Starting from the text content of the messages, the term frequency-inverse document frequency (TF-IDF) algorithm generates the matrix, the LSH algorithm finds the most similar samples, and their classes are recorded. The training results are then fed into a KNN classifier to obtain the optimal nearest neighbors, and the classification accuracy of the optimized model is evaluated on a test set or validation set. The results show that, after the KNN classifier, the optimized model achieves a spam text message classification accuracy of 98.7%.
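The pipeline described in the abstract (TF-IDF vectors, LSH to find similar samples, then a KNN vote) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the hash family, the value of K, or the tokenizer, so the random-hyperplane hash, the parameters `n_bits`, `n_tables`, and `k`, the whitespace tokenization, the class name `LshKnn`, and the toy English messages are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def build_tfidf(docs):
    """Fit a vocabulary index and smoothed IDF weights on tokenised documents."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    index = {t: i for i, t in enumerate(sorted(df))}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return index, idf


def vectorize(tokens, index, idf):
    """Map a token list to a dense TF-IDF vector; unseen tokens are ignored."""
    vec = [0.0] * len(index)
    for t, c in Counter(tokens).items():
        if t in index:
            vec[index[t]] = c / len(tokens) * idf[t]
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class LshKnn:
    """KNN classifier whose candidate neighbours come from random-hyperplane LSH
    (an assumed hash family; the abstract does not name one)."""

    def __init__(self, docs, labels, n_bits=4, n_tables=8, seed=0):
        self.labels = labels
        self.index, self.idf = build_tfidf(docs)
        self.vectors = [vectorize(d, self.index, self.idf) for d in docs]
        rng = random.Random(seed)
        dim = len(self.index)
        # One set of n_bits random hyperplanes per hash table.
        self.planes = [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
                        for _ in range(n_bits)] for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        for i, vec in enumerate(self.vectors):
            for planes, table in zip(self.planes, self.tables):
                table[self._key(planes, vec)].append(i)

    @staticmethod
    def _key(planes, vec):
        # The sign of each projection contributes one bit of the bucket key.
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes)

    def candidates(self, vec):
        """Indices of training samples sharing a bucket with vec in any table."""
        found = set()
        for planes, table in zip(self.planes, self.tables):
            found.update(table.get(self._key(planes, vec), ()))
        return found

    def predict(self, tokens, k=3):
        vec = vectorize(tokens, self.index, self.idf)
        # Fall back to a full scan when no bucket matches.
        pool = self.candidates(vec) or range(len(self.vectors))
        nearest = sorted(pool, key=lambda i: cosine(vec, self.vectors[i]),
                         reverse=True)[:k]
        return Counter(self.labels[i] for i in nearest).most_common(1)[0][0]


# Hypothetical toy data, not taken from the paper.
train = [
    ("win a free prize now click the link", "spam"),
    ("free cash offer claim your prize today", "spam"),
    ("urgent claim your free reward now", "spam"),
    ("are we still meeting for lunch today", "ham"),
    ("please call me when you get home", "ham"),
    ("see you at the meeting tomorrow morning", "ham"),
]
docs = [text.split() for text, _ in train]
labels = [label for _, label in train]
model = LshKnn(docs, labels)
print(model.predict("claim your free prize now".split()))
```

In this sketch, LSH only narrows the set of training messages that the KNN vote has to compare against; the final ranking still uses exact cosine similarity over the candidates. Chinese SMS text would need a word segmenter in place of `str.split`.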

Keywords: spam text message recognition; K-nearest-neighbor algorithm; locality-sensitive hashing; matrix similarity

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
