Research on the Application of the kNN Algorithm to SMS Client Classification  Cited by: 1


Authors: Wang Hong (王红)[1], Zhang Yanping (张燕平)[2], Chen Gongping (陈功平)[1]

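Affiliations: [1] Department of Information Engineering, Lu'an Vocational and Technical College, Lu'an, Anhui 237158; [2] School of Computer Science and Technology, Anhui University, Hefei, Anhui 230039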

Source: Journal of Shandong Agricultural University (Natural Science Edition), 2014, No. 2, pp. 216-222 (7 pages)

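Funding: Anhui Provincial Natural Science Research Project for Higher Education Institutions (KJ2012B181); Anhui Provincial Natural Science Research Project for Higher Education Institutions (KJ2012B183)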

Abstract: This paper studies and implements a kNN-based classification system for a mobile SMS client. Two feature vector sets, one for normal SMS and one for spam SMS, are extracted from a self-built SMS corpus. Preprocessing, dimensionality reduction, and removal of feature items with very low term frequency let each set retain as many of its class's characteristic feature items as possible. The corpus is split into a comparison set and a test set. The experiments show that classification works best when the comparison set contains n = 600 messages: a smaller n lowers the recognition rate, while a larger n raises the time complexity of classification. The best number of nearest neighbors is k = 25. The study also finds that results are best when the probability difference used in selecting the k messages lies between 1% and 2% and the count difference used in deciding the message class lies between 5 and 15. Following the principle of preserving the pass rate of normal SMS while raising the spam recognition rate, the final system parameters are n = 600, k = 25, a probability difference of 1.5%, and a count difference of 9, with which the recognition rates for normal and spam SMS reach up to 97.3% and 89%, respectively.
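The classification step described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names (`vectorize`, `cosine`, `knn_classify`), the whitespace tokenizer, the cosine similarity measure, and the exact form of the decision rule are all assumptions. In particular, the abstract does not define how the count difference is applied, so the sketch assumes a message is flagged as spam only when spam neighbors outnumber normal neighbors by at least `count_margin`, which favors the pass rate of normal SMS. The probability difference is not modeled, because the abstract does not say how it is computed.

```python
import math
from collections import Counter


def vectorize(text):
    # Term-frequency vector. Whitespace tokenization is a placeholder
    # assumption; Chinese SMS text would need a word segmenter instead.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, reference, k=25, count_margin=9):
    # reference: list of (message_text, label), label is "normal" or "spam".
    # Defaults mirror the k = 25 and count difference = 9 from the abstract.
    query = vectorize(text)
    ranked = sorted(
        reference,
        key=lambda item: cosine(query, vectorize(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    # Assumed rule: spam must lead normal by at least count_margin votes;
    # otherwise the message is passed through as normal.
    if votes["spam"] - votes["normal"] >= count_margin:
        return "spam"
    return "normal"
```

With a comparison set of 600 labeled messages, a call such as `knn_classify(message, reference)` would use the defaults above. Smaller values of `k` and `count_margin` are needed for toy data, since a margin of 9 cannot be reached with only a handful of neighbors.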

Keywords: SMS classification; kNN algorithm; feature vector set; vector space model

Classification number: TN929.53 [Electronics and Telecommunications: Communication and Information Systems]

 
