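Affiliations: [1] Department of Information Engineering, Lu'an Vocational and Technical College, Lu'an, Anhui 237158 [2] School of Computer Science and Technology, Anhui University, Hefei, Anhui 230039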
Source: Journal of Shandong Agricultural University: Natural Science Edition (《山东农业大学学报(自然科学版)》), 2014, No. 2, pp. 216-222 (7 pages)
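Funding: Anhui Provincial Natural Science Research Project for Higher Education Institutions (KJ2012B181); Anhui Provincial Natural Science Research Project for Higher Education Institutions (KJ2012B183)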
Abstract: This paper studies and implements a kNN-based classification system for SMS messages on a mobile phone client. Two feature vector sets, one for normal messages and one for spam, are extracted from a self-built SMS corpus. Preprocessing, dimensionality reduction, and removal of low-frequency feature terms allow each set to carry as many of its class's characteristic terms as possible. The corpus is split into a comparison set and a test set. The study finds that classification works best when the comparison set holds n = 600 messages: a smaller n lowers the recognition rate, while a larger n raises the time complexity of classification. The best neighbor count is k = 25. The study also finds that results are best when the probability difference used in selecting the k messages lies between 1% and 2% and the count difference used in deciding the message class lies between 5 and 15. Following the principle of preserving the pass rate of normal messages while raising the spam recognition rate, the final parameters are n = 600, k = 25, a probability difference of 1.5%, and a count difference of 9, giving recognition rates of up to 97.3% for normal messages and 89% for spam.
Classification code: TN929.53 [Electronics and Telecommunications: Communication and Information Systems]
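The decision rule summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which similarity measure is used or how the count difference enters the decision, so the cosine similarity, the function names (`cosine_similarity`, `classify`), and the reading of the count difference as a required lead of spam votes over normal votes are all assumptions. The probability difference is left out because the abstract does not explain how it is applied.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two sparse term-weight dicts (assumed measure)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(message_vec, comparison_set, k=25, count_margin=9):
    """Label a message 'spam' or 'normal' by a margin-based kNN vote.

    comparison_set is a list of (vector, label) pairs, with label either
    'normal' or 'spam'. Assumption: a message is called spam only when
    spam neighbors outnumber normal neighbors by at least count_margin;
    otherwise it defaults to normal, which favors the normal pass rate.
    """
    ranked = sorted(
        comparison_set,
        key=lambda item: cosine_similarity(message_vec, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    if votes["spam"] - votes["normal"] >= count_margin:
        return "spam"
    return "normal"
```

With the reported parameters the call would be `classify(vec, comparison_set, k=25, count_margin=9)` against a comparison set of 600 labeled messages. Defaulting to "normal" when the margin is not met reflects the stated principle of protecting normal messages, at the cost of letting some spam through.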