Research on the kNN Method for a Short Message Filtering System with an Unbalanced Training Set  Cited by: 1


Authors: Xu Shan [1], Du Weifeng [2]

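Affiliations: [1] Academic Affairs Office, Nanjing City Vocational College, Nanjing 210038, Jiangsu, China; [2] School of Mathematics, Physics and Information Engineering, Jiaxing University, Jiaxing 314001, Zhejiang, China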

Source: Computer Applications and Software, 2013, No. 11, pp. 84-86 (3 pages)

Funding: National Natural Science Foundation of China (61175055; 61070213); Zhejiang Provincial Natural Science Foundation (Y1080901)

Abstract: The proliferation of unwanted short messages seriously harms the social climate and disrupts people's normal lives, so developing filtering technology for such messages has considerable practical value. This paper uses the ICTCLAS word segmentation system developed by the Institute of Computing Technology, Chinese Academy of Sciences, together with the TF-IDF term-weighting measure for keyword extraction, to convert short-message text into feature vectors. The kNN method is then used to determine the category of each message, thereby filtering out unwanted messages. In addition, to handle an unbalanced training set, a density-based improved method is applied, which deals fairly effectively with the tendency of the original classification results to favor the larger classes. Experiments show that the improved method reaches an accuracy of about 79.18%, an improvement of about 1.23% over the original method. The method filters unwanted short messages fairly effectively and has some practical value.
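The pipeline described in the abstract (segmented text, TF-IDF feature vectors, kNN voting, and a correction for class imbalance) can be illustrated with a minimal sketch. Several points here are assumptions rather than the authors' implementation: the input is taken to be already segmented into tokens (ICTCLAS is not called), the smoothed TF-IDF formula is one common variant, and because the abstract does not specify the density-based method, the `balance` option substitutes a simple stand-in that divides each neighbor's vote by the size of its class. All function names are hypothetical.

```python
import math
from collections import Counter


def vectorize(tokens, idf):
    """Map one tokenized message to a sparse TF-IDF vector (dict term -> weight)."""
    counts = Counter(t for t in tokens if t in idf)  # ignore unseen terms
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def tfidf_vectors(docs):
    """Build an IDF table from tokenized training messages and vectorize them.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [vectorize(doc, idf) for doc in docs], idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(query, train_vecs, labels, k=3, balance=False):
    """Similarity-weighted kNN vote over the k most similar training vectors.

    With balance=True each vote is divided by the training-set size of the
    neighbor's class. This is a stand-in for the paper's density-based
    improvement, not a reproduction of it.
    """
    scored = [(cosine(query, v), y) for v, y in zip(train_vecs, labels)]
    neighbors = sorted(scored, key=lambda p: p[0], reverse=True)[:k]
    class_sizes = Counter(labels)
    votes = Counter()
    for sim, y in neighbors:
        votes[y] += sim / class_sizes[y] if balance else sim
    return max(votes, key=votes.get)
```

A new message would be classified by tokenizing it, calling `vectorize(tokens, idf)` with the IDF table returned by `tfidf_vectors`, and passing the result to `knn_predict`. Setting `balance=True` reduces the advantage that a large class gains simply by supplying more neighbors.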

Keywords: short message filtering; unbalanced training set; k-nearest neighbor method; vector space model

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
