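Affiliations: [1] Office of Academic Affairs, Nanjing City Vocational College, Nanjing 210038, Jiangsu, China; [2] College of Mathematics, Physics and Information Engineering, Jiaxing University, Jiaxing 314001, Zhejiang, China
Source: Computer Applications and Software (《计算机应用与软件》), 2013, No. 11, pp. 84-86 (3 pages)
Funding: National Natural Science Foundation of China (61175055; 61070213); Natural Science Foundation of Zhejiang Province (Y1080901)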
Abstract: The flood of unwanted short messages (SMS) seriously harms the social climate and disrupts people's normal lives, so developing technology to filter such messages has considerable practical value. This paper applies the ICTCLAS word segmentation system, developed by the Institute of Computing Technology of the Chinese Academy of Sciences, together with the TFIDF term-weighting measure to extract keywords and convert each message text into a feature vector. The kNN method is then used to determine the category of each message, which in turn filters out the unwanted ones. In addition, to handle an unevenly distributed training set, a density-based improvement is applied, which fairly effectively corrects the tendency of the original classifier to favor the larger categories. Experiments show that the improved method reaches an accuracy of about 79.18%, about 1.23% higher than the original method. The method filters unwanted short messages fairly effectively and has some practical value.
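Keywords: short message filtering; unbalanced training set; k-nearest neighbor method; vector space model
Classification number: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]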
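Illustration: the pipeline described in the abstract (segmented text, TFIDF feature vectors, kNN category decision) can be sketched roughly as below. This is a minimal sketch under several assumptions, not the authors' implementation: messages are assumed to be already segmented into tokens (ICTCLAS is not called), the function names are hypothetical, and the `balance` option divides each vote by the size of its class as a simple stand-in for handling an unbalanced training set. The abstract does not give the details of the paper's density-based improvement, so that method is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def vectorize(tokens, idf):
    """Turn one tokenized message into a sparse TF-IDF vector (a dict)."""
    tf = Counter(t for t in tokens if t in idf)  # ignore terms unseen in training
    total = sum(tf.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in tf.items()}


def fit_tfidf(docs):
    """docs: list of token lists. Returns (vectors, idf)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF so that terms present in every document keep a nonzero weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [vectorize(doc, idf) for doc in docs], idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(query, train_vecs, labels, k=3, balance=False):
    """Similarity-weighted kNN vote over the k most similar training vectors.

    balance=True divides each vote by the size of its class. This is an
    assumed stand-in for imbalance handling, NOT the paper's density-based
    method.
    """
    scored = sorted(
        ((cosine(query, v), y) for v, y in zip(train_vecs, labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    class_size = Counter(labels)
    votes = defaultdict(float)
    for sim, label in scored:
        votes[label] += sim / class_size[label] if balance else sim
    return max(votes, key=votes.get)
```

With a training set in which one category dominates, the plain vote can be pulled toward the larger category simply because more of its members appear among the k neighbors; the `balance` option shows one simple way to counteract that effect.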