检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
机构地区:[1]温州大学瓯江学院,浙江温州325035 [2]温州信息化研究中心,浙江温州325035 [3]湖北文理学院数学与计算机科学学院,湖北襄阳441053 [4]西南大学逻辑与智能研究中心,重庆400715 [5]浙江传媒学院新媒体学院,杭州310018
出 处:《计算机应用》2015年第11期3130-3134,共5页journal of Computer Applications
基 金:浙江省自然科学基金资助项目(LY13F010005);教育部人文社会科学研究项目(15YJAZH015);湖北省科技支撑计划软科学项目(2015BDH109);温州市科技计划项目(R20130021)
摘 要:针对传统文本聚类中存在着聚类准确率和召回率难以平衡等问题,提出了一种基于R-Grams文本相似度计算方法的文本聚类方法。该方法首先通过将待聚类文档降序排列,其次采用R-Grams文本相似度算法计算文本之间的相似度并根据相似度实现各聚类标志文档的确定并完成初始聚类,最后通过对初始聚类结果进行聚类合并完成最终聚类。实验结果表明:聚类结果可以通过聚类阈值灵活调整以适应不同的需求,最佳聚类阈值为15左右。随着聚类阈值的增大,各聚类准确率增大,召回率呈现先增后降的趋势。此外,该聚类方法避免了大量的分词、特征提取等繁琐处理,实现简单。Focusing on the issue that the clustering accuracy rate and recall rate are difficult to balance in traditional text clustering algorithms, a clustering approach based on the R-Grams text similarity computing algorithm was proposed. Firstly, the clustered documents were sorted in descending order; secondly, the symbolic documents were identified and then initial clustering results were achieved by using an R-Grams-based similarity computing algorithm; finally, the final clustering results were completed by combining the initial clustering. The experimental results show that the proposed approach can flexibly regulate the clustering results by adjusting the clustering threshold parameter to satisfy different demands and the optimal parameter is about 15. With the increasing of the clustering threshold, the clustering accuracies increase, and the recalls increase at first, then decrease. In addition, the approach is free from time-consuming processing procedures such as word segmentation and feature extraction and can be easily implemented.
分 类 号:TP391[自动化与计算机技术—计算机应用技术]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.74