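Affiliations: [1] School of Computer Science and Technology, Shandong University, Jinan 250101; [2] State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093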
Source: Network New Media Technology (《网络新媒体技术》), 2016, No. 5, pp. 22-27, 39 (7 pages in total)
Abstract: The Kmeans clustering algorithm is a relatively effective method of partitioning a document set when building a distributed index. However, index construction based on single-node Kmeans has two problems when applied to massive data: (1) the choice of initial center points strongly affects the clustering result, so the result is unstable; (2) the clustering node easily becomes a bottleneck for the whole system, and the document collection scales poorly. To address these problems, the paper proposes an index construction method based on a parallelizable, optimized Kmeans algorithm. The method selects the initial points by clustering a sample, which keeps the clustering result stable and improves the index distribution. It also parallelizes the clustering process, which removes the system bottleneck and improves efficiency. Experiments show that, compared with traditional methods, the method significantly improves both the efficiency of index construction and the accuracy of query results.
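Keywords: clustering; Kmeans algorithm; MapReduce computing model; distributed index
Classification number: TP391.3 [Automation and Computer Technology: Computer Application Technology]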
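
The two ideas named in the abstract (choosing initial centers by clustering a sample, then running Kmeans as map and reduce steps over document partitions) can be illustrated with a minimal sketch. This is not the paper's implementation: the record gives no algorithmic details, so the function names (sample_init, map_phase, reduce_phase, kmeans_rounds), the use of plain numeric tuples in place of document vectors, and the in-process simulation of MapReduce are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to point (first index wins ties)."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def map_phase(partition, centers):
    """Mapper (hypothetical): per-cluster partial sums and counts for one partition."""
    partial = {}
    for p in partition:
        cid = nearest(p, centers)
        s, n = partial.get(cid, ([0.0] * len(p), 0))
        partial[cid] = ([a + b for a, b in zip(s, p)], n + 1)
    return partial


def reduce_phase(partials, centers):
    """Reducer (hypothetical): merge the partial sums and recompute each center."""
    totals = {}
    for partial in partials:
        for cid, (s, n) in partial.items():
            ts, tn = totals.get(cid, ([0.0] * len(s), 0))
            totals[cid] = ([a + b for a, b in zip(ts, s)], tn + n)
    new_centers = []
    for cid, old in enumerate(centers):
        if cid in totals:
            s, n = totals[cid]
            new_centers.append(tuple(v / n for v in s))
        else:
            new_centers.append(old)  # an empty cluster keeps its previous center
    return new_centers


def kmeans_rounds(partitions, centers, iters=10):
    """Repeat map and reduce until the centers stop moving or iters is reached."""
    for _ in range(iters):
        partials = [map_phase(part, centers) for part in partitions]
        new_centers = reduce_phase(partials, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def sample_init(points, k, sample_size, seed=0):
    """Assumed form of sample-based initialization: cluster a small random sample
    on a single node and use the resulting centers as the starting points."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    return kmeans_rounds([sample], sample[:k])


# Toy data: two well-separated groups, split across two simulated partitions.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
init = sample_init(points, k=2, sample_size=8)
final = kmeans_rounds([points[:4], points[4:]], init)
print(sorted(final))
```

In a real deployment each map_phase call would run on a separate worker and only the small per-cluster sums would travel to the reducer, which is what would keep a single clustering node from becoming the bottleneck the abstract describes.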