A Distributed Indexing Method of Large Scale Document Set (面向海量文档集的分布式索引构建方法)

Authors: WANG Wanle (王万乐), SHI Bing (石冰) [1], CHEN Chi (陈驰) [2]

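Affiliations: [1] School of Computer Science and Technology, Shandong University, Jinan 250101; [2] State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093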

Source: Network New Media Technology (《网络新媒体技术》), 2016, No. 5, pp. 22-27, 39 (7 pages in total)

Abstract: The K-means clustering algorithm is an effective way to partition a document set when building a distributed index. However, index construction based on single-node K-means has two problems when applied to large-scale data: (1) the choice of initial centers strongly influences the clustering result, so the result is unstable; (2) the clustering node easily becomes the bottleneck of the whole system, and the document set scales poorly. To address these problems, this paper proposes an index construction method based on a parallelizable, optimized K-means algorithm. The method selects the initial centers by clustering a sample, which stabilizes the clustering result and improves the index distribution; it also parallelizes the clustering process, which removes the system bottleneck and improves efficiency. Experiments show that, compared with traditional methods, the method significantly improves both index construction efficiency and query result accuracy.
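The two ideas in the abstract (initial centers obtained by clustering a sample, and a K-means iteration split into parallel steps) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the sampling rule, the vector representation, or the MapReduce job layout. The sketch assumes documents are already dense feature vectors (tuples of floats), that partitions are plain Python lists standing in for data blocks on worker nodes, and that all function names (`map_assign`, `reduce_centers`, `sample_initial_centers`, and so on) are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to `point` (ties go to the lower index)."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def map_assign(partition, centers):
    """Map step for one partition: return {cluster_id: (vector_sum, count)}."""
    partial = {}
    for p in partition:
        cid = nearest(p, centers)
        s, n = partial.get(cid, ((0.0,) * len(p), 0))
        partial[cid] = (tuple(a + b for a, b in zip(s, p)), n + 1)
    return partial


def reduce_centers(partials, centers):
    """Reduce step: merge partial sums into new centers.

    A cluster that received no points keeps its previous center.
    """
    totals = {}
    for partial in partials:
        for cid, (s, n) in partial.items():
            ts, tn = totals.get(cid, ((0.0,) * len(s), 0))
            totals[cid] = (tuple(a + b for a, b in zip(ts, s)), tn + n)
    new_centers = []
    for i, old in enumerate(centers):
        if i in totals:
            s, n = totals[i]
            new_centers.append(tuple(x / n for x in s))
        else:
            new_centers.append(old)
    return new_centers


def kmeans(partitions, centers, iters=20):
    """Iterate map and reduce steps until the centers stop changing."""
    centers = list(centers)
    for _ in range(iters):
        # In a real deployment each map_assign call would run on its own worker.
        partials = [map_assign(part, centers) for part in partitions]
        new_centers = reduce_centers(partials, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def sample_initial_centers(partitions, k, sample_size, seed=0):
    """Assumed seeding rule: cluster a small random sample on one node and
    use the resulting centers as the starting point for the full run."""
    rng = random.Random(seed)
    points = [p for part in partitions for p in part]
    sample = rng.sample(points, min(sample_size, len(points)))
    start = rng.sample(sample, k)  # requires k <= len(sample)
    return kmeans([sample], start)
```

Under these assumptions, a full run would call `sample_initial_centers` once and pass its result to `kmeans` over all partitions; each document could then be routed to the index shard given by `nearest(doc_vector, centers)`. Only per-cluster sums and counts leave each partition, which is what keeps any single node from having to hold the whole document set.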

Keywords: clustering; K-means algorithm; MapReduce computing model; distributed indexing

Classification number: TP391.3 [Automation and Computer Technology: Computer Application Technology]

 
