Clustering in Very Large Databases Based on Distance and Density  (Cited by: 14)


Authors: 钱卫宁 (Qian Weining), 宫学庆 (Gong Xueqing), 周傲英 (Zhou Aoying)

Affiliation: [1] Department of Computer Science and Engineering, The Laboratory for Intelligent Information Processing, Fudan University, Shanghai 200433, P.R. China

Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), 2003, No. 1, pp. 67-76 (10 pages)

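Funding: National Key Basic Research Development Program (973 Program), Specialized Research Fund for the Doctoral Program of Higher Education, Microsoft Research Fellowship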

Abstract: Clustering in very large databases or data warehouses, with applications in areas such as spatial computation, web information collection, pattern recognition and economic analysis, is a demanding task that challenges data mining research. Current clustering methods commonly suffer from the following problems: 1) scanning the whole database leads to high I/O cost and expensive maintenance (e.g., R*-tree); 2) the uncertain parameter k must be specified in advance, so the clustering can only be refined through repeated trial and error; 3) they handle clusters of arbitrary shape inefficiently on very large data sets. In this paper, we first present a new hybrid clustering algorithm to solve these problems. This new algorithm, which combines both distance and density strategies, can handle clusters of arbitrary shape effectively. It makes full use of statistical information gathered during mining to greatly reduce the time complexity while keeping good clustering quality. Furthermore, this algorithm can easily eliminate noise and identify outliers. An experimental evaluation is performed on a spatial database, comparing this method with other popular clustering algorithms (CURE and DBSCAN). The results show that our algorithm outperforms them in terms of efficiency and cost, and achieves even greater speedup as the data size grows.

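Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory] P208 [Automation and Computer Technology: Computer Science and Technology]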

 
