An improved DBSCAN algorithm based on the approximate EMD

Cited by: 5

Authors: ZHANG Hongbing (张宏兵)[1], LU Jianfeng (陆建峰)[1], TANG Jiubin (汤九斌)[2]

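Affiliations: [1] School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, Jiangsu, China; [2] China Telecom Jiangsu Company, Nanjing 210037, Jiangsu, China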

Source: Journal of Shandong University (Engineering Science), 2012, No. 4, pp. 35-40 (6 pages)

Funding: Natural Science Foundation of Jiangsu Province (BK2009489); Qing Lan Project of Jiangsu Province

Abstract: DBSCAN (density-based spatial clustering of applications with noise) is a classic density-based clustering algorithm. When it is applied to high-dimensional data, however, commonly used distance functions do not reflect the relationships between data points well, which can make the resulting clusters inaccurate. Adopting a suitable distance measure in high-dimensional space would improve the clustering result. To address this problem, the approximate EMD (earth mover's distance) is used as the distance measure, and clustering is achieved by iteratively searching for all directly density-reachable objects. Experimental results show that, for clustering high-dimensional text data, the accuracy of the improved algorithm is 6% higher than that of the original algorithm, with little difference in running time. For the low-dimensional Iris data, the improved algorithm uses EMD to improve the similarity measure between instances, reduces the number of data points classified as noise, and raises the average accuracy by 10%. These results indicate that the improved algorithm is effective for high-dimensional data and can improve clustering performance.
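The idea in the abstract, DBSCAN with a pluggable distance and an iterative search for directly density-reachable objects, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the approximate EMD is computed, so the sketch assumes the simplest stand-in, the exact one-dimensional EMD between two histograms over the same ordered bins (the sum of absolute differences of their cumulative distributions). The names `emd_1d`, `region_query`, and `dbscan`, and the sample data, are hypothetical.

```python
from collections import deque
from itertools import accumulate

NOISE = -1
UNVISITED = None


def emd_1d(p, q):
    """EMD between two histograms over the same ordered bins.

    Assumption: unit spacing between adjacent bins. In one dimension the
    EMD equals the L1 distance between the two normalized CDFs. This is a
    stand-in for the paper's approximate EMD, which the abstract does not define.
    """
    sp, sq = sum(p), sum(q)
    cdf_p = accumulate(x / sp for x in p)
    cdf_q = accumulate(x / sq for x in q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def region_query(points, i, eps, dist):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]


def dbscan(points, eps, min_pts, dist=emd_1d):
    """Plain DBSCAN with a caller-supplied distance function."""
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps, dist)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = deque(neighbours)  # iterative expansion instead of recursion
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = region_query(points, j, eps, dist)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels


# Hypothetical toy data: seven 4-bin histograms.
histograms = [
    [4, 1, 0, 0], [3, 2, 0, 0], [4, 0, 1, 0],  # mass on the left bins
    [0, 0, 1, 4], [0, 0, 2, 3], [0, 1, 0, 4],  # mass on the right bins
    [1, 1, 1, 1],                              # uniform, far from both groups
]
print(dbscan(histograms, eps=0.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Because the distance is passed in as a function, a different approximation of the EMD (or an ordinary Euclidean distance, for comparison) can be substituted without changing the clustering loop. The brute-force neighbourhood query is quadratic in the number of points and is meant only to keep the sketch short.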

Keywords: clustering; DBSCAN algorithm; approximate EMD; high-dimensional data

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
