基于距离和密度的PBK-means算法  被引量:2

PBK-means Algorithm Based on Distance and Density

在线阅读下载全文

作  者:魏文浩 唐泽坤 刘刚[1] WEI Wenhao;TANG Zekun;LIU Gang(School of Information Science and Engineering,Lanzhou University,Lanzhou 730000,China)

机构地区:[1]兰州大学信息科学与工程学院,兰州730000

出  处:《计算机工程》2020年第9期68-75,共8页Computer Engineering

基  金:中央高校基本科研业务费专项资金重点项目“基于大数据的城市公共安全风险预警研究”(17LZUJBWZD012);教育部哲学社会科学研究重大课题攻关项目“大数据驱动的城市公共安全风险研究”(16JZD023)。

摘  要:K-means算法初始中心点选择的随机性以及对噪声点的敏感性,使得聚类结果易陷入局部最优解,为获得最佳初始聚类中心,提出一种基于距离和密度的并行二分K-means算法。计算数据集的平均样本距离,根据数据点之间的距离计算数据的权重,选择最大权重数据点作为第一个中心点,小于平均样本距离的数据点不参加下一次聚类,将剩余数据点的权重与中心点距离相乘,选择值最大的数据点作为下一个中心点,得到两个中心点后按照距离对数据进行分配,将每个中心点代表的类分为两类后在每类上继续重复上述步骤。通过模仿细胞分裂的方法对数据进行切分,构建一棵满二叉树,当叶子结点数超过类别数k时停止聚类,合并叶子结点得到k个初始聚类中心执行K-means算法。在UCI公开数据集上进行测试,结果表明,对比传统K-means算法、Canopy-Kmeans算法、二分K-means算法、WK-means算法、MWK-means算法和DCK-means算法,该算法效率更高,具有较好的聚类效果。The randomness of the initial center point selection of the K-means algorithm and its sensitivity to noise points make the clustering result easily fall into the local optimal solution.In order to obtain the best initial clustering center,this paper proposes a parallel bisecting K-means algorithm based on distance and density.The algorithm calculates the average distance between dataset samples.Based on the distance between the data points,the weight of the data is calculated,and the most heavily weighted data is chosen as the first center point.Also,data whose distance from the first center point is less than the average sample distance do not participate in the next round of clustering.The weights of the remaining data points are multiplied with the distance from the selected center point,and the data with the largest value is chosen as the next center point.After the two center points are obtained,the data are distributed according to the distance from them.The classes represented by each center point are divided into two categories,and on each category the above steps are repeated.The algorithm simulates the way of cell division to segment the data,and constructs a full binary tree.When the number of leaf nodes exceeds the number of categories,k,the clustering is stopped and k initial clustering centers are obtained by merging leaf nodes to execute the K-means algorithm.Test results on the UCI public dataset show that the proposed algorithm has higher efficiency and better clustering performance compared with the traditional K-means algorithm,Canopy-Kmeans algorithm,Bisecting K-means algorithm,WK-means algorithm,MWK-means algorithm and DCK-means algorithm.

关 键 词:二分K-means算法 聚类中心 初始中心点 权重 数据挖掘 

分 类 号:TP301.6[自动化与计算机技术—计算机系统结构]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象