Label iteration-based clustering ensemble algorithm


Authors: HE Yulin; YANG Jin; HUANG Zhexue; YIN Jianfei [2]

Affiliations: [1] Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Guangdong 518107, China; [2] College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China

Source: Chinese Journal of Intelligent Science and Technology (《智能科学与技术学报》), 2024, No. 4, pp. 466-479 (14 pages)

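Funding: Natural Science Foundation of Guangdong Province, General Program (No. 2023A1515011667); Shenzhen Science and Technology Major Project (No. 202302D074); Guangdong Basic and Applied Basic Research Foundation (No. 2023B1515120020).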

Abstract: Existing clustering ensemble algorithms are mostly trained with a "same data, different algorithms" strategy, which limits their performance on large-scale data and leaves the consensus function with weak adaptability. To address these problems, this paper studies the "different data, same algorithm" training strategy and proposes a label iteration-based clustering ensemble (LICE) algorithm. First, several base clusterers are trained on random sample partition (RSP) data blocks of the original dataset. Second, base clustering results with the same number of clusters are fused using the maximum mean discrepancy criterion, and a heuristic classifier is trained on the RSP data blocks whose labels have been determined. Third, the heuristic classifier iteratively predicts labels for the sample points in the RSP data blocks whose labels are still uncertain, and the sample points whose classification labels agree with their clustering labels are used to strengthen the classifier. Finally, a series of experiments validates the feasibility and effectiveness of LICE. On representative datasets, the normalized mutual information, adjusted Rand index, Fowlkes-Mallows index, and purity of LICE at the 5th iteration improve on their values at the start of iteration by 17.23%, 16.75%, 31.29%, and 12.37% on average, respectively. Compared with seven classical clustering ensemble algorithms on the selected datasets, the same four metrics improve by 11.76%, 16.50%, 9.36%, and 14.20% on average, respectively. The experiments indicate that LICE is an efficient and reasonable clustering ensemble algorithm capable of handling large-scale data clustering problems.
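The abstract does not say how the maximum mean discrepancy (MMD) criterion is applied when fusing base clusterings, so the following is only a minimal sketch under stated assumptions: a Gaussian kernel with a hand-picked `gamma`, the standard biased estimate of squared MMD, and a hypothetical helper `match_clusters` that aligns each cluster from one data block with the lowest-MMD cluster from another block. None of these names or choices come from the paper.

```python
import math
import random


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(X, Y, gamma=1.0):
    """Biased estimate of the squared MMD between samples X and Y."""
    def mean_kernel(A, B):
        total = sum(rbf_kernel(a, b, gamma) for a in A for b in B)
        return total / (len(A) * len(B))
    return mean_kernel(X, X) + mean_kernel(Y, Y) - 2.0 * mean_kernel(X, Y)


def match_clusters(clusters_a, clusters_b, gamma=1.0):
    """Hypothetical helper: for each cluster in clusters_b, return the index
    of the cluster in clusters_a with the smallest squared MMD to it."""
    return [
        min(range(len(clusters_a)),
            key=lambda i: mmd_squared(clusters_a[i], cb, gamma))
        for cb in clusters_b
    ]


def sample_blob(center, n, rng):
    """Draw n points around `center` with independent Gaussian noise (toy data)."""
    return [[rng.gauss(c, 0.3) for c in center] for _ in range(n)]


rng = random.Random(0)
# Two toy "blocks", each already clustered into two groups, listed in different order.
block_a = [sample_blob((0.0, 0.0), 40, rng), sample_blob((3.0, 3.0), 40, rng)]
block_b = [sample_blob((3.0, 3.0), 40, rng), sample_blob((0.0, 0.0), 40, rng)]
mapping = match_clusters(block_a, block_b)
print(mapping)
```

On this toy data the two well-separated groups are matched across blocks even though their cluster indices differ, which is the kind of label alignment a fusion step needs before clusterings from different blocks can be combined.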

Keywords: clustering ensemble algorithm; ensemble learning; random sample partition; maximum mean discrepancy; label iteration
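As a rough illustration of the random sample partition idea named in the keywords, the sketch below shuffles a dataset and deals it into disjoint blocks of near-equal size, so that each block is a random sample of the whole. The function name `random_sample_partition`, the `seed` parameter, and the shuffle-and-deal scheme are assumptions made for illustration; the record does not describe how the paper generates its RSP data blocks.

```python
import random


def random_sample_partition(records, n_blocks, seed=0):
    """Shuffle records and deal them into n_blocks disjoint blocks of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(records)      # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    # Block b takes every n_blocks-th record starting at offset b.
    return [shuffled[b::n_blocks] for b in range(n_blocks)]


data = list(range(1000))          # stand-in for the original dataset
blocks = random_sample_partition(data, n_blocks=10)
block_means = [sum(block) / len(block) for block in blocks]
print(len(blocks), [len(block) for block in blocks])
```

Because every record lands in exactly one block, a base clusterer can be trained on each block independently, and summary statistics such as `block_means` give a quick check that the blocks resemble the full dataset.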

Classification code: TP391.4 [Automation and Computer Technology: Computer Application Technology]

 
