A semi-supervised clustering algorithm based on seeds and pair-wise constraints  Cited by: 7

Authors: Chang Yu (常瑜)[1,2], Liang Jiye (梁吉业)[1,2], Gao Jiawei (高嘉伟)[1,2], Yang Jing (杨静)[1,2]

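Affiliations: [1] School of Computer and Information Technology, Shanxi University, Taiyuan 030006; [2] Key Laboratory of Computational Intelligence and Chinese Information Processing of the Ministry of Education, Taiyuan 030006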

Source: Journal of Nanjing University (Natural Science), 2012, No. 4, pp. 405-411 (7 pages)

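Funding: National Natural Science Foundation of China (71031006; 70971080); National "973" Program Preliminary Research Special Project (2011CB311805); Specialized Research Fund for the Doctoral Program of Higher Education (20101401110002)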

Summary: Semi-supervised clustering studies how to use a small amount of supervision information to improve clustering performance, and it has become an active research topic in machine learning. Most existing semi-supervised clustering methods do not consider the two kinds of supervision information, Seeds sets and pair-wise constraints, together. This paper therefore proposes a semi-supervised clustering algorithm based on Seeds sets and pair-wise constraints. The algorithm uses the Tri-training algorithm to enlarge the Seeds set, then combines pair-wise constraints to optimize the Seeds set and to guide the clustering process. Experimental results show that the algorithm effectively improves clustering performance.

Abstract: Semi-supervised learning, an application-driven machine learning method, has become one of the hot topics in artificial intelligence and pattern recognition. As a main branch of semi-supervised learning, semi-supervised clustering introduces a small amount of supervision information into the search for an optimal clustering. Several kinds of semi-supervised clustering algorithms have been proposed recently, including search-based methods, similarity-based methods, and methods that combine search and similarity. However, most current semi-supervised clustering algorithms do not use seeds and pair-wise constraints at the same time. A semi-supervised clustering algorithm based on seeds and pair-wise constraints is therefore introduced in order to make full use of the given supervision information. Tri-training is a representative method based on the Co-training mechanism. Because Tri-training uses three classifiers to label unlabeled samples, the proposed algorithm uses it to obtain more labeled samples. First, based on the Tri-training method, some unlabeled samples are selected and annotated to enlarge the initial set of labeled samples. Second, pair-wise constraints are used to optimize the enlarged set of labeled samples in order to improve its quality. Third, the initial cluster centers are obtained from the optimized labeled samples. Finally, the K-Means algorithm is carried out, and during the search process pair-wise constraints are used to modify the partitioning result at each iteration. The proposed algorithm is compared with the K-Means, Seeded-K-Means and COP-K-Means algorithms. Experimental results on three UCI data sets under the same settings demonstrate that this method can take full advantage of the given supervision information and obtain a better clustering result. Moreover, an experiment on the Haberman data set is conducted to analyze the relative impact of the number of pair-wise constraints and the number of labeled samples on the algorithm's performance. Experimental results
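
The last two steps described in the abstract (initial centers taken from labeled seeds, then K-Means with pair-wise constraints checked during assignment) can be illustrated with a minimal sketch. This is not the authors' implementation: the Tri-training step and the seed-optimization step are omitted, and the function names, the greedy in-order assignment, and the fallback to the nearest center when no cluster satisfies the constraints are all assumptions made for illustration. It also assumes seed labels are the integers 0..k-1 and that every cluster has at least one seed.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def violates(i, c, labels, must_link, cannot_link):
    """True if putting point i in cluster c breaks a constraint with a
    point that has already been assigned in the current pass."""
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            if labels[other] is not None and labels[other] != c:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            if labels[other] == c:
                return True
    return False


def seeded_constrained_kmeans(points, seeds, must_link, cannot_link, max_iter=50):
    """Hypothetical helper: seeds maps point index -> cluster label (0..k-1)."""
    k = len(set(seeds.values()))
    # Initial centers come from the labeled seeds (Seeded-K-Means style).
    centers = [mean([points[i] for i, lab in seeds.items() if lab == c])
               for c in range(k)]
    previous = None
    for _ in range(max_iter):
        labels = [None] * len(points)
        for i, p in enumerate(points):
            # Try clusters from nearest to farthest; take the first feasible one.
            order = sorted(range(k), key=lambda c: dist(p, centers[c]))
            feasible = [c for c in order
                        if not violates(i, c, labels, must_link, cannot_link)]
            # Assumption: fall back to the nearest center if nothing is feasible.
            labels[i] = feasible[0] if feasible else order[0]
        if labels == previous:  # assignments are stable, so stop
            break
        previous = labels
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mean(members)
    return labels, centers


# Toy data: two groups plus one point, (4, 4), that lies nearer to the first group.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (4, 4)]
seeds = {0: 0, 3: 1}             # point 0 is labeled cluster 0, point 3 cluster 1
must_link = [(1, 2)]             # points 1 and 2 must share a cluster
cannot_link = [(0, 3), (0, 6)]   # point 6 may not join point 0's cluster
labels, centers = seeded_constrained_kmeans(points, seeds, must_link, cannot_link)
print(labels)
```

In the toy data, the cannot-link constraint between points 0 and 6 pushes point 6 into the second cluster even though the first cluster's center is closer, which is the kind of correction that pair-wise constraints contribute during the K-Means search.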

Keywords: semi-supervised clustering; Seeds set; pair-wise constraints

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
