Clustering method for symbolic sequences using probability vectors


Authors: Cheng Lingfang [1], Chen Lifei [2]

Affiliations: [1] Jinshan College of Fujian Agriculture & Forestry University, Fuzhou 350002, China; [2] School of Mathematics & Computer Science, Fujian Normal University, Fuzhou 350117, China

Source: Application Research of Computers (计算机应用研究), 2018, No. 6, pp. 1676-1680 (5 pages)

Funding: National Natural Science Foundation of China (61672157)

Abstract: Defining an efficient representation model and a meaningful distance measure between sequences is a difficult problem in symbolic sequence clustering. To address it, this paper proposes a representation model based on probability vectors and a clustering algorithm built on that model. The model represents each symbolic sequence by a probability distribution, defines a distance measure between sequences based on the dissimilarity of their probability distributions, and defines a clustering objective function in the resulting probability vector space. A K-means-type clustering algorithm for symbolic sequences is then given and evaluated on real-world sequence sets from various domains. The experimental results show that, on real data with a relatively large number of symbols such as DNA sequences and speech sequences, the proposed method effectively improves clustering accuracy while reducing clustering time by more than 50%, compared with existing symbolic sequence clustering algorithms that use a subsequence-based representation model.

Keywords: data clustering; symbolic sequences; vector space model; probability vector; Markov model

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
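
The abstract outlines the pipeline only at a high level: represent each sequence as a probability vector, measure distance between those vectors, and run a K-means-type loop. The following is a minimal sketch of that general idea and not the paper's algorithm. It assumes, purely for illustration, that the probability vector is the relative frequency of each symbol and that the distance is squared Euclidean; the record gives neither the paper's actual distribution model (the keywords mention a Markov model) nor its distance measure or objective function. All function names are hypothetical.

```python
import random
from collections import Counter


def probability_vector(sequence, alphabet):
    """Map a non-empty symbolic sequence to its relative symbol frequencies.

    Assumption: a zero-order (single-symbol) distribution stands in for
    whatever probability model the paper actually uses.
    """
    counts = Counter(sequence)
    total = len(sequence)
    return [counts[s] / total for s in alphabet]


def squared_distance(p, q):
    """Squared Euclidean distance between two probability vectors.

    Assumption: an illustrative stand-in for the paper's distance measure.
    """
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_vector(vectors):
    """Component-wise mean; a mean of probability vectors still sums to 1."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans_sequences(sequences, k, max_iter=100, seed=0):
    """K-means-type clustering of sequences in probability-vector space."""
    alphabet = sorted(set().union(*sequences))
    vectors = [probability_vector(s, alphabet) for s in sequences]

    # Initialise the centers with k vectors drawn at random from the data.
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]

    labels = None
    for _ in range(max_iter):
        # Assignment step: each sequence joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(v, centers[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the loop has converged
        labels = new_labels

        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean_vector(members)
    return labels, centers


if __name__ == "__main__":
    toy = ["AAAAAAAT", "AAAAAATT", "GGGGGGGC", "GGGGGGCC"]
    labels, centers = kmeans_sequences(toy, k=2)
    print(labels)
```

Because every sequence is reduced to a fixed-length vector once, each iteration costs time proportional to the number of sequences times the alphabet size, regardless of sequence length. This is one plausible reading of why a vector-space representation could be faster than subsequence-based models, though the record does not state the paper's own analysis.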

 
