Word Clustering Based on Similarity (基于相似度的词聚类算法)  Cited by: 4


Authors: Yuan Lichi (袁里驰)[1], Zhong Yixin (钟义信)[1]

Affiliation: [1] School of Information Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China

Source: Microelectronics & Computer (《微电子学与计算机》), 2005, No. 8, pp. 93-95 (3 pages)

Funding: National Natural Science Foundation of China (69982001); National "863 Program" of China (2001AA114201)

Abstract: Class-based statistical language models are an important way to address the data sparseness problem in statistical language modeling. Conventional statistical clustering methods are usually based on a greedy principle and use the likelihood function or the perplexity of the corpus as the evaluation criterion. Their main drawbacks are slow clustering, strong dependence of the result on the initial choices, and a tendency to converge to a local optimum, so a global optimum is not guaranteed. To address these problems, this paper presents a new definition of word similarity, derives from it a definition of word-set similarity, and proposes a bottom-up hierarchical clustering algorithm based on similarity. The method not only improves the clustering result but also allows a different similarity definition to be chosen for each cluster-based model, such as predictive clustering, conditional clustering, and combined clustering, which improves the effectiveness of the resulting clusters. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance.
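The bottom-up procedure described in the abstract can be illustrated with a small sketch. The abstract does not state the paper's actual similarity definitions, so everything specific here is an assumption made for illustration: word similarity is taken to be the cosine similarity of next-word count vectors, word-set similarity is taken to be the average pairwise word similarity, and the function names (`context_vectors`, `word_similarity`, `set_similarity`, `cluster_words`) are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def context_vectors(tokens):
    """Map each word to a Counter of the words that immediately follow it."""
    vectors = {word: Counter() for word in set(tokens)}
    for left, right in zip(tokens, tokens[1:]):
        vectors[left][right] += 1
    return vectors


def word_similarity(u, v):
    """Cosine similarity of two sparse count vectors (assumed measure)."""
    dot = sum(count * v[key] for key, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def set_similarity(a, b, vectors):
    """Average pairwise word similarity of two word sets (assumed measure)."""
    total = sum(word_similarity(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster_words(tokens, num_clusters):
    """Start with one cluster per word and repeatedly merge the most
    similar pair of clusters until num_clusters remain."""
    vectors = context_vectors(tokens)
    clusters = [frozenset([word]) for word in sorted(vectors)]
    while len(clusters) > num_clusters:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: set_similarity(pair[0], pair[1], vectors),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


if __name__ == "__main__":
    toy_corpus = "the cat sat the dog sat a cat ran a dog ran".split()
    for cluster in cluster_words(toy_corpus, 3):
        print(sorted(cluster))
```

Because the similarity is passed through a single function, swapping in a different definition for a predictive, conditional, or combined clustering model would only mean replacing `word_similarity`; that is one plausible reading of the flexibility the abstract claims, not a description of the paper's implementation.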

Keywords: word similarity; word clustering; statistical language model

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
