检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
出 处:《微电子学与计算机》2005年第8期93-95,共3页Microelectronics & Computer
基 金:国家自然科学基金资助项目(69982001);国家"863计划"资助项目(2001AA114201)
摘 要:基于类的统计语言模型是解决统计模型数据稀疏问题的重要方法。传统的统计方法基于贪婪原则,常以语料的似然函数或困惑度(perplexity)作为评价标准。传统的聚类方法的主要缺点是聚类速度慢,初值对结果影响大,易陷入局部最优。本文提出了词相似度定义、词集合相似度定义,一种自下而上的分层聚类算法。这种方法不但能改善聚类效果,而且可根据不同的模型选择不同的相似度定义,从而提高聚类的使用效果。Cluster-based statistic language model is an important method to solve the problem of sparse data. Conventional statistical clustering methods usually base on greedy principle. The common Metric for evaluating a clustering algorithm is the likelihood function or perplexity of the corpus. Conventional clustering algorithms often converge to a local optimum, so global optimum is not guaranteed, and initial choices can influence final result. The authors try to solve above problems in this paper. This paper presents a novel definition of word similarity. Based on word similarity, this paper gives the definition of word set similarity, and proposes a bottom-up hierarchical clustering algorithm based on similarity. This method not only improves clustering effect, but also can choice different similarity definition for different cluster-based model, such as predictive clustering, conditional clustering, and combined clustering, thus improved the effect of using clusters. Experiments show that word clustering algorithm based on similarity is better than conventional greedy clustering method in speed and performance.
分 类 号:TP391.1[自动化与计算机技术—计算机应用技术]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.28