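Affiliation: [1] School of Information Science and Engineering, University of Jinan, Jinan 250022, China
Source: Application Research of Computers (《计算机应用研究》), 2008, No. 4, pp. 986-988 (3 pages)
Funding: National Natural Science Foundation of China (60573065); National "863" Program (2002AA4Z3240); Ministry of Education World Bank Loan Project for Higher Education Teaching Reform in the Early 21st Century (1283B0843)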
Abstract: K-means clustering is currently one of the better-performing algorithms for text classification, and its similarity calculation is usually based on word-frequency statistics. For short documents or simple sentences, word frequencies are too low, so clustering with this algorithm performs poorly. To address this, the paper proposes a similarity calculation algorithm based on word association. It first runs an association rule algorithm on a set of simple documents to obtain keyword-based association rules, and derives a word association matrix from those rules. It then represents each text as a feature vector built from word weights. Finally, it computes sentence similarity from the association matrix and the text feature vectors. Experiments show that the algorithm yields good clustering results, and that it draws on the relationships between words in addition to word-frequency statistics.
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
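The abstract does not give the paper's actual formulas, so the following is only a minimal sketch of the pipeline it outlines: an association matrix, weighted feature vectors, and a similarity that combines the two. Everything specific here is an assumption for illustration: the function names, the use of pairwise co-occurrence confidence in place of a full association rule miner, raw term counts as weights, and the normalized bilinear form used as the similarity.

```python
from itertools import combinations
from math import sqrt


def association_matrix(docs, vocab):
    """Build a symmetric word-association matrix (assumed form).

    assoc[i][j] is the smaller of the two rule confidences
    conf(i -> j) and conf(j -> i), estimated from document-level
    co-occurrence. The diagonal is fixed at 1.0.
    """
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    support = [0] * n                      # documents containing word i
    pair = [[0] * n for _ in range(n)]     # documents containing both i and j
    for doc in docs:
        present = sorted({index[w] for w in doc if w in index})
        for i in present:
            support[i] += 1
        for i, j in combinations(present, 2):
            pair[i][j] += 1
            pair[j][i] += 1
    assoc = [[0.0] * n for _ in range(n)]
    for i in range(n):
        assoc[i][i] = 1.0
        for j in range(n):
            if i != j and pair[i][j]:
                assoc[i][j] = min(pair[i][j] / support[i],
                                  pair[i][j] / support[j])
    return assoc


def feature_vector(doc, vocab):
    """Represent a tokenized text by raw term counts (assumed weighting)."""
    return [float(doc.count(w)) for w in vocab]


def similarity(u, v, assoc):
    """Association-aware similarity: u^T A v, normalized like a cosine."""
    def bilinear(a, b):
        return sum(a[i] * assoc[i][j] * b[j]
                   for i in range(len(a)) for j in range(len(b)))
    denom = sqrt(bilinear(u, u) * bilinear(v, v))
    # Clamp to 1.0 because the assumed matrix is not guaranteed to be
    # positive semidefinite, so the ratio is not strictly bounded.
    return min(1.0, bilinear(u, v) / denom) if denom else 0.0


# Toy corpus of short, already tokenized documents (illustrative only).
corpus = [
    ["apple", "fruit"],
    ["apple", "fruit", "sweet"],
    ["car", "engine"],
    ["car", "road"],
]
vocab = sorted({w for doc in corpus for w in doc})
assoc = association_matrix(corpus, vocab)

apple = feature_vector(["apple"], vocab)
fruit = feature_vector(["fruit"], vocab)
car = feature_vector(["car"], vocab)
```

In this sketch, two one-word texts that share no terms, such as `apple` and `fruit`, still receive a nonzero similarity because the words co-occur in the corpus, whereas a plain term-frequency cosine would score them as 0. The resulting similarities could then be supplied to a K-means style clustering step.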