A semantics-based term weighting approach for XML document clustering

Source: Journal of Changsha University of Science and Technology: Natural Science, 2015, No. 2, pp. 72-77 (6 pages)

Funding: National Natural Science Foundation of China (61303043)

Abstract: In XML document retrieval, clustering the search results is an effective way to improve retrieval performance, and the document distance measure is the key factor affecting clustering quality. The TF×IDF method, as applied to clustering XML search results, handles the frequency factor and the length factor unreasonably and fails to emphasize important terms. To address these shortcomings, a new weighting scheme based on the frequency factor and the length factor is proposed. LSI (latent semantic indexing) is introduced when building the vector space model; it establishes semantic relationships between terms and reduces the noise in the original term-document matrix, improving both clustering speed and precision. Experiments on the IEEE data set without category information show that, compared with similar similarity calculation methods and clustering methods, the proposed method improves both clustering speed and clustering quality.
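To make the roles of the frequency factor and the length factor concrete, the following is a minimal sketch of a conventional TF×IDF weighting with a cosine distance between documents. It is not the weighting scheme proposed in the paper, whose formula the abstract does not give. The log-damped term frequency, the smoothed IDF, the cosine length normalization, the function names and the toy token lists are all assumptions chosen for illustration, and the LSI projection step is not covered.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant (not the paper's scheme): log-damped term frequency,
    smoothed inverse document frequency, cosine length normalization.
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Frequency factor: 1 + log(tf) grows more slowly than raw tf.
        vec = {
            t: (1.0 + math.log(c)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        }
        # Length factor: scale to unit Euclidean length so that long
        # documents do not dominate the distance computation.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors


def cosine_distance(u, v):
    """Cosine distance between two unit-length sparse vectors."""
    return 1.0 - sum(w * v[t] for t, w in u.items() if t in v)


# Hypothetical token lists standing in for text extracted from XML documents.
docs = [
    ["xml", "clustering", "term", "weighting", "xml"],
    ["xml", "clustering", "distance", "measure"],
    ["image", "compression", "wavelet"],
]
vecs = tfidf_vectors(docs)
print(cosine_distance(vecs[0], vecs[1]))  # documents sharing terms
print(cosine_distance(vecs[0], vecs[2]))  # documents with no shared terms
```

In this sketch the two XML-related documents come out closer to each other than either is to the unrelated one, which is the property a clustering algorithm relies on when it groups documents by distance.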

Keywords: latent semantic indexing; search result clustering; weighting algorithm; clustering algorithm

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]
