Research on Chinese Text Clustering Based on Quantified Lexical Relationships within a Dictionary

Cited by: 1


Authors: Hu Yi (胡熠)[1], Lu Ruzhan (陆汝占)[1], Chen Yuquan (陈玉泉)[1], Liu Hui (刘慧)[1]

Affiliation: [1] Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240

Source: Chinese High Technology Letters (《高技术通讯》), 2007, No. 8, pp. 778-782 (5 pages)

Funding: Supported by the 863 Program (2001AA114210-11) and the National Natural Science Foundation of China (60496326).

Abstract: Given the value of lexical knowledge for improving text clustering, this paper proposes a text similarity measure that uses linear interpolation to combine cosine similarity with the quantified relationships between words in a dictionary. Before clustering, the quantified relationships between a dictionary entry and the words in its definition are constructed under the assumption that an entry and its definition are semantically equivalent. These relationship values serve as the knowledge used in text clustering. Within the framework of the k-means algorithm, the new interpolated similarity measure markedly improves the performance of the text clustering system. The experimental results suggest that quantified lexical relationships derived from an ordinary dictionary may contribute to future research on text clustering.
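The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea: an interpolated similarity of the form sim = λ · cosine + (1 − λ) · lexical. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the toy dictionary, the weighting of each definition word by its relative frequency in the definition, the best-match averaging used for the lexical term, and the value of `lam`.

```python
import math
from collections import Counter, defaultdict


def build_relations(dictionary):
    """Derive word-to-word weights from {entry: [definition words]}.

    Assumption (not from the paper): a definition word is weighted by its
    relative frequency in the definition, and the relation is symmetric.
    """
    rel = defaultdict(dict)
    for entry, definition in dictionary.items():
        counts = Counter(definition)
        total = sum(counts.values())
        for word, count in counts.items():
            weight = count / total
            rel[entry][word] = max(rel[entry].get(word, 0.0), weight)
            rel[word][entry] = max(rel[word].get(entry, 0.0), weight)
    return rel


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexical_similarity(a, b, rel):
    """Hypothetical lexical term: each word is matched to its best-related
    word in the other document, averaged by term count, then symmetrised."""
    def directed(x, y):
        if not x or not y:
            return 0.0
        score = 0.0
        for word, count in x.items():
            if word in y:
                best = 1.0  # identical word counts as a full match
            else:
                related = rel.get(word, {})
                best = max((related.get(v, 0.0) for v in y), default=0.0)
            score += count * best
        return score / sum(x.values())
    return 0.5 * (directed(a, b) + directed(b, a))


def interpolated_similarity(a, b, rel, lam=0.7):
    """Linear interpolation of cosine similarity and the lexical term."""
    return lam * cosine(a, b) + (1.0 - lam) * lexical_similarity(a, b, rel)


# Toy data, invented for illustration only.
toy_dictionary = {"car": ["vehicle", "with", "wheels"]}
relations = build_relations(toy_dictionary)
doc1 = Counter(["car", "road"])
doc2 = Counter(["vehicle", "road"])
print(cosine(doc1, doc2))
print(interpolated_similarity(doc1, doc2, relations))
```

In this toy case the two documents share only the word "road", so cosine similarity alone cannot see that "car" and "vehicle" are related. The lexical term picks up that link from the dictionary definition, which raises the interpolated score. In a k-means setting, a measure of this kind would replace plain cosine similarity in the step that assigns each document to its closest cluster.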

Keywords: text clustering; quantified lexical relationship; linear interpolation; k-means

Classification number: TP311.13 [Automation and Computer Technology: Computer Software and Theory]

 
