Text Topic Partitioning Based on Topic-Word Frequency Features  Cited by: 11

New text categorization method based on the frequency of topic words


Authors: Kang Kai (康恺)[1], Lin Kunhui (林坤辉)[1], Zhou Changle (周昌乐)[2]

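Affiliations: [1] Software School, Xiamen University, Xiamen, Fujian 361005; [2] School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005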

Source: Journal of Computer Applications (《计算机应用》), 2006, No. 8, pp. 1993-1995 (3 pages)

Funding: Supported by the Xiamen University 985 Project (Phase II) Information Innovation Platform (0000-X07204)

Abstract: The document-term frequency matrix commonly used in text categorization is both very high-dimensional and very sparse, which makes computation difficult. To address this, and motivated by how search engine users pick out the documents they want, the paper proposes a topic partitioning method based on topic-word frequency features. The method first selects the topic words of each text class by statistical means, and then clusters the documents with the fuzzy C-means (FCM) algorithm, using topic-word classes rather than individual words as features. Experiments yielded good topic partitioning results. The method was also compared, in several aspects of both process and results, with a text clustering method based on word clustering, leading to some useful conclusions about implementation and application background.
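The two-stage idea in the abstract (topic-word classes as features, then FCM) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the statistical selection criterion, so the topic-word classes below are hand-picked and hypothetical, as are the names `TOPIC_CLASSES`, `topic_features`, and `fcm`. The farthest-point initialization is likewise an assumption, chosen only to keep the example deterministic.

```python
# Hypothetical topic-word classes. The paper selects these statistically,
# but the abstract does not state the criterion, so they are hand-picked here.
TOPIC_CLASSES = {
    "sports": {"match", "team", "goal", "league"},
    "finance": {"stock", "market", "bank", "profit"},
}

def topic_features(tokens, classes=TOPIC_CLASSES):
    """Return one count per topic-word class instead of one per word."""
    return [float(sum(t in words for t in tokens)) for words in classes.values()]

def _dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _init_centers(points, c):
    """Farthest-point initialization (an assumption, not from the paper)."""
    centers = [list(points[0])]
    while len(centers) < c:
        far = max(points, key=lambda p: min(_dist2(p, ctr) for ctr in centers))
        centers.append(list(far))
    return centers

def fcm(points, c, m=2.0, iters=100, tol=1e-6):
    """Standard fuzzy C-means; returns (memberships, centers)."""
    dim = len(points[0])
    centers = _init_centers(points, c)
    u = []
    for _ in range(iters):
        # Membership update: u_ij is proportional to d_ij^(-2/(m-1)).
        u = []
        for p in points:
            inv = [max(_dist2(p, ctr), 1e-12) ** (-1.0 / (m - 1.0)) for ctr in centers]
            total = sum(inv)
            u.append([v / total for v in inv])
        # Center update: mean of the points weighted by u_ij^m.
        new_centers = []
        for j in range(c):
            w = [row[j] ** m for row in u]
            total = sum(w)
            new_centers.append(
                [sum(wi * p[d] for wi, p in zip(w, points)) / total for d in range(dim)]
            )
        shift = max(_dist2(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return u, centers

# Toy corpus: three sports-like and three finance-like token lists.
docs = [
    ["team", "goal", "match", "league", "team"],
    ["goal", "goal", "team", "match"],
    ["league", "team", "match", "goal", "bank"],
    ["stock", "market", "bank", "profit", "stock"],
    ["bank", "profit", "market", "stock"],
    ["market", "stock", "team", "bank", "profit"],
]
features = [topic_features(d) for d in docs]
memberships, centers = fcm(features, c=2)
labels = [row.index(max(row)) for row in memberships]
```

Each document is reduced to a vector with one entry per topic-word class, so the feature dimension equals the number of classes rather than the vocabulary size. That is the dimensionality and sparsity reduction the abstract describes.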

Keywords: search engine; text clustering; fuzzy C-means; topic word selection

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
