A Feature Selection Method in Chinese Text Classification Based on Concept Extraction with a Shielded Level  Cited by: 12

Authors: Liao Shasha (廖莎莎)[1], Jiang Minghu (江铭虎)[1]

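Affiliation: [1] Computational Linguistics Laboratory, School of Humanities, Tsinghua University; Cognitive Science Innovation Base, Tsinghua University, Beijing 100084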

Source: Journal of Chinese Information Processing (《中文信息学报》), 2006, No. 3, pp. 22-28 (7 pages)

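Funding: Excellent Young Teachers Program of the Ministry of Education (2051); Open Project Fund of the National Laboratory of Pattern Recognition, Chinese Academy of Sciences (10); 2003 Tsinghua University 985 Phase I Basic Research Fund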

Abstract: This paper proposes a new feature selection method based on concept extraction and a shielded level. The method uses the concept tree of the HowNet concept dictionary: sememes (concept attributes) are extracted according to their positions in the tree, and each is assigned a weight that expresses its description power. A sememe whose weight falls below the shielded level is not selected into the feature set, and the original word is retained instead. For each word, the weights of the sememes in its DEF entry are computed to decide whether to select the original word or to extract its concepts. The paper focuses on how to assign suitable weights to sememes, how to balance word features against concept features, and how to weight words that are not in the dictionary. Experiments show that a properly set shielded level not only reduces the feature dimension and yields some improvement in classification accuracy, but also narrows the differences in classification accuracy among categories.
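The selection rule described in the abstract can be sketched roughly as follows. This is a minimal Python illustration, not the authors' implementation: the toy concept tree `PARENT`, the toy `DEF` table, the depth-based `weight` function, and the rule "keep every sememe at or above the shielded level, otherwise fall back to the original word" are all assumptions, since the abstract does not give the actual weighting formula or decision rule. The sketch also does not model the paper's weighting of non-dictionary words; it simply keeps them as word features.

```python
from typing import Dict, List, Optional

# Hypothetical toy concept tree: sememe -> parent sememe (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "thing": "entity",
    "animate": "thing",
    "human": "animate",
    "animal": "animate",
}

# Hypothetical toy DEF entries: word -> sememes listed in its DEF.
DEF: Dict[str, List[str]] = {
    "teacher": ["human"],
    "object": ["thing"],
}


def depth(sememe: str) -> int:
    """Number of edges between a sememe and the root of the toy tree."""
    d = 0
    node = PARENT[sememe]
    while node is not None:
        d += 1
        node = PARENT[node]
    return d


def weight(sememe: str) -> float:
    """Assumed weighting: deeper (more specific) sememes describe more."""
    return float(depth(sememe))


def select_features(words: List[str], shield_level: float) -> List[str]:
    """Map each word to concept features or keep the word itself."""
    features: List[str] = []
    for word in words:
        sememes = DEF.get(word)
        if sememes is None:
            # Word is not in the dictionary: keep it as a word feature.
            features.append("word:" + word)
            continue
        # Sememes below the shielded level are not selected.
        kept = [s for s in sememes if weight(s) >= shield_level]
        if kept:
            features.extend("concept:" + s for s in kept)
        else:
            # All sememes were shielded: retain the original word.
            features.append("word:" + word)
    return features
```

With a shielded level of 2.0, `select_features(["teacher", "object", "HowNet"], 2.0)` maps "teacher" to the concept feature `concept:human` (depth 3), keeps "object" as a word because its only sememe `thing` sits at depth 1, and keeps "HowNet" as a word because it has no entry in the toy `DEF` table.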

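Keywords: computer applications; Chinese information processing; text classification; feature extraction; concept extraction; attribute feature tree; shielded level; description power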

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
