Feature Selection Based on Fuzzy Relation for Text Categorization (基于模糊关系的文本分类特征选择方法)  Cited by: 1


Authors: Zhen Zhilong (甄志龙)[1,2], Han Lixin (韩立新)[1], Lu Dianlong (陆佃龙)[1]

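Affiliations: [1] College of Computer and Information Engineering, Hohai University, Nanjing 210098; [2] Department of Computer Science, Tonghua Normal University, Tonghua 134002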

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), 2008, No. 6, pp. 851-856 (6 pages)

Funding: Supported by the National Natural Science Foundation of China (Nos. 60673186 and 60571048).

Abstract: Effective text categorization depends on reducing the dimensionality of a high-dimensional feature space, which can be done by feature selection or feature extraction. An analysis of existing feature selection methods shows that they select features using only document counts and do not take term weights into account. To discover the essential features by taking full advantage of term weights, training samples, and classes, we propose a feature selection method based on the fuzzy relation between terms and classes, in which term weights determine the degree of membership. With a K-Nearest Neighbor (KNN) classifier trained and tested on the Reuters-21578 standard corpus, the proposed method achieves the highest Macro-F1 and Micro-F1, 81.82% and 94.88% respectively. Its Macro-F1 is 4.73% and 1.12% higher than that of IG and CHI, respectively, and its Micro-F1 is 1.56% and 0.21% higher.
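
The abstract does not state the membership formula, so the following Python sketch only illustrates the general idea of scoring terms by a weight-based fuzzy membership in classes. It assumes (this is not taken from the paper) that the membership of term t in class c is the share of t's total weight that falls in documents of class c, and that a term is scored by its largest membership over all classes. The function names `fuzzy_membership_scores` and `select_features` are hypothetical.

```python
from collections import defaultdict


def fuzzy_membership_scores(docs, labels):
    """Score each term by its strongest (assumed) fuzzy membership in a class.

    docs   : list of {term: weight} dicts, e.g. TF-IDF weights per document
    labels : list of class labels, one per document

    Assumed membership (an illustration, not the paper's formula):
        mu(t, c) = weight of t in documents of class c / weight of t overall
    """
    per_class = defaultdict(lambda: defaultdict(float))  # term -> class -> weight
    total = defaultdict(float)                            # term -> overall weight
    for doc, label in zip(docs, labels):
        for term, weight in doc.items():
            per_class[term][label] += weight
            total[term] += weight
    # Keep the largest membership over all classes as the term's score.
    return {
        term: max(per_class[term].values()) / total[term]
        for term in total
        if total[term] > 0
    }


def select_features(docs, labels, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = fuzzy_membership_scores(docs, labels)
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]


if __name__ == "__main__":
    docs = [
        {"wheat": 0.9, "price": 0.3},
        {"wheat": 0.6, "price": 0.2},
        {"oil": 0.8, "price": 0.5},
    ]
    labels = ["grain", "grain", "crude"]
    print(select_features(docs, labels, k=2))
```

In this toy input, "price" carries equal weight in both classes and therefore ranks below "wheat" and "oil", each of which occurs in a single class. Under this assumed score a term seen in only one document also receives the maximum membership, so a practical variant would likely need a minimum-frequency threshold.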

Keywords: text categorization; term weight; fuzzy relation; feature selection

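Classification: TP391.4 [Automation and Computer Technology: Computer Application Technology]; TP391 [Automation and Computer Technology: Computer Science and Technology]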

 
