Improvement of Algorithm for Weight of Characteristic Item in Text Classification

Cited by: 9

Authors: Gong Jing[1], Hu Pingxia[1], Hu Can[1]

Affiliation: [1] Department of Information Technology, Hunan Polytechnic of Environment and Biology, Hengyang, Hunan 421005

Source: Computer Technology and Development, 2014, No. 9, pp. 128-132 (5 pages)

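Funding: Hunan Provincial Education Science and Technology Program Project (07D036); Joint Funding Project of the Hunan Provincial Department of Education and Department of Finance (12C1056)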

Abstract: TF-IDF is a commonly used term-weighting method in text classification. However, it considers only how often a feature term occurs in a document and how frequently that term appears in the training set; it ignores both the distribution of the feature term across classes and the term's semantic information. To address these shortcomings, an improved TF-IDF algorithm is proposed that takes into account the distribution of a feature term within classes as well as semantic factors such as the term's position and length, so that it better reflects the term's importance. Its effectiveness is verified with a Naive Bayes classifier, and the experimental results show that the proposed algorithm outperforms TF-IDF and improves text classification accuracy.
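
To illustrate the limitation described in the abstract, here is a minimal sketch of textbook TF-IDF, assuming raw term counts and an unsmoothed natural-log IDF. The function name `tf_idf` and the toy corpus are hypothetical. The paper's improved weighting is not reproduced, since the abstract does not give its formula.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Textbook TF-IDF: weight(t, d) = tf(t, d) * log(N / df(t))."""
    n_docs = len(docs)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus: (class label, tokenized document)
corpus = [
    ("sports", ["goal", "match", "team"]),
    ("sports", ["goal", "team", "coach"]),
    ("finance", ["stock", "match", "bank"]),
    ("finance", ["stock", "bank", "loan"]),
]
weights = tf_idf([tokens for _, tokens in corpus])

# "goal" occurs only in sports documents; "match" is split across both classes.
# Both appear in 2 of 4 documents, so plain TF-IDF cannot tell them apart.
w_goal = weights[0]["goal"]
w_match = weights[0]["match"]
print(w_goal == w_match)
```

In this sketch the class labels never enter the computation, which is why a class-specific term and a class-neutral term with the same document frequency receive identical weights.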

Keywords: text classification; feature term; weight; improvement

Classification code: TP301 [Automation and Computer Technology: Computer System Architecture]

 
