Improvement of TFIDF Feature Selection Algorithm in Web Document Classification  (Cited by: 4)


Authors: DUAN Guo-lun; XIE Jun; GUO Lei-lei; WANG Xiao-ying (School of Command Control Engineering, Army Engineering University of PLA, Nanjing 210007, China; School of Communications Engineering, Army Engineering University of PLA, Nanjing 210007, China)

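Affiliations: [1] School of Command Control Engineering, Army Engineering University of PLA, Nanjing, Jiangsu 210007 [2] School of Communications Engineering, Army Engineering University of PLA, Nanjing, Jiangsu 210007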

Source: Computer Technology and Development (《计算机技术与发展》), 2019, No. 5, pp. 49-53 (5 pages)

Funding: National Natural Science Foundation of China (61101202)

Abstract: As massive data resources appear on the web, Web document classification has attracted growing attention, and feature selection is an important part of that research. Feature selection effectively reduces the dimensionality of the text vector space model, which yields prediction models that are faster and cheaper to run. The traditional TFIDF algorithm judges how important a feature word is for document classification solely from the word's term frequency and inverse document frequency in the documents that contain it. It ignores how feature terms are distributed within and between classes, as well as imbalance in the data set, which limits its effectiveness. To address these shortcomings, an intra-class distribution factor and an inter-class distribution factor are proposed. Replacing the inverse document frequency with these two factors allows the improved expression to select more effective feature words. In comparative text classification experiments with an SVM classifier, the method improves the F_1 value to some extent over the unimproved method, and it also classifies well on imbalanced data sets.
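The abstract does not give the paper's formulas, so the following is only a minimal Python sketch of the ideas it names. `tfidf` implements the classic weighting, tf(t, d) · log(N / df(t)). `class_distribution_factors` is a hypothetical illustration, not the authors' definition: it assumes the intra-class factor is the fraction of a class's documents that contain the term, and the inter-class factor is the largest such fraction divided by their sum over all classes. The toy corpus and all function names are invented for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Classic TF-IDF over tokenized documents: tf(t, d) * log(N / df(t))."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


def class_distribution_factors(docs, labels, term):
    """Hypothetical factors (an assumption, not the paper's formulas).

    intra[c]: fraction of documents in class c that contain the term.
    inter:    largest intra value divided by the sum of intra values,
              i.e. how strongly the term concentrates in a single class.
    """
    intra = {}
    for c in sorted(set(labels)):
        in_class = [d for d, lab in zip(docs, labels) if lab == c]
        intra[c] = sum(term in d for d in in_class) / len(in_class)
    total = sum(intra.values())
    inter = max(intra.values()) / total if total else 0.0
    return intra, inter


# Toy corpus (invented for illustration).
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "score"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]

weights = tfidf(docs)
goal_intra, goal_inter = class_distribution_factors(docs, labels, "goal")
team_intra, team_inter = class_distribution_factors(docs, labels, "team")
print(weights[0])
print(goal_intra, goal_inter)
print(team_intra, team_inter)
```

In this sketch the intra-class values are per-class rates rather than raw counts, so a large class does not dominate a small one simply by having more documents. That is one plausible way to reflect the abstract's concern with imbalanced data sets; the paper itself may define the factors differently.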

Keywords: Web document classification; feature selection; TFIDF algorithm; SVM

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
