Authors: 段国仑 (DUAN Guo-lun); 谢钧 (XIE Jun); 郭蕾蕾 (GUO Lei-lei); 王晓莹 (WANG Xiao-ying)
Affiliations: [1] School of Command Control Engineering, Army Engineering University of PLA, Nanjing 210007, Jiangsu, China [2] School of Communications Engineering, Army Engineering University of PLA, Nanjing 210007, Jiangsu, China
Source: Computer Technology and Development (《计算机技术与发展》), 2019, No. 5, pp. 49-53 (5 pages)
Funding: National Natural Science Foundation of China (61101202)
Abstract: As massive data resources have appeared on the network, Web document classification has received growing attention. Feature selection is central to this research: it reduces the dimensionality of the text vector space model, which yields faster prediction models that consume fewer resources. The traditional TFIDF algorithm judges how important a feature word is for document classification using only its term frequency and inverse document frequency. It ignores how feature terms are distributed within and between classes, as well as imbalance in the data set, and this limits its effectiveness. To address these shortcomings, the paper proposes an intra-class distribution factor and an inter-class distribution factor. Replacing the inverse document frequency with these two factors lets the improved expression select more effective feature words. In comparative text classification experiments with an SVM classifier, the method improves the F1 value to a certain extent over the unimproved method, and it also classifies well on imbalanced data sets.
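The limitation of classic TFIDF named in the abstract can be illustrated with a minimal Python sketch. It implements only the textbook baseline (term frequency times log inverse document frequency); this record does not give the paper's intra-class and inter-class factors, so they are not reproduced here. The function name `tf_idf` and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (classic TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus: (class label, tokens). Not data from the paper.
corpus = [
    ("sports", ["goal", "match", "news"]),
    ("sports", ["goal", "team", "report"]),
    ("tech", ["chip", "code", "news"]),
    ("tech", ["chip", "data", "report"]),
]
docs = [tokens for _, tokens in corpus]
weights = tf_idf(docs)

# "goal" occurs only in the sports class, while "news" is spread across both
# classes, yet both appear in 2 of 4 documents and so get the same weight.
same_weight = weights[0]["goal"] == weights[0]["news"]
```

In this sketch the class labels never enter the computation, so a class-specific term and a class-neutral term with equal document frequency are indistinguishable. That is the gap the abstract says the class distribution factors are meant to close.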
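Keywords: Web document classification; feature selection; TFIDF algorithm; SVM
Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]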