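Affiliation: [1] School of Information Science and Engineering, Central South University, Changsha 410083, China
Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), 2016, No. 9, pp. 1299-1309 (11 pages)
Funding: National Natural Science Foundation of China, No. 61379109; Specialized Research Fund for the Doctoral Program of Higher Education, No. 20120162110077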
Abstract: As the volume of textual data is already very large and still growing rapidly, automatic text categorization (TC) is becoming increasingly important. Term weighting, i.e., computing the weights of the words used as text features, is an active research topic in TC because it directly affects classification accuracy. Entropy-based weighting (EW) methods have been found to be usually more effective than other methods. However, existing EW methods still have problems; for example, on some text corpora they may perform worse than the traditional TF-IDF (term frequency & inverse document frequency). This paper therefore proposes a new term weighting scheme, LTF-ECDP (logarithmic term frequency & entropy-based class distinguishing power), which combines logarithmic term frequency with a new entropy-based factor that measures a term's class distinguishing power. To test LTF-ECDP and compare it with other weighting methods, a series of TC experiments using a support vector machine (SVM) was carried out on three popular benchmark datasets: one Chinese corpus, TanCorp, and two English corpora, WebKB and 20 Newsgroups. In total, eight term weighting methods were evaluated. The experimental results show that LTF-ECDP outperforms the other five entropy-based weighting methods as well as two well-known methods, TF-IDF and TF-RF (term frequency & relevance frequency). It not only improves TC accuracy but also performs more consistently across the different datasets.
Classification code: TP391 [Automation and Computer Technology / Computer Application Technology]
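
The abstract describes the weighting idea only in words: a logarithmic term frequency multiplied by an entropy-based factor for class distinguishing power. The Python sketch below is one possible reading of that idea, not the formula from the paper, which this record does not give. It assumes the factor is one minus the normalized entropy of a term's document-frequency distribution over classes; the function names `class_distinguishing_power` and `ltf_entropy_weights` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def class_distinguishing_power(class_doc_freq, num_classes):
    """ASSUMED factor: 1 - normalized entropy of the term's document
    frequencies across classes. Close to 1 when the term is concentrated
    in one class, close to 0 when it is spread evenly over all classes."""
    total = sum(class_doc_freq.values())
    if total == 0 or num_classes < 2:
        return 0.0
    entropy = 0.0
    for df in class_doc_freq.values():
        if df > 0:
            p = df / total
            entropy -= p * math.log(p)
    return 1.0 - entropy / math.log(num_classes)


def ltf_entropy_weights(docs, labels):
    """Weight each term in each tokenized document as
    log(1 + tf) * class_distinguishing_power (an illustrative combination)."""
    num_classes = len(set(labels))
    # term -> class label -> number of documents of that class containing the term
    doc_freq = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term][label] += 1
    power = {
        term: class_distinguishing_power(counts, num_classes)
        for term, counts in doc_freq.items()
    }
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({t: math.log(1 + n) * power[t] for t, n in tf.items()})
    return weighted


# Toy labeled corpus (made up for illustration only).
docs = [
    ["ball", "game", "the"],
    ["game", "the", "the"],
    ["stock", "the"],
    ["stock", "market", "the"],
]
labels = ["sport", "sport", "finance", "finance"]
weights = ltf_entropy_weights(docs, labels)
```

In this toy corpus, "the" appears equally often in both classes, so its assumed factor is zero and it contributes nothing, while "ball" occurs in only one class and keeps its full logarithmic term frequency. The resulting per-document dictionaries could then be turned into feature vectors for an SVM classifier, as in the experiments the abstract mentions.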