Categories-related term weighting method based on term frequency  Cited by: 6


Authors: Zhang Ling[1], Lu Yuliang[1], Yang Guozheng[1]

Affiliation: [1] Department of Network, Electronic Engineering Institute, Hefei 230037

Source: Application Research of Computers (《计算机应用研究》), 2017, No. 2, pp. 386-391 (6 pages)

Abstract: In text classification, current research on term weighting has two shortcomings. First, term weighting algorithms based on document frequency usually ignore term frequency information when computing document frequency. Second, the relationship between terms and categories is not expressed accurately or adequately. To address these shortcomings, this paper proposes a new category-related term weighting algorithm based on term frequency (CDF-AICF). When measuring a term's weight, the algorithm takes into account the term's document frequency at each term-frequency value. To express the relationship between terms and categories accurately, the paper introduces two new concepts, category-related document frequency (CDF) and average inverse class frequency (AICF), which represent a term's expressive ability and its distinguishing ability with respect to categories, respectively. Finally, in classification experiments on three datasets that compare CDF-AICF with five other term weighting methods, CDF-AICF outperforms the other five.

Keywords: text classification; text representation; term weighting; document frequency; inverse class frequency

Classification No.: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
