Associative Rule-Based Text Categorization Method Using Category Similarity  Cited by: 8


Authors: Tian Feng (田丰)[1,2], Gui Xiaolin (桂小林)[1,2], Yang Pan (杨攀)[1,2], Wang Gang (王刚)[1,2,3], Guo Yuelong (郭岳龙)[1,2]

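Affiliations: [1] School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049; [2] Shaanxi Province Key Laboratory of Computer Network, Xi'an Jiaotong University, Xi'an 710049; [3] School of Information, Xi'an University of Finance and Economics, Xi'an 710100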

Source: Journal of Xi'an Jiaotong University, 2012, No. 12, pp. 6-11, 122 (7 pages)

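Funding: National Natural Science Foundation of China (60873071; 61172090); Major Special Project of the National High Technology Research and Development Program of China (2012ZX03002001-004)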

Abstract: Association rule-based classification methods consider only rule confidence at classification time and rely on rule pruning, which makes it hard to improve classifier accuracy further. To address this problem, an associative text categorization method that aggregates category similarity (AACS) is proposed. The method uses a modified χ² statistic to extract the feature terms of each category. To keep rule matching both accurate and fast, the classification rules are stored in a CR-tree, and algorithms for constructing and matching the CR-tree are given. The inner product is used to compute the similarity between the category sub-vector of a text and the category feature vector, and the aggregate of rule confidence and category similarity then serves as the basis for classification. Experiments on real web texts show that, with only 30 extracted feature terms, the method reaches a micro-average of 92.42%, which is better than the ARC-BC classifier without pruning (AWOPR) and the KNN and Bayes classifiers. Its classification time is on par with that of the unpruned ARC-BC classifier, indicating that the overhead of computing the similarity and the aggregate value is acceptable.
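The scoring idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the aggregation formula, the term weighting, or the CR-tree layout, so everything concrete here is an assumption made for illustration: the aggregate is taken to be a convex combination controlled by a hypothetical parameter `alpha`, the best matching rule per category supplies the confidence, and a linear scan over the rules stands in for CR-tree matching. The function names and the toy data are likewise hypothetical.

```python
def inner_product(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def category_similarity(doc_weights, feature_terms, category_vector):
    """Restrict the document to one category's feature terms (its
    category sub-vector) and take the inner product with that
    category's feature vector."""
    sub_vector = [doc_weights.get(term, 0.0) for term in feature_terms]
    return inner_product(sub_vector, category_vector)


def classify(doc_weights, rules, features, category_vectors, alpha=0.5):
    """Score each category by aggregating rule confidence and similarity.

    Assumption: aggregate = alpha * confidence + (1 - alpha) * similarity.
    The paper's actual aggregation function may differ.
    """
    doc_terms = set(doc_weights)
    best_conf = {}
    # Linear scan over rules; the paper uses a CR-tree for this step.
    for antecedent, category, confidence in rules:
        if antecedent <= doc_terms:  # every term of the rule occurs in the text
            best_conf[category] = max(confidence, best_conf.get(category, 0.0))
    scores = {}
    for category, conf in best_conf.items():
        sim = category_similarity(
            doc_weights, features[category], category_vectors[category]
        )
        scores[category] = alpha * conf + (1 - alpha) * sim
    label = max(scores, key=scores.get) if scores else None
    return label, scores


# Toy data, invented for illustration only.
features = {
    "sports": ["match", "team", "goal"],
    "finance": ["stock", "market", "goal"],
}
category_vectors = {
    "sports": [0.5, 0.5, 0.5],
    "finance": [0.6, 0.6, 0.2],
}
rules = [
    (frozenset({"match"}), "sports", 0.8),
    (frozenset({"goal"}), "finance", 0.9),
]
doc_weights = {"match": 0.6, "team": 0.6, "goal": 0.4}

label, scores = classify(doc_weights, rules, features, category_vectors)
confidence_only_label, _ = classify(
    doc_weights, rules, features, category_vectors, alpha=1.0
)
```

In this toy case the rule with the highest confidence points to "finance", while the aggregated score favors "sports" because the text shares far more weight with the sports feature terms. That is the kind of correction the aggregation is meant to provide, under the assumptions stated above.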

Keywords: text categorization; association rules; category similarity; aggregation

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
