Text Feature Extraction Method Based on Mutual Information and Association Rules  Cited by: 1


Authors: QU Xue-xin; ZHU Quan-yin; YAN Yun-yang [1,2]; LI Xiang

Affiliations: [1] School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, Sichuan 621010, China; [2] Faculty of Computer and Software Engineering, Huaiyin Institute of Technology, Huai'an, Jiangsu 223003, China

Source: Journal of Huaiyin Institute of Technology, 2018, No. 3, pp. 20-24 (5 pages)

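Funding: Jiangsu Province "Six Talent Peaks" Project (2013DZXX-023); Jiangsu Province "333 Project" (BRA2013208); Jiangsu Province Key Research and Development Program (BE2015127); Huai'an Industry-University-Research Collaborative Innovation Project (HAC201601)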

Abstract: To improve the performance of the traditional mutual information method in web page classification, the method is refined with respect to term frequency, inter-class distribution, and low-information features, and a text feature extraction method based on mutual information and association rules is proposed. First, the traditional mutual information method is improved by introducing term frequency and an inter-class balance factor, which keeps mutual information from inflating the scores of low-frequency features. Second, after feature extraction with the improved mutual information, association rules between low-information and high-information features are computed, and each low-information feature is replaced by the high-information feature of its corresponding rule with a probability equal to the rule's confidence. Finally, the sample set after replacement is vectorized. Experiments on the widely used Sogou and NetEase news data sets show that the method achieves better classification performance than the traditional mutual information method, with the F1 score improving by about 6% on average. Applied to web page classification, the improved mutual information method also performs well.
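The abstract does not give the paper's formulas, so the following Python is only a minimal sketch of the two ideas it names: scoring terms by mutual information, and replacing a low-information term with an associated high-information term with probability equal to the rule's confidence. It uses the standard pointwise mutual information computed from document frequencies, not the paper's improved variant with term frequency and an inter-class balance factor. The function names (`pmi_scores`, `rule_confidence`, `replace_low_features`) and the `rules` dictionary layout are hypothetical, chosen for illustration.

```python
import math
import random
from collections import Counter


def pmi_scores(docs, labels):
    """Score each term by max over classes of log(P(t,c) / (P(t) * P(c))).

    Probabilities are estimated from document frequencies. This is the
    standard pointwise form, NOT the paper's improved weighting.
    """
    n = len(docs)
    class_count = Counter(labels)
    term_df = Counter()
    joint_df = Counter()
    for doc, label in zip(docs, labels):
        for term in set(doc):
            term_df[term] += 1
            joint_df[term, label] += 1
    scores = {}
    for term, df in term_df.items():
        scores[term] = max(
            math.log(joint_df[term, c] * n / (df * class_count[c]))
            for c in class_count
            if joint_df[term, c] > 0
        )
    return scores


def rule_confidence(docs, low, high):
    """confidence(low -> high) = |docs with both terms| / |docs with low|."""
    with_low = [doc for doc in docs if low in doc]
    if not with_low:
        return 0.0
    return sum(high in doc for doc in with_low) / len(with_low)


def replace_low_features(docs, rules, rng):
    """Replace low-information terms according to `rules`.

    `rules` maps low_term -> (high_term, confidence); each occurrence is
    swapped with probability equal to the confidence (assumed layout).
    """
    replaced = []
    for doc in docs:
        new_doc = []
        for term in doc:
            if term in rules and rng.random() < rules[term][1]:
                new_doc.append(rules[term][0])
            else:
                new_doc.append(term)
        replaced.append(new_doc)
    return replaced


# Toy corpus (invented for illustration, not from the paper).
docs = [["football", "kickoff"], ["football", "kickoff"], ["football"],
        ["stock"], ["stock", "bond"]]
labels = ["sports", "sports", "sports", "finance", "finance"]

scores = pmi_scores(docs, labels)
conf = rule_confidence(docs, "kickoff", "football")
new_docs = replace_low_features(
    docs, {"kickoff": ("football", conf)}, random.Random(0)
)
```

In this toy corpus the less frequent term `kickoff` receives the same pointwise score as the more frequent `football`, because both occur only in one class. This illustrates why plain mutual information can overrate low-frequency terms, which is the weakness the abstract says the term-frequency factor addresses. The final vectorization step is omitted here.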

Keywords: mutual information; web page classification; association rules; text features

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
