Authors: QU Xue-xin (瞿学新), ZHU Quan-yin (朱全银), YAN Yun-yang (严云洋) [1,2], LI Xiang (李翔)
Affiliations: [1] School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, Sichuan 621010, China; [2] Faculty of Computer and Software Engineering, Huaiyin Institute of Technology, Huai'an, Jiangsu 223003, China
Source: Journal of Huaiyin Institute of Technology (《淮阴工学院学报》), 2018, No. 3, pp. 20-24 (5 pages)
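Funding: Jiangsu Province "Six Talent Peaks" Project (2013DZXX-023); Jiangsu Province "333 Project" (BRA2013208); Jiangsu Province Key Research and Development Program (BE2015127); Huai'an Industry-University-Research Collaborative Innovation Project (HAC201601)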
Abstract: To improve the performance of the traditional mutual information (MI) method in web page classification, this paper addresses its weaknesses with respect to term frequency, inter-class distribution, and low-information features, and proposes a text feature extraction method based on mutual information and association rules. First, the traditional MI method is modified by introducing term frequency and an inter-class balance factor, which keeps MI from inflating the scores of low-frequency features. Second, after feature extraction with the improved MI, association rules between low-information and high-information features are computed, and each low-information feature is replaced by the high-information feature of its corresponding rule with a probability equal to the rule's confidence. Finally, the sample set obtained after replacement is vectorized. Experiments on the widely used Sogou and NetEase news data sets show that the method classifies better than the traditional MI method, raising the F1 score by about 6% on average. Applied to web page classification, the improved MI method also performs well.
Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
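The replacement step described in the abstract (swap a low-information feature for a high-information one with probability equal to the rule's confidence) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the record does not give the improved MI formula or the rule-mining details, so the sketch takes the low- and high-information feature sets as given inputs, assumes confidence(l → h) is the fraction of documents containing l that also contain h, and assumes each low-information feature keeps only its highest-confidence rule. The function names `rule_confidences` and `replace_low_features` are hypothetical.

```python
import random
from collections import Counter


def rule_confidences(docs, low, high):
    """For each low-information term, pick the high-information term with the
    highest confidence, where (assumption) confidence(l -> h) is
    |docs containing both l and h| / |docs containing l|."""
    low_count = Counter()
    pair_count = Counter()
    for doc in docs:
        terms = set(doc)
        for l in terms & low:
            low_count[l] += 1
            for h in terms & high:
                pair_count[(l, h)] += 1

    best = {}
    for (l, h), n in pair_count.items():
        conf = n / low_count[l]
        # Keep only the strongest rule per low-information term (assumption).
        if l not in best or conf > best[l][1]:
            best[l] = (h, conf)
    return best


def replace_low_features(doc, rules, rng):
    """Replace each low-information term by its rule's high-information term
    with probability equal to the rule's confidence; keep it otherwise."""
    out = []
    for term in doc:
        if term in rules:
            h, conf = rules[term]
            out.append(h if rng.random() < conf else term)
        else:
            out.append(term)
    return out


# Toy corpus: every name below is made up for illustration.
docs = [
    ["cheap", "price", "buy"],
    ["cheap", "price"],
    ["misc", "goal"],
    ["cheap", "team"],
]
low = {"cheap", "misc"}
high = {"price", "goal", "team"}

rules = rule_confidences(docs, low, high)
rng = random.Random(0)  # seeded so repeated runs give the same replacements
replaced = [replace_low_features(d, rules, rng) for d in docs]
```

The replaced documents would then go through ordinary vectorization (for example term counts or TF-IDF), matching the final step in the abstract. Drawing the replacement at random, rather than always replacing, means a weak rule only occasionally rewrites a term.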