Text Feature Selection Based on Inclusion Degree and Frequent Pattern (cited by: 2)

Authors: CHI Yunxian (池云仙), ZHAO Shuliang (赵书良) [2], LI Renjie (李仁杰) [1]

Affiliations: [1] College of Resources and Environment Science, Hebei Normal University, Shijiazhuang, Hebei 050024, China; [2] College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang, Hebei 050024, China

Source: Journal of Chinese Information Processing (《中文信息学报》), 2018, No. 8, pp. 91-102 (12 pages)

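Funding: National Natural Science Foundation of China (71271067); National Social Science Fund of China (13&ZD091); Science and Technology Research Project of Hebei Higher Education Institutions (QN2014196)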

Abstract: In the big data era, the explosive growth of text data has made feature selection one of the most critical tasks in text mining. Documents contain a large and heterogeneous set of terms and patterns, so guaranteeing the quality of the mined features is challenging. Pattern-based feature selection has advantages that traditional term-based methods lack: it removes noise effectively and improves text mining performance. This paper proposes a text feature selection method based on inclusion degree and frequent patterns (TFSIDFP). First, a similarity measure for frequent patterns based on inclusion degree is defined. Second, an algorithm for filtering redundant text frequent patterns based on inclusion degree theory (FRFPIDT) is proposed: it measures the similarity between text frequent patterns by inclusion degree and removes sub-patterns as well as crossing patterns with high similarity, and this removal of redundant patterns improves the performance of text frequent pattern mining. Finally, a feature selection and weighting method based on correlation degree is proposed: features are selected from the non-redundant frequent patterns that remain after FRFPIDT filtering, and the correlation degree between terms and documents is used to partition terms into categories and assign weights, so that the relationship between the selected features and the documents is clearer and classification results are better. Experiments on the Reuters-21578 dataset show that TFSIDFP outperforms both the widely used traditional text feature selection methods and more recent feature selection and feature extraction methods.
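The abstract does not give the paper's formulas, so the following is only a minimal sketch of the redundancy-filtering idea, not the authors' FRFPIDT algorithm. It assumes that a pattern is a set of terms, that the inclusion degree of pattern A in pattern B is |A ∩ B| / |A|, and that a pattern is redundant when its inclusion degree in an already kept pattern reaches a threshold. The names `inclusion_degree`, `filter_redundant`, the greedy longest-first order, and the threshold value 0.75 are all hypothetical choices made for illustration.

```python
from typing import FrozenSet, Iterable, List

Pattern = FrozenSet[str]


def inclusion_degree(a: Pattern, b: Pattern) -> float:
    """Fraction of a's terms that also occur in b (assumed definition)."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


def filter_redundant(patterns: Iterable[Pattern],
                     threshold: float = 0.75) -> List[Pattern]:
    """Greedy filter (an assumption, not the paper's procedure).

    Patterns are visited longest first. A pattern is kept only if its
    inclusion degree in every already kept pattern is below `threshold`.
    A sub-pattern has inclusion degree 1.0 in its super-pattern, so it is
    always dropped; a crossing pattern is dropped when the overlap is high.
    """
    kept: List[Pattern] = []
    # Sort by length (descending), then by terms, for a deterministic order.
    ordered = sorted(set(patterns), key=lambda p: (-len(p), sorted(p)))
    for p in ordered:
        if all(inclusion_degree(p, q) < threshold for q in kept):
            kept.append(p)
    return kept


# Toy frequent patterns (made-up terms, not taken from Reuters-21578).
patterns = [
    frozenset({"oil", "price", "opec", "output", "cut"}),
    frozenset({"oil", "price"}),                            # sub-pattern
    frozenset({"oil", "price", "opec", "output", "quota"}),  # crossing, 4/5 overlap
    frozenset({"oil", "tanker", "gulf"}),                    # crossing, 1/3 overlap
    frozenset({"wheat", "export"}),                          # unrelated
]

kept = filter_redundant(patterns)
for p in kept:
    print(sorted(p))
```

In this toy run the sub-pattern and the highly overlapping crossing pattern are discarded, while the weakly overlapping and unrelated patterns survive. Lowering the threshold makes the filter more aggressive.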

Keywords: big data; text mining; text frequent patterns; inclusion degree; text feature selection

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
