Research of Text Classification Model Based on the Improved TF-IDF Feature Extraction  Cited by: 53


Authors: Zhou Yuan (周源)[1], Liu Huailan (刘怀兰)[2], Du Pengpeng (杜朋朋)[2], Liao Ling (廖岭)[2]

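Affiliations: [1] School of Public Policy and Management, Tsinghua University, Beijing 100084; [2] School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan, Hubei 430074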

Source: Information Science (《情报科学》), 2017, No. 5, pp. 111-118 (8 pages)

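Funding: National Natural Science Foundation of China (91646102; L1624045; L1624041; L1524015; 71203117); Humanities and Social Sciences Project of the Ministry of Education (16JDGC011)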

Abstract: [Purpose/significance] Feature extraction strongly affects classification performance, yet the traditional TF-IDF method considers neither the context of a feature word nor its distribution across classes. [Method/process] This paper proposes an improved TF-IDF feature extraction method: (1) it computes a node importance value based on a text network and an improved PageRank algorithm, which addresses traditional TF-IDF's neglect of text structure information; (2) it adds the variance of a feature word's IDF values to measure how concentrated the feature word w is across the text sets of different classes, which addresses traditional TF-IDF's neglect of the distribution of feature words between classes. [Result/conclusion] A text classification model is built on the improved method and evaluated in a classification experiment on 3D printing data. Comparing the classification results before and after the improvement verifies that the method effectively improves the accuracy of text feature word extraction.
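The abstract names the two ingredients but gives no formulas, so the following Python sketch is only one possible reading and not the authors' implementation. It assumes a sliding-window co-occurrence graph as the "text network", uses plain weighted PageRank in place of the paper's improved variant, computes a smoothed IDF inside each class and takes its variance across classes, and combines everything with a hypothetical product. All function names, the window size, the damping factor, and the combination formula are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_graph(docs, window=2):
    """Undirected word graph; edge weight = co-occurrence count within `window` tokens."""
    graph = defaultdict(Counter)
    for tokens in docs:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if v != w:
                    graph[w][v] += 1
                    graph[v][w] += 1
    return graph


def pagerank(graph, damping=0.85, iters=50):
    """Plain weighted PageRank by power iteration (NOT the paper's improved variant)."""
    nodes = list(graph)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {w: 1.0 / n for w in nodes}
    out_weight = {w: sum(graph[w].values()) for w in nodes}
    for _ in range(iters):
        rank = {
            w: (1 - damping) / n
            + damping * sum(rank[v] * graph[v][w] / out_weight[v] for v in graph[w])
            for w in nodes
        }
    return rank


def class_idf_variance(docs, labels):
    """Variance, across classes, of a word's smoothed IDF computed inside each class."""
    by_class = defaultdict(list)
    for tokens, label in zip(docs, labels):
        by_class[label].append(set(tokens))
    vocab = set().union(*(set(t) for t in docs))
    variance = {}
    for w in vocab:
        idfs = []
        for class_docs in by_class.values():
            df = sum(w in d for d in class_docs)
            idfs.append(math.log((1 + len(class_docs)) / (1 + df)))
        mean = sum(idfs) / len(idfs)
        variance[w] = sum((x - mean) ** 2 for x in idfs) / len(idfs)
    return variance


def improved_weights(docs, labels):
    """Hypothetical combination: TF * IDF * (1 + IDF variance) * (n * PageRank)."""
    n_docs = len(docs)
    df = Counter(w for tokens in docs for w in set(tokens))
    rank = pagerank(cooccurrence_graph(docs))
    var = class_idf_variance(docs, labels)
    n_nodes = max(len(rank), 1)
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({
            w: (c / len(tokens))                              # term frequency
            * (math.log((1 + n_docs) / (1 + df[w])) + 1)      # smoothed IDF
            * (1 + var[w])                                    # class-distribution factor
            * (n_nodes * rank.get(w, 1.0 / n_nodes))          # structural importance
            for w, c in tf.items()
        })
    return weights


# Toy data (invented for illustration; not the paper's 3D printing corpus).
docs = [
    ["printer", "nozzle", "resin"],
    ["printer", "nozzle", "filament"],
    ["printer", "market", "price"],
    ["printer", "market", "policy"],
]
labels = ["tech", "tech", "biz", "biz"]
doc_weights = improved_weights(docs, labels)
```

In this toy setup a word such as "printer", which appears in every document of every class, has zero IDF variance across classes, whereas a class-specific word such as "nozzle" has positive variance and is therefore boosted by the distribution factor. The resulting per-document weight dictionaries could be fed to any standard classifier as feature vectors.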

Keywords: feature extraction; TF-IDF; text classification; text network; PageRank

Classification code: G254 [Cultural Science - Library Science]

 
