Text Feature Selection and Clustering Analysis Based on Improved TFIDF Algorithm (Cited by: 1)


Authors: ZHAO Junyu [1]; CHAI Xiaoliang; LI Shilin; XU Songxiao; WANG Qiang

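Affiliations: [1] Baoding Power Supply Branch of State Grid Hebei Electric Power Co., Ltd., Shijiazhuang 050021, Hebei, China; [2] State Grid Hebei Electric Power Co., Ltd., Shijiazhuang 050021, Hebei, China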

Source: Microcomputer Applications, 2023, No. 10, pp. 181-183, 187 (4 pages)

Abstract: To improve feature selection on large volumes of text data, a full-coverage granular computing method is used to analyze the high dimensionality and sparsity of the data handled by feature selection algorithms. To address the shortcomings of the TFIDF algorithm, an improved algorithm, TFIDF_SP, is designed to distinguish the importance of feature words located in different parts of a document, and its effectiveness is judged by comparing the results of different feature selection methods. The results show that the bLDA topic model does not achieve an ideal clustering effect even when extracting fine-grained topics: feature words of the same topic are weakened and judged to be feature words of different topic types. The best clustering effect is obtained when γ equals 0.8, at which point the improved TFIDF algorithm further raises the weights. The proposed improved TFIDF algorithm obtains better results than TFIDF and the bLDA topic model, with clustering accuracy 1.62% higher, indicating that the part of speech and position of feature words significantly affect how well a document is represented.

Keywords: text feature selection; improved TFIDF algorithm; clustering effect; topic model

Classification number: TP39 [Automation and Computer Technology: Computer Application Technology]
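
The abstract says that TFIDF_SP distinguishes feature words by where they occur in a document, but it does not give the formula or define the role of γ. The sketch below is therefore only an illustration of the general idea, not the authors' method: it computes standard TF-IDF and a position-aware variant in which each title occurrence of a term counts `title_boost` times. The function names, the title/body split, and `title_boost` are all assumptions introduced here.

```python
import math
from collections import Counter


def tfidf(docs):
    """Standard TF-IDF for a list of token lists.

    tf = term count / document length, idf = log(N / document frequency).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


def position_weighted_tfidf(docs, title_boost=2.0):
    """Position-aware TF-IDF for a list of (title_tokens, body_tokens) pairs.

    Hypothetical weighting, NOT the paper's TFIDF_SP: each title occurrence
    contributes `title_boost` to the term count instead of 1.
    """
    n = len(docs)
    df = Counter(t for title, body in docs for t in set(title) | set(body))
    weights = []
    for title, body in docs:
        counts = Counter(body)
        for t in title:
            counts[t] += title_boost  # title occurrences weigh more
        total = sum(counts.values())
        weights.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights
```

With `title_boost=1.0` the variant reduces to plain TF-IDF over the concatenated title and body, which gives a simple baseline for comparing how a position weight changes the resulting feature vectors before clustering.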

 
