An Improved TF-IDF Text Clustering Method (一种改进型TF-IDF文本聚类方法)

Cited by: 17

Authors: ZHANG Lei (张蕾); JIANG Yu (姜宇) [2]; SUN Li (孙莉)

Affiliations: [1] Division of Development and Strategic Planning, Jilin University, Changchun 130012, China; [2] College of Computer Science and Technology, Jilin University, Changchun 130012, China

Source: Journal of Jilin University: Science Edition (《吉林大学学报（理学版）》), 2021, No. 5, pp. 1199-1204 (6 pages)

Funding: National Natural Science Foundation of China (Grant No. 62072211).

Abstract: The traditional term frequency-inverse document frequency (TF-IDF) algorithm has shortcomings when classifying texts with specific attributes; in particular, its accuracy is low when a word carries a special meaning within a specific category. To address this, we propose an improved TF-IDF text clustering algorithm. Comparative experiments were carried out on papers published by scientific research institutions in Jilin Province from 2015 to 2019. The improved and the traditional TF-IDF algorithms were each used to compute keyword frequencies in the papers, the results were clustered with the K-means++ algorithm, and a random forest algorithm was then used to evaluate the accuracy of each clustering. The experimental results show that the improved TF-IDF algorithm yields higher classification accuracy than the traditional one.
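The baseline half of the pipeline described in the abstract (traditional TF-IDF weighting followed by K-means++ clustering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the improved weighting formula, so only the classic TF-IDF definition is shown, and the random forest evaluation step is omitted. The function names `tfidf`, `sq_dist`, and `kmeans_pp`, as well as the toy keyword lists, are hypothetical and chosen only for this example.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """Classic TF-IDF: tf = count / document length, idf = log(N / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            [counts[t] / len(doc) * math.log(n / df[t]) for t in vocab]
        )
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp(vectors, k, seed=0, iters=20):
    """K-means with k-means++ seeding; returns one cluster label per vector."""
    rng = random.Random(seed)
    centers = [list(rng.choice(vectors))]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen center.
        weights = [min(sq_dist(v, c) for c in centers) for v in vectors]
        if sum(weights) == 0:
            # All points coincide with existing centers; fall back to a uniform pick.
            centers.append(list(rng.choice(vectors)))
        else:
            centers.append(list(rng.choices(vectors, weights=weights)[0]))
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest center for every vector.
        labels = [
            min(range(k), key=lambda j: sq_dist(v, centers[j])) for v in vectors
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy keyword lists (invented for illustration): two biology-like, two CS-like.
docs = [
    ["gene", "gene", "cell"],
    ["gene", "cell", "cell"],
    ["graph", "graph", "network"],
    ["graph", "network", "network"],
]
vocab, vectors = tfidf(docs)
labels = kmeans_pp(vectors, k=2)
print(vocab)
print(labels)  # the first two and the last two documents share a label
```

An improved weighting scheme would replace only the expression inside `tfidf`; the clustering step would stay the same, which is what makes a side-by-side comparison of the two weightings straightforward.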

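Keywords: term frequency-inverse document frequency (TF-IDF); hybrid clustering; interdisciplinary; Essential Science Indicators (ESI) database literature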

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]
