Text mining based on incremental fuzzy clustering algorithm

Cited by: 6

Authors: Geng Xinqing [1]; Wang Zhengou [2]

Affiliations: [1] College of Mathematics and Information Science, Anshan Normal University, Anshan 114007, Liaoning, China; [2] Institute of System Engineering, Tianjin University, Tianjin 300072, China

Source: Journal of Nanjing University of Science and Technology (《南京理工大学学报》), 2022, No. 5, pp. 579-585, 593 (8 pages)

Funding: National Natural Science Foundation of China (60275020).

Abstract: Traditional fuzzy clustering algorithms require the initial membership matrix to be determined in advance. To address this, the paper proposes a text mining method based on an incremental fuzzy clustering algorithm (FCLDA). First, the keywords in the text collection are ranked by number of occurrences, and the most frequent keywords are preferentially selected as the topics of the collection. Next, a latent Dirichlet allocation (LDA) topic model is used to build a matrix of document-topic probability distributions. This matrix serves as the membership matrix of the fuzzy C-means (FCM) clustering algorithm, and a weight is added to each membership value. During the FCLDA iterations, topic words are added and fuzzy information entropy is used as the criterion for determining the number of clusters: the number of clusters is fixed when the fuzzy information entropy reaches its minimum. Finally, FCLDA is applied to text mining of web pages. The experimental results show that, compared with the FCM algorithm and the K-nearest neighbor algorithm, FCLDA achieves higher clustering accuracy and runs faster, and is better suited to text with fuzzy characteristics.
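The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' FCLDA implementation. It assumes a document-topic matrix is already available (for example from an LDA model) and uses it only to seed standard fuzzy C-means. The function names `fcm_from_membership`, `fuzzy_entropy`, and `select_cluster_count` are hypothetical. The entropy used is Bezdek's partition entropy, standing in for the paper's fuzzy information entropy, whose exact definition is not given in the abstract. The per-membership weight mentioned in the abstract is omitted because its form is not specified there.

```python
import numpy as np


def fcm_from_membership(X, U0, m=2.0, n_iter=100, tol=1e-6):
    """Standard fuzzy C-means, started from a supplied membership matrix.

    X  : (n_docs, n_features) document vectors.
    U0 : (n_docs, n_clusters) initial memberships. Assumption: this is a
         document-topic distribution (e.g. from LDA), so rows sum to 1.
    m  : fuzzifier, must be > 1.
    """
    U = U0 / U0.sum(axis=1, keepdims=True)  # re-normalise rows defensively
    for _ in range(n_iter):
        W = U ** m
        # Cluster centers are membership-weighted means of the documents.
        centers = (W.T @ X) / (W.sum(axis=0)[:, None] + 1e-12)
        # Euclidean distance from every document to every center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)  # avoid division by zero
        # FCM update: u_ik is proportional to d_ik^(-2 / (m - 1)).
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers


def fuzzy_entropy(U):
    """Mean partition entropy of a membership matrix (lower = crisper).

    Assumption: used here as a stand-in for the paper's fuzzy
    information entropy.
    """
    U = np.clip(U, 1e-12, 1.0)
    return float(-(U * np.log(U)).sum(axis=1).mean())


def select_cluster_count(X, candidates):
    """Pick the cluster count whose final memberships have minimum entropy.

    candidates : dict mapping k -> (n_docs, k) initial membership matrix.
    """
    scores = {
        k: fuzzy_entropy(fcm_from_membership(X, U0)[0])
        for k, U0 in candidates.items()
    }
    return min(scores, key=scores.get), scores
```

In use, one would build one candidate document-topic matrix per topic count, pass them to `select_cluster_count`, and keep the count with the lowest entropy. Partition entropy is known to be biased with respect to the number of clusters, so the criterion actually used in the paper may behave differently.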

Keywords: Dirichlet distribution; topic model; fuzzy clustering; number of clusters; fuzzy information entropy; text clustering

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
