Authors: Geng Xinqing (耿新青) [1]; Wang Zhengou (王正欧) [2]
Affiliations: [1] College of Mathematics and Information Science, Anshan Normal University, Anshan 114007, Liaoning, China; [2] Institute of System Engineering, Tianjin University, Tianjin 300072, China
Source: Journal of Nanjing University of Science and Technology (《南京理工大学学报》), 2022, No. 5, pp. 579-585, 593 (8 pages)
Funding: National Natural Science Foundation of China (60275020).
Abstract: Traditional fuzzy clustering algorithms require the initial membership matrix to be determined in advance. To address this problem, the paper proposes a text mining method based on an incremental fuzzy clustering algorithm (FCLDA). First, the keywords in the text set are ranked by occurrence count, and the most frequent keywords are selected first as the topics of the text set. Next, a Latent Dirichlet Allocation (LDA) topic model is used to build a matrix of document-topic probability distributions. This matrix serves as the membership matrix of the fuzzy C-means (FCM) clustering algorithm, and a weight is added to each membership value. During the FCLDA iterations, topic words are added and fuzzy information entropy is used as the criterion for determining the number of clusters: the number is fixed when the fuzzy information entropy reaches its minimum. Finally, FCLDA is applied to text mining of web pages. The experimental results show that, compared with the FCM algorithm and the K-nearest neighbor algorithm, FCLDA achieves higher clustering accuracy and runs faster, and is better suited to text with fuzzy characteristics.
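The entropy-based choice of cluster number described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact definition of fuzzy information entropy or of the membership weight, so this example assumes the classical partition entropy H = -(1/N) Σ_i Σ_k u_ik ln u_ik and omits the weight entirely. The function names (`partition_entropy`, `fcm_centers`, `pick_cluster_number`) and the toy matrices are hypothetical, and the document-topic matrices are assumed to come from an LDA model fitted elsewhere.

```python
import math


def partition_entropy(U):
    """Partition entropy of a membership matrix U (one row per document,
    one column per cluster, each row summing to 1). Assumed definition,
    not necessarily the one used in the paper."""
    total = 0.0
    for row in U:
        for u in row:
            if u > 0.0:  # treat 0 * ln(0) as 0
                total -= u * math.log(u)
    return total / len(U)


def fcm_centers(X, U, m=2.0):
    """Standard FCM center update, seeded with a given membership matrix:
    v_k = sum_i u_ik^m * x_i / sum_i u_ik^m."""
    n_clusters = len(U[0])
    dim = len(X[0])
    centers = []
    for k in range(n_clusters):
        weights = [row[k] ** m for row in U]
        denom = sum(weights)
        centers.append(
            [sum(w * x[d] for w, x in zip(weights, X)) / denom for d in range(dim)]
        )
    return centers


def pick_cluster_number(candidates):
    """candidates maps a cluster count to its membership matrix.
    Returns the count whose matrix has the lowest partition entropy."""
    return min(candidates, key=lambda c: partition_entropy(candidates[c]))


# Toy document-topic matrices (made-up values, for illustration only).
doc_topic_2 = [[0.9, 0.1], [0.2, 0.8], [0.85, 0.15]]
doc_topic_3 = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.34, 0.33, 0.33]]
best = pick_cluster_number({2: doc_topic_2, 3: doc_topic_3})
```

With these toy matrices the two-topic memberships are much crisper than the three-topic ones, so the lower entropy selects two clusters. The paper's actual procedure adds topic words incrementally and may weight the memberships differently.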
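Keywords: latent Dirichlet allocation (LDA) topic model; fuzzy clustering; number of clusters; fuzzy information entropy; text clustering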
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]