Authors: ZHAO Junyu (赵军愉); CHAI Xiaoliang (柴小亮); LI Shilin (李士林); XU Songxiao (徐松晓); WANG Qiang (王强)
Affiliations: [1] Baoding Power Supply Branch of State Grid Hebei Electric Power Co., Ltd., Shijiazhuang 050021, Hebei, China [2] State Grid Hebei Electric Power Co., Ltd., Shijiazhuang 050021, Hebei, China
Source: Microcomputer Applications (《微型电脑应用》), 2023, No. 10, pp. 181-183, 187 (4 pages)
Abstract: To improve feature selection for large volumes of text data, a full-coverage granular computing method is used to analyze the high dimensionality and sparsity of the data handled by feature selection algorithms. To address the defects of the TFIDF algorithm, an improved TFIDF_SP algorithm is designed that distinguishes the importance of feature words located in different parts of a document, and its effectiveness is judged by comparing the results of different feature selection methods. The results show that the ideal clustering effect cannot be obtained even when the bLDA topic model is used to extract fine-grained topics, because feature words of the same topic are weakened and judged to be feature words of different topic types. The optimal clustering effect is obtained when γ equals 0.8, at which point the improved TFIDF algorithm further increases the weights. The proposed improved TFIDF algorithm achieves a clustering accuracy 1.62% higher than the combination of TFIDF and the bLDA topic model, indicating that changes in the part of speech and position of feature words significantly affect how well a document is represented.
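Keywords: text feature selection; improved TFIDF algorithm; clustering effect; topic model
Classification number: TP39 [Automation and Computer Technology: Computer Application Technology]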