Authors: 殷硕 (YIN Shuo); 王卫亚 (WANG Wei-ya)[1]; 柳有权 (LIU You-quan)[1] (School of Information Engineering, Chang'an University, Xi'an 710064, China)
Source: 《计算机技术与发展》 (Computer Technology and Development), 2020, No. 3, pp. 46-50 (5 pages)
Funding: Fundamental Research Funds for the Central Universities (310824173401).
Abstract: Text clustering based on the vector space model (VSM) suffers from excessively high vector dimensionality and a lack of semantic information, which degrades clustering quality. To address these problems, HowNet (《知网》) is introduced as a semantic dictionary, and the shortcomings of the word similarity algorithm are improved. The improved word semantic similarity algorithm is used to compress the text features semantically so that all feature words are topic-related, and an adjusted TF-IDF algorithm is used to weight the feature items, completing text feature extraction and reducing the dimension of the text representation model. During clustering, texts of the same class are assigned to the same cluster, and the semantic features of a cluster are extracted from the feature words of all texts in that cluster, so the representation model of a cluster has the same form as that of a text. The semantic similarity between clusters is then computed, clusters whose similarity exceeds a threshold are merged, and the cluster features are updated, repeating until the algorithm terminates. Experiments show that, compared with a clustering algorithm based on K-Means and VSM, the proposed algorithm greatly reduces the vector dimension and clearly improves the clustering results.
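Keywords: text clustering; semantic feature extraction; feature dimensionality reduction; text similarity; HowNet
Classification: TP301.6 [Automation and Computer Technology - Computer System Architecture]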
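
The threshold-based merging step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the HowNet-based similarity formula or the cluster feature update rule, so the sketch assumes plain cosine similarity over weighted feature words as a stand-in for semantic similarity, and assumes that merged clusters simply sum their feature weights. The names `cosine`, `merge_features`, and `merge_clusters` are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse feature dicts (word -> weight).

    Assumption: a stand-in for the paper's HowNet-based semantic similarity.
    """
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_features(a, b):
    """Combine two clusters' features by summing weights (assumed update rule)."""
    merged = dict(a)
    for term, weight in b.items():
        merged[term] = merged.get(term, 0.0) + weight
    return merged


def merge_clusters(docs, threshold, similarity=cosine):
    """Repeatedly merge the most similar pair of clusters above the threshold.

    docs: list of feature dicts, one per text.
    Returns a list of (member_indices, features) tuples.
    """
    # Each text starts as its own cluster; a cluster has the same
    # representation (a feature dict) as a single text.
    clusters = [([i], dict(f)) for i, f in enumerate(docs)]
    while True:
        best, pair = threshold, None
        for i, j in combinations(range(len(clusters)), 2):
            s = similarity(clusters[i][1], clusters[j][1])
            if s > best:  # only similarities strictly above the threshold qualify
                best, pair = s, (i, j)
        if pair is None:  # no pair exceeds the threshold: stop
            return clusters
        i, j = pair
        merged = (
            clusters[i][0] + clusters[j][0],
            merge_features(clusters[i][1], clusters[j][1]),
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
```

Passing a different `similarity` function (for example, one backed by a semantic dictionary) changes the merge criterion without touching the loop, which is the main reason the sketch takes it as a parameter.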