Author: CHEN Jie (陈洁) [1]
Affiliation: [1] School of Data Science and Information Technology, China Women's University, Beijing 100101, China
Source: Computer Science (《计算机科学》), 2023, Issue S01, pp. 211-216 (6 pages)
Funding: China Women's University Research Fund (ZKY200020228).
Abstract: To address the difficulty of semantically representing long news texts, an enhanced feature representation is constructed from Doc2Vec document embeddings and weighted word vectors. The DV-sim and DV-tfidf methods extract enhanced features from words of specific parts of speech in the head and tail of each document; each enhanced feature is then combined with the Doc2Vec document vector to form a new global representation. DV-sim takes a semantic view and derives each word's weight from the similarity between the feature word and the Doc2Vec vector. DV-tfidf takes a word-frequency view and derives each word's weight from term frequency-inverse document frequency. The HDBSCAN algorithm is then used for topic clustering on the THUCNews and Sogou datasets. Compared with using the Doc2Vec vector directly, DV-sim reduces the number of noise points on the two datasets by 60.82% and 60.63%, improves accuracy by 12.14% and 20.58%, and raises the F1-score by 15.61% and 11.58%, respectively. DV-tfidf reduces the number of noise points by 15.20% and 59.55%, improves accuracy by 10.85% and 17.93%, and raises the F1-score by 15.60% and 9.21%, respectively. The experiments show that both DV-sim and DV-tfidf improve topic clustering performance, and that the semantics-based enhanced feature outperforms the frequency-based one. DV-sim has also been applied effectively to topic clustering of news reports on outstanding women.
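Keywords: topic clustering; text representation; Doc2Vec; word vectors; HDBSCAN
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]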
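The DV-sim idea described in the abstract can be sketched roughly as follows. The abstract does not say how the word weights are normalized or how the enhanced feature is combined with the document vector, so this sketch assumes a weighted average of word vectors (negative similarities clipped to zero) followed by concatenation. The function names `cosine`, `weighted_average`, and `dv_sim_representation` are hypothetical, and the part-of-speech filtering and head/tail extraction steps are omitted. Passing TF-IDF scores as the weights in place of similarities would give a DV-tfidf-style variant under the same assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_average(vectors: Sequence[Vector], weights: Sequence[float]) -> List[float]:
    """Weighted mean of equally sized vectors; zero vector if weights sum to 0."""
    dim = len(vectors[0])
    total = sum(weights)
    if total == 0.0:
        return [0.0] * dim
    return [
        sum(w * vec[i] for vec, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


def dv_sim_representation(doc_vector: Vector, word_vectors: Sequence[Vector]) -> List[float]:
    """Concatenate the document vector with a similarity-weighted word feature.

    Assumption (not stated in the abstract): each feature word is weighted by
    its cosine similarity to the document vector, clipped at zero, and the
    result is appended to the document vector.
    """
    if not word_vectors:
        # No feature words: pad with zeros so the output size stays fixed.
        return list(doc_vector) + [0.0] * len(doc_vector)
    weights = [max(0.0, cosine(w, doc_vector)) for w in word_vectors]
    enhanced = weighted_average(word_vectors, weights)
    return list(doc_vector) + enhanced


if __name__ == "__main__":
    doc = [1.0, 0.0]
    words = [[1.0, 0.0], [0.0, 1.0]]  # the second word is orthogonal to doc
    print(dv_sim_representation(doc, words))
```

In this toy input, the word orthogonal to the document vector receives zero weight, so the enhanced feature is driven entirely by the word aligned with the document. The resulting vectors would then be passed to a clustering algorithm such as HDBSCAN.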