A Keyword Extraction Method Based on Semantic Clustering (基于语义聚类的关键词抽取方法)

Cited by: 3

Authors: LI Xu-hui (李旭晖) [1,2]; ZHOU Yi (周怡) [1]

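Affiliations: [1] School of Information Management, Wuhan University, Wuhan, Hubei 430072, China; [2] Big Data Research Institute, Wuhan University, Wuhan, Hubei 430072, China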

Source: Information Science (《情报科学》), 2022, No. 3, pp. 99-108 (10 pages)

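Funding: Key support project "Value Analysis, Discovery and Collaborative Creation Mechanism of Financial Big Data Based on Knowledge Association" (91646206) under the National Natural Science Foundation of China Major Research Plan "Big Data-Driven Management and Decision Research"; fund of the 中证信用-Wuhan University Joint Laboratory of Credit Technology; support from the Wuhan University National Experimental Teaching Demonstration Center for Library and Information Science.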

Abstract:
【Purpose/significance】The essence of keyword extraction is to find the words that express the core semantic information of a document, so analyzing semantics rather than surface words better matches practical needs. Building on the TextRank word-graph model, this paper proposes a keyword extraction method based on semantic clustering, in which semantic categories replace individual words in the analysis.
【Method/process】First, word vectors trained with HowNet sememe information are clustered so that words with similar meanings are grouped together and each word is assigned a semantic category. Then, the window co-occurrence frequency of the semantic categories to which the words belong is used as the transition probability between words to compute node scores. Finally, the TF-IDF value and the node score are combined by weighted summation to correct the keyword extraction results.
【Result/conclusion】Overall, the proposed method improves extraction quality: compared with the TextRank algorithm, precision P, recall R and F value improve by 12.66%, 13.77% and 13.16%, respectively.
【Innovation/limitation】This paper analyzes the relevance network at the semantic level, using semantics in place of words and measuring word weights through semantic clustering; this idea can be combined with existing word-based methods. The paper also introduces, for the first time, word vectors that incorporate HowNet sememe information into keyword extraction, which may serve as a reference for other natural language processing work. The limitation is that the method relies on the Chinese knowledge base HowNet, so it is only suitable for Chinese text and does not generalize across languages.
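The three steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the window size, the damping factor, the mixing weight `alpha`, the max-normalization, and all function names are assumptions made for illustration. The clustering step is replaced by a hand-written `word2cat` mapping that stands in for clusters of HowNet-enhanced word vectors, and the TF-IDF values are toy numbers.

```python
from collections import defaultdict


def category_cooccurrence(tokens, word2cat, window=3):
    """Count co-occurrences of semantic categories inside a sliding window."""
    counts = defaultdict(float)
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + window]:
            if left == right:
                continue
            ca, cb = word2cat[left], word2cat[right]
            counts[(ca, cb)] += 1.0
            if ca != cb:
                counts[(cb, ca)] += 1.0  # keep the counts symmetric
    return counts


def semantic_textrank(tokens, word2cat, window=3, damping=0.85, iters=50):
    """TextRank-style scores whose edge weights come from category co-occurrence."""
    counts = category_cooccurrence(tokens, word2cat, window)
    words = sorted(set(tokens))
    # Assumed edge weight: co-occurrence count of the two words' categories.
    weight = {
        u: {v: counts.get((word2cat[u], word2cat[v]), 0.0) for v in words if v != u}
        for u in words
    }
    out_sum = {u: sum(weight[u].values()) for u in words}
    score = {w: 1.0 / len(words) for w in words}
    for _ in range(iters):
        new_score = {}
        for v in words:
            incoming = sum(
                score[u] * weight[u][v] / out_sum[u]
                for u in words
                if u != v and out_sum[u] > 0
            )
            new_score[v] = (1 - damping) / len(words) + damping * incoming
        score = new_score
    return score


def normalize(values):
    """Scale values so that the largest one equals 1."""
    top = max(values.values())
    return {k: (v / top if top > 0 else 0.0) for k, v in values.items()}


def combine(scores, tfidf, alpha=0.5):
    """Weighted sum of normalized node scores and normalized TF-IDF values."""
    s, t = normalize(scores), normalize(tfidf)
    return {w: alpha * s[w] + (1 - alpha) * t.get(w, 0.0) for w in s}


# Toy data (hypothetical): a tokenized document, cluster ids, and TF-IDF values.
tokens = ["keyword", "extraction", "semantic", "cluster",
          "keyword", "term", "semantic", "graph"]
word2cat = {"keyword": 0, "term": 0, "extraction": 1,
            "semantic": 2, "cluster": 3, "graph": 3}
tfidf = {"keyword": 0.40, "extraction": 0.30, "semantic": 0.35,
         "cluster": 0.20, "term": 0.10, "graph": 0.15}

scores = semantic_textrank(tokens, word2cat)
final = combine(scores, tfidf)
top_keywords = sorted(final, key=final.get, reverse=True)[:3]
print(top_keywords)
```

Because edge weights depend on categories rather than on the words themselves, two words can be linked even if they never appear in the same window, as long as their categories do. That is one plausible reading of "using semantics instead of words"; the paper's actual weighting may differ.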

Keywords: keyword extraction; word vectors; semantics; TextRank; clustering

Classification number: G254 [Cultural Science: Library Science]
