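Authors: 姚奕 (YAO Yi); 杨帆 (YANG Fan) (College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China)
Affiliation: [1] College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007
Source: Computer Science (《计算机科学》), 2022, No. 10, pp. 243-251 (9 pages)
Funding: Military postgraduate funding project (JY2019C078).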
Abstract: Keywords represent the theme of a text and condense its concepts and content. Through keywords, readers can quickly grasp the gist of a document, which improves the efficiency of information retrieval; keyword extraction can also support automatic summarization and text classification. In recent years, research on automatic keyword extraction has attracted wide attention, but extracting a document's keywords accurately remains a challenge. On the one hand, keywords reflect subjective understanding, so judging whether a word is a keyword is itself subjective. On the other hand, Chinese words often carry rich semantic information, and it is difficult to capture the main idea of a text accurately by relying solely on traditional statistical and topic features. To address the low accuracy, information redundancy, and missing information in Chinese keyword extraction, this paper proposes an unsupervised keyword extraction method that combines a knowledge graph with a pre-trained model. First, the pre-trained model is used for topic clustering, and a sentence-level clustering method ensures that the finally selected keywords cover the content of the whole text. At the same time, the knowledge graph is used for entity linking to achieve accurate word segmentation and disambiguation. Next, a semantic word graph is constructed from the topic information and used to compute the semantic weights between words. Finally, keywords are ranked by a weighted PageRank algorithm. Comparative experiments are conducted on two public datasets, DUC 2001 and CSL, and on a separately annotated CLTS dataset, with precision, recall, and F1 score as metrics. The results show that the proposed model improves precision over several baseline methods; on the CLTS dataset, its F1 score is 9.14% higher than that of the traditional statistical method TF-IDF and 4.82% higher than that of the traditional graph-based method TextRank.
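The final ranking step named in the abstract, weighted PageRank over a word graph, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `weighted_pagerank`, the damping factor of 0.85, and the edge weights below are all assumptions made for illustration, whereas the paper derives its weights from topic information and a knowledge graph.

```python
from collections import defaultdict


def weighted_pagerank(edges, damping=0.85, iterations=50, tol=1e-8):
    """Rank the nodes of an undirected weighted word graph.

    edges: dict mapping (word_a, word_b) -> positive weight.
    The weights stand in for semantic weights (hypothetical values here).
    """
    # Build a symmetric adjacency map: neighbors[a][b] = edge weight.
    neighbors = defaultdict(dict)
    for (a, b), w in edges.items():
        neighbors[a][b] = neighbors[a].get(b, 0.0) + w
        neighbors[b][a] = neighbors[b].get(a, 0.0) + w

    nodes = sorted(neighbors)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    # Total edge weight leaving each node, used to normalize its votes.
    out_weight = {node: sum(neighbors[node].values()) for node in nodes}

    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            # Each neighbor passes on a share of its score that is
            # proportional to the weight of the connecting edge.
            incoming = sum(
                scores[nb] * w / out_weight[nb]
                for nb, w in neighbors[node].items()
            )
            new_scores[node] = (1.0 - damping) / n + damping * incoming
        delta = sum(abs(new_scores[x] - scores[x]) for x in nodes)
        scores = new_scores
        if delta < tol:
            break
    return scores


# Hypothetical word graph; the weights are made up for illustration.
example_edges = {
    ("keyword", "extraction"): 3.0,
    ("keyword", "graph"): 2.0,
    ("graph", "knowledge"): 1.0,
    ("keyword", "knowledge"): 1.0,
}
ranked = sorted(weighted_pagerank(example_edges).items(),
                key=lambda item: item[1], reverse=True)
```

Words joined by heavier edges pass more of their score to each other, so `ranked` lists candidate words from most to least central in the graph.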
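Keywords: keyword extraction; knowledge graph; sentence embedding; clustering; graph algorithm; pre-trained model
Classification No.: TP391 [Automation and Computer Technology: Computer Application Technology]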