Unsupervised keyword extraction model for Chinese single text based on BERT model  Cited by: 2

Authors: GU Chun; YU Chenghai[1]; YU Yang; GUAN Weiwei (School of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China)

Affiliation: [1] School of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China

Source: Journal of Zhejiang Sci-Tech University (Natural Sciences), 2022, No. 3, pp. 424-432 (9 pages)

Funding: Key Research and Development Program of Zhejiang Province (2020C03094).

Abstract: Existing methods tend to ignore semantic information and to extract several keywords with nearly the same meaning. To address these problems, an unsupervised keyword extraction model for Chinese single texts is proposed, based on the bidirectional encoder representations from transformers (BERT) model. The model first preprocesses the input text to select candidate words, then uses the hidden layer of the BERT model to obtain a vector for each candidate word that incorporates full-text information, and then applies a clustering layer to filter out semantically redundant candidates. Finally, it obtains a semantic vector for the full text, scores each candidate word by its semantic similarity to the full text, and extracts keywords after ranking by score. Experiments on relatively short texts, such as abstracts of Chinese papers on mixed topics, show that when 5 and 8 keywords are extracted the model reaches an accuracy of 34.21% and 26.34%, respectively, outperforming traditional extraction models such as TextRank and TF-IDF. These results indicate that fusing semantic information improves keyword extraction accuracy for Chinese single texts and reduces the repeated extraction of semantically similar keywords, raising the overall quality of the extracted keywords.
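The stages listed in the abstract (candidate vectors, clustering to drop semantic duplicates, similarity scoring against the full text, ranking) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the clustering algorithm, the similarity measure, or how one word is kept per cluster, so the sketch assumes greedy threshold clustering, cosine similarity, and keeping the member closest to the document vector. The function names, the `dup_threshold` parameter, and the hand-made toy vectors (standing in for BERT hidden-layer outputs) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_candidates(vectors, dup_threshold):
    """Greedy clustering (an assumption, not the paper's stated method).

    Each word joins the first cluster whose first member is at least
    `dup_threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []
    for word, vec in vectors.items():
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= dup_threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def extract_keywords(vectors, doc_vec, top_k=5, dup_threshold=0.95):
    """Keep one word per cluster, then rank the survivors by similarity to the text."""
    scores = {word: cosine(vec, doc_vec) for word, vec in vectors.items()}
    # Assumed rule: the representative of a cluster is its best-scoring member.
    representatives = [
        max(cluster, key=scores.get)
        for cluster in cluster_candidates(vectors, dup_threshold)
    ]
    ranked = sorted(representatives, key=scores.get, reverse=True)
    return ranked[:top_k]


# Toy 3-dimensional vectors standing in for BERT word vectors (hypothetical data).
toy_vectors = {
    "keyword extraction": [0.90, 0.10, 0.00],
    "keyword extracting": [0.88, 0.12, 0.02],  # near-duplicate of the first entry
    "BERT": [0.60, 0.60, 0.10],
    "weather": [0.00, 0.10, 0.90],             # unrelated to the document
}
toy_doc_vec = [0.80, 0.30, 0.05]  # stand-in for the full-text semantic vector

if __name__ == "__main__":
    print(extract_keywords(toy_vectors, toy_doc_vec, top_k=2))
```

In a real pipeline the toy vectors would be replaced by vectors taken from a pretrained Chinese BERT model, and `dup_threshold` would need tuning on held-out data, since a threshold that is too low merges distinct keywords and one that is too high lets duplicates through.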

Keywords: keyword extraction; unsupervised; BERT model; text vectorization; single text

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
