Automatic Extraction Model of Multi-document Summarization Based on Improved TextRank Algorithm


Authors: WANG Nan (王楠); ZENG Man-ling (曾曼玲) [1] (School of Management Science and Information Engineering, Jilin University of Finance and Economics; Institute of Economic Information Management, Jilin University of Finance and Economics, Changchun 130117, China)

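Affiliations: [1] School of Management Science and Information Engineering, Jilin University of Finance and Economics; [2] Institute of Economic Information Management, Jilin University of Finance and Economics, Changchun 130117, Jilin, China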

Source: Software Guide (《软件导刊》), 2023, No. 5, pp. 1-6 (6 pages)

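Funding: Key Project of Higher Education Teaching Reform Research of Jilin Province (JLJY202269718747); "13th Five-Year Plan" Social Science Research Project of the Education Department of Jilin Province (JJKH20230195SK); National Social Science Fund of China (22BTQ048).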

Abstract: Multi-document automatic summarization uses natural language processing to extract overview information from multiple documents on the same topic. It can effectively alleviate information overload and help users grasp the core content of the source texts quickly and accurately. Targeting the characteristics of Chinese text, this paper constructs a multi-document extractive summarization model based on an improved TextRank algorithm. First, a Word2Vec word vector model pre-trained on the Chinese Wikipedia corpus is combined with the SIF method to obtain sentence vectors for all sentences in the documents. Then, cosine similarity is used to construct the edges between sentences in the TextRank graph. Finally, the MMR algorithm removes redundancy among the candidate summary sentences, yielding a summary that is both comprehensive and diverse. The model is evaluated with ROUGE metrics. Experimental results show ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.549, 0.322, and 0.357, respectively, all higher than those of the traditional TextRank method and of a Word2vec (trained on the experimental sample corpus) + TextRank + MMR model, indicating higher summary quality.
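The three stages named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the SIF smoothing constant `a`, the damping factor `d`, the MMR trade-off `lam`, and the normalized form of the TextRank update are all assumptions chosen for illustration, since the record does not state the paper's settings. Word vectors and word frequencies are passed in as plain dictionaries rather than loaded from a trained Word2Vec model.

```python
import numpy as np

def sif_sentence_vectors(sentences, word_vecs, word_freq, a=1e-3):
    """SIF-style sentence vectors (hypothetical helper).
    sentences: list of token lists; word_vecs: token -> vector;
    word_freq: token -> relative frequency p(w); a: assumed smoothing constant."""
    dim = len(next(iter(word_vecs.values())))
    rows = []
    for tokens in sentences:
        known = [t for t in tokens if t in word_vecs]
        if not known:
            rows.append(np.zeros(dim))
            continue
        # Down-weight frequent words: weight = a / (a + p(w)).
        weights = np.array([a / (a + word_freq.get(t, 0.0)) for t in known])
        mat = np.stack([word_vecs[t] for t in known])
        rows.append(weights @ mat / len(known))
    X = np.vstack(rows)
    # Remove each vector's projection onto the first right singular vector.
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - np.outer(X @ u, u)

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty sentences
    U = X / norms
    return U @ U.T

def textrank(sim, d=0.85, tol=1e-6, max_iter=100):
    """Weighted TextRank scores by power iteration (normalized variant)."""
    W = np.clip(sim, 0.0, None)   # keep only non-negative edge weights
    np.fill_diagonal(W, 0.0)      # no self-loops
    col_sums = W.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    P = W / col_sums              # each column sums to 1 (or 0 if isolated)
    n = len(W)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * (P @ r)
        converged = np.abs(r_new - r).sum() < tol
        r = r_new
        if converged:
            break
    return r

def mmr_select(scores, sim, k, lam=0.7):
    """Greedy MMR: trade sentence score against similarity to chosen ones."""
    top = scores.max()
    s = scores / top if top > 0 else scores  # rescale scores to [0, 1]
    selected, candidates = [], list(range(len(scores)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            return lam * s[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

In use, one would tokenize the sentences of all documents, call `sif_sentence_vectors`, pass the result through `cosine_matrix` and `textrank`, and then call `mmr_select` to pick `k` summary sentences. A lower `lam` penalizes redundancy more strongly, which matters in the multi-document setting where near-duplicate sentences across documents are common.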

Keywords: multi-document summarization; extractive summarization; TextRank algorithm; Word2Vec; SIF

Classification code: TP391 [Automation and Computer Technology - Computer Application Technology]

 
