An Automatic Text Summarization Algorithm Based on Clause Extraction

Cited by: 1

Authors: ZHU Bingbing; LUO Fei [1]; LUO Yongjun [1]; DING Weichao; HUANG Hao (School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China)

Affiliation: [1] School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237

Source: Journal of East China University of Science and Technology (Natural Science Edition), 2024, No. 1, pp. 114-120 (7 pages)

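Funding: Natural Science Foundation of Shanghai (22ZR1416500); Shanghai Sailing Program for Young Science and Technology Talents (20YF1410900); Shanghai 2021 "Science and Technology Innovation Action Plan" project in the field of the Yangtze River Delta Science and Technology Innovation Community (21002411000).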

Abstract: With the volume of information growing exponentially, automatic summarization lets readers obtain useful content in a short time. A key problem is how to extract the key information from redundant, unstructured long text while keeping the extracted result concise and fluent. The TextRank algorithm and improved variants such as SWTextRank are widely used for extractive summarization, but they do not effectively solve the redundancy problem of extractive summaries. This paper therefore proposes PTextRank, an automatic text summarization algorithm based on clause extraction. First, the text is preprocessed and split into sentences, each sentence is syntactically annotated using Sinica Treebank (STB), and the extraction units are set at the clause level. Next, BERT (Bidirectional Encoder Representations from Transformers) is used to build feature vectors for the title and for each clause, and the similarity between clause feature vectors is computed and stored in a similarity matrix. Finally, the clause similarity matrix is adjusted using factors such as clause position and the similarity between each clause and the title, the scores are computed iteratively until convergence, and the highest-scoring clauses are selected as the final summary. Experiments and analysis show that PTextRank effectively avoids the redundant information spread across multiple sentences and that, compared with the traditional TextRank and the improved SWTextRank, the accuracy of its generated summaries improves by at least 6%, with better summary quality. Because PTextRank uses clauses as extraction units, it starts from a finer-grained unit and so avoids redundancy across sentences.
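The ranking stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the adjustment formula, so the weighting used here (multiplying each edge by `(1 + title similarity) * (1 + 1/(1 + position))`), the damping factor 0.85, and the function names `rank_clauses` and `select_summary` are all assumptions made for illustration. The clause and title vectors are taken as plain arrays; in the paper they would come from BERT, and the clauses themselves from an STB-based parse.

```python
import numpy as np


def cosine_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    v = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    unit = v / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def rank_clauses(clause_vecs, title_vec, damping=0.85, tol=1e-6, max_iter=200):
    """TextRank-style scores over clauses (hypothetical weighting scheme)."""
    clause_vecs = np.asarray(clause_vecs, dtype=float)
    n = len(clause_vecs)

    # Clause-to-clause similarity matrix, without self-loops.
    sim = np.clip(cosine_matrix(clause_vecs), 0.0, None)
    np.fill_diagonal(sim, 0.0)

    # Assumed adjustment: favour clauses close to the title and early in the text.
    stacked = np.vstack([title_vec, clause_vecs])
    title_sim = np.clip(cosine_matrix(stacked)[0, 1:], 0.0, None)
    position = 1.0 / (1.0 + np.arange(n))
    weight = (1.0 + title_sim) * (1.0 + position)

    # Edge j -> i is scaled by the weight of the target clause i.
    adjusted = sim * weight[np.newaxis, :]
    row_sums = adjusted.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Row-normalise; clauses with no neighbours spread their score uniformly.
    transition = np.where(row_sums > 0, adjusted / safe_sums, 1.0 / n)

    # Power iteration until the scores stop changing.
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = (1.0 - damping) / n + damping * (transition.T @ scores)
        converged = np.abs(updated - scores).sum() < tol
        scores = updated
        if converged:
            break
    return scores


def select_summary(clauses, scores, k=2):
    """Return the k highest-scoring clauses in their original order."""
    top = sorted(int(i) for i in np.argsort(scores)[::-1][:k])
    return [clauses[i] for i in top]


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for BERT embeddings.
    toy_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    toy_scores = rank_clauses(toy_vecs, title_vec=[1.0, 0.0])
    print(select_summary(["clause A", "clause B", "clause C"], toy_scores, k=2))
```

Scaling each edge by the weight of its target clause is one simple way to fold position and title similarity into the graph before iterating; other schemes, such as adding the weights as a personalised restart vector, would fit the abstract's description equally well.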

Keywords: TextRank; summary extraction; redundancy handling; Sinica Treebank; discourse structure

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
