Authors: WANG Nan (王楠); ZENG Man-ling (曾曼玲) [1] (School of Management Science and Information Engineering, Jilin University of Finance and Economics; Institute of Economic Information Management, Jilin University of Finance and Economics, Changchun 130117, China)
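Affiliations: [1] School of Management Science and Information Engineering, Jilin University of Finance and Economics; [2] Institute of Economic Information Management, Jilin University of Finance and Economics, Changchun 130117, Jilin, China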
Source: Software Guide (《软件导刊》), 2023, No. 5, pp. 1-6 (6 pages)
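Funding: Jilin Province Higher Education Teaching Reform Research Key Project (JLJY202269718747); Jilin Provincial Department of Education "13th Five-Year" Social Science Research Project (JJKH20230195SK); National Social Science Fund of China (22BTQ048).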
Abstract: Multi-document automatic summarization uses natural language processing to extract overview information from multiple documents on the same topic. It can effectively alleviate information overload and helps users grasp the core content of the source texts quickly and accurately. Targeting the characteristics of Chinese text, this paper builds a multi-document extractive summarization model based on an improved TextRank algorithm. First, a Word2Vec word-vector model pre-trained on the Chinese Wikipedia corpus is combined with the SIF method to obtain a sentence vector for every sentence in the documents. Next, cosine similarity is used to construct the edges between sentences in the TextRank graph. Finally, the MMR algorithm removes redundancy among the candidate summary sentences, yielding a summary that is both comprehensive and diverse. The model is evaluated with the ROUGE-N metrics. Experimental results show that the proposed model reaches ROUGE-1, ROUGE-2, and ROUGE-L values of 0.549, 0.322, and 0.357 respectively, all better than the traditional TextRank method and a Word2vec (experimental sample corpus) + TextRank + MMR model, indicating higher summary quality.
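The pipeline described in the abstract (SIF sentence vectors, a cosine-similarity sentence graph, TextRank scoring, then MMR selection) could be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the `word_vecs` and `word_freq` dictionaries (standing in for a pre-trained Word2Vec model and corpus word counts), and the parameter values `a=1e-3`, `d=0.85`, and `lam=0.7` are all assumptions made for the example.

```python
import numpy as np

def sif_sentence_vectors(sentences, word_vecs, word_freq, a=1e-3):
    """SIF: average word vectors weighted by a / (a + p(w)), then remove
    the projection onto the first singular vector. Inputs are assumed:
    tokenized sentences, a word->vector dict, a word->count dict."""
    dim = len(next(iter(word_vecs.values())))
    total = sum(word_freq.values())
    rows = []
    for tokens in sentences:
        known = [t for t in tokens if t in word_vecs]
        if not known:
            rows.append(np.zeros(dim))
            continue
        weights = np.array([a / (a + word_freq.get(t, 0) / total) for t in known])
        vecs = np.array([word_vecs[t] for t in known], dtype=float)
        rows.append(weights @ vecs / len(known))
    X = np.vstack(rows)
    u = np.linalg.svd(X, full_matrices=False)[2][0]  # first right singular vector
    return X - np.outer(X @ u, u)

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty sentences
    U = X / norms
    return U @ U.T

def textrank(sim, d=0.85, iters=200, tol=1e-8):
    """Power iteration on the similarity graph; returns one score per sentence."""
    W = np.clip(sim, 0.0, None)      # keep only non-negative edge weights
    np.fill_diagonal(W, 0.0)         # no self-loops
    col = W.sum(axis=0)
    col[col == 0] = 1.0
    P = W / col                      # column-normalized transition weights
    n = len(W)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = (1 - d) / n + d * (P @ r)
        converged = np.abs(new - r).sum() < tol
        r = new
        if converged:
            break
    return r

def mmr_select(scores, sim, k, lam=0.7):
    """Greedy MMR: trade off TextRank relevance against similarity
    to sentences that have already been selected."""
    rel = np.asarray(scores, dtype=float)
    rel = rel / rel.max()            # put relevance on the same scale as cosine
    selected, candidates = [], list(range(len(rel)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

A caller would chain the steps as `X = sif_sentence_vectors(...)`, `S = cosine_matrix(X)`, `r = textrank(S)`, `mmr_select(r, S, k)`, and then restore the chosen sentences to document order. Lowering `lam` penalizes near-duplicate sentences more strongly, which is the redundancy-handling role the abstract assigns to MMR.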
Keywords: multi-document summarization; extractive summarization; TextRank algorithm; Word2Vec; SIF
Classification: TP391 [Automation and Computer Technology: Computer Application Technology]