Multi-document Automatic Summarization Based on Sparse Representation

Authors: QIAN Ling-long, WU Jiao, WANG Ren-feng, LU Hui-juan [1] (China Jiliang University, Hangzhou 310018, China)

Affiliation: [1] China Jiliang University, Hangzhou 310018, China

Source: Computer Science (《计算机科学》), 2020, Issue S02, pp. 97-105 (9 pages)

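Funding: National Natural Science Foundation of China (61272315, 61602431); Natural Science Foundation of Zhejiang Province (LQ20F030015); National Undergraduate Innovation and Entrepreneurship Training Program, project "Intelligent Reading Model Based on Natural Language Processing" (201810356020).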

Abstract: Automatic document summarization is an important task in natural language processing. Because document semantics are difficult to understand accurately, most methods rank sentences by importance using hand-crafted features such as word frequency and keywords, and extract the summary from that ranking. Inspired by sparse representation theory, this paper proposes a dynamic semantic space partition algorithm based on sparse representation. The algorithm performs dictionary learning on each initially partitioned semantic subspace, uses the learned dictionaries to sparsely reconstruct every sentence vector, and dynamically reassigns each sentence vector to the partition with the smallest reconstruction error, thereby repartitioning the semantic space iteratively. To extract summary sentences within each resulting semantic subspace, an automatic summary extraction algorithm based on sparse similarity ranking is proposed. All sentence vectors in a semantic subspace are treated as dictionary atoms; sparse reconstruction then yields a sparse similarity that reflects how well one sentence semantically represents the others. The cumulative sparse similarity of each sentence serves as a measure of its ability to represent the semantic information of the subspace, and the top N sentences in this ranking are extracted as the summary. Experiments on a data set of travel reviews of popular attractions from the TripAdvisor website show that the semantic space reconstruction error converges stably after only 5 iterations, with an average reduction of about 17%, and that the algorithm is insensitive to data dimensionality. The extracted summaries avoid repeatedly selecting redundant, highly repetitive text, indicating that the approach is an effective multi-document automatic summarization method.
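The sparse similarity ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not name a sparse solver, so plain matching pursuit is swapped in here as a simple stand-in. Excluding a sentence from its own dictionary, the `sparsity` parameter, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matching_pursuit(target, atoms, sparsity):
    """Greedy sparse coding of `target` over unit-norm `atoms`.

    Returns one coefficient per atom; at most `sparsity` greedy steps are taken.
    """
    residual = list(target)
    coeffs = [0.0] * len(atoms)
    for _ in range(sparsity):
        # Pick the atom most correlated with the part not yet explained.
        scores = [dot(residual, atom) for atom in atoms]
        best = max(range(len(atoms)), key=lambda j: abs(scores[j]))
        if abs(scores[best]) < 1e-12:
            break  # nothing left that any atom can explain
        coeffs[best] += scores[best]
        residual = [r - scores[best] * a for r, a in zip(residual, atoms[best])]
    return coeffs


def cumulative_sparse_similarity(sentence_vectors, sparsity=2):
    """Score each sentence by how much it contributes to reconstructing the others."""
    atoms = [normalize(v) for v in sentence_vectors]
    totals = [0.0] * len(atoms)
    for i, target in enumerate(atoms):
        # Assumption: a sentence is not allowed to reconstruct itself.
        others = [j for j in range(len(atoms)) if j != i]
        coeffs = matching_pursuit(target, [atoms[j] for j in others], sparsity)
        for j, c in zip(others, coeffs):
            totals[j] += abs(c)
    return totals


def top_n_sentences(sentence_vectors, n, sparsity=2):
    """Return the indices of the n sentences with the highest cumulative score."""
    totals = cumulative_sparse_similarity(sentence_vectors, sparsity)
    ranked = sorted(range(len(totals)), key=lambda j: totals[j], reverse=True)
    return ranked[:n]
```

For example, given the three toy vectors `[1, 0]`, `[0, 1]`, and `[1, 1]` with `sparsity=1`, `top_n_sentences(vectors, 1)` returns `[2]`: the third vector is the best single atom for reconstructing each of the other two, so it accumulates the largest score. In practice the inputs would be the sentence vectors of one semantic subspace.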

Keywords: automatic summarization; dictionary learning; sparse reconstruction

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
