Multi-Document Summarization Based on LSA and pLSA  Cited by: 6


Author: Yu Hui (俞辉) [1]

Affiliation: [1] College of Computer and Communication Engineering, China University of Petroleum, Dongying, Shandong 257061, China

Source: Computer Engineering & Science (《计算机工程与科学》), 2009, No. 9, pp. 108-111 (4 pages)

Abstract: This paper proposes a multi-document summarization strategy based on latent semantic analysis (LSA) and probabilistic latent semantic analysis (pLSA). First, the documents are split into natural paragraphs, which serve as the clustering units. A new feature extraction method is used to construct the word-paragraph matrix. LSA, which stems from linear algebra, performs a singular value decomposition of this matrix, so that unimportant information is filtered out and the high-dimensional representation in the vector space model becomes a low-dimensional representation in the latent semantic space. pLSA then converts the co-occurrence data into a probabilistic model. In the summary generation stage, a centroid-based method selects the summary sentences and produces the output summary. Experiments show that the proposed method effectively improves the quality of the generated summaries.
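The abstract names the stages of the pipeline but gives no implementation detail. As a rough illustration only, the sketch below fits an asymmetric pLSA model by EM on raw paragraph-word counts and then ranks paragraphs by cosine similarity to the centroid of their topic vectors. The function names `plsa` and `rank_by_centroid`, the raw-count features, the random initialization, and the choice to rank paragraphs rather than sentences are all assumptions made for this example. The paper's own feature extraction, its SVD step, and its sentence-selection details are not reproduced here.

```python
import math
import random
from collections import Counter


def plsa(units, n_topics=2, n_iter=50, seed=0):
    """Fit asymmetric pLSA, P(w|d) = sum_z P(w|z) P(z|d), by EM.

    units: list of token lists (here, one list per paragraph).
    Returns (p_z_d, p_w_z, vocab) with p_z_d[d][z] and p_w_z[z][w],
    where w indexes into vocab. Hypothetical helper, not the paper's code.
    """
    rng = random.Random(seed)
    counts = [Counter(u) for u in units]
    vocab = sorted({w for c in counts for w in c})
    w_index = {w: i for i, w in enumerate(vocab)}

    def normalized(row):
        # Scale a non-negative row to sum to 1; fall back to uniform if all zero.
        total = sum(row)
        if total > 0:
            return [x / total for x in row]
        return [1.0 / len(row)] * len(row)

    # Random positive starting points for P(z|d) and P(w|z).
    p_z_d = [normalized([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in units]
    p_w_z = [normalized([rng.random() + 1e-3 for _ in vocab])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        new_w_z = [[0.0] * len(vocab) for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in units]
        for d, c in enumerate(counts):
            for w, n in c.items():
                i = w_index[w]
                # E-step: posterior P(z | d, w) for this (paragraph, word) pair.
                post = normalized([p_z_d[d][z] * p_w_z[z][i]
                                   for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    new_w_z[z][i] += n * post[z]
                    new_z_d[d][z] += n * post[z]
        # M-step: renormalize expected counts into probabilities.
        p_w_z = [normalized(row) for row in new_w_z]
        p_z_d = [normalized(row) for row in new_z_d]
    return p_z_d, p_w_z, vocab


def rank_by_centroid(vectors):
    """Return indices sorted by cosine similarity to the mean vector, best first."""
    dim = len(vectors[0])
    centroid = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    return sorted(range(len(vectors)),
                  key=lambda i: cosine(vectors[i], centroid),
                  reverse=True)
```

Under these assumptions, calling `plsa(units)` on tokenized paragraphs and passing the returned `p_z_d` to `rank_by_centroid` gives an ordering from which the top few units could be taken as a summary. The same ranking function would apply unchanged to low-dimensional LSA coordinates in place of topic distributions.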

Keywords: multi-document summarization; latent semantic analysis; singular value decomposition

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
