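Affiliation: [1] School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
Source: Journal of Nanjing University (Natural Science), 2011, No. 4, pp. 398-406 (9 pages)
Funding: Key Project of Science and Technology Research of the Ministry of Education (108126); National Natural Science Foundation of China (10871019/a0107)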
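Abstract: Multi-document summarization, one of the important techniques in natural language processing, helps users acquire information efficiently from different perspectives. Because the content of a document collection often comes from different sources, the texts usually have rich and complex semantic relations among them. The commonly used word-based document representation struggles to supply sufficient and accurate data for the semantic analysis involved in summarization. We therefore propose using Wikipedia, currently the world's largest online concept corpus, to provide semantic support for multi-document summary extraction. On the one hand, we extract the Wikipedia concepts that occur in the documents to generate an accurate and consistent sentence representation. On the other hand, when computing sentence features, we use the first paragraph of Wikipedia entries to guide summary extraction. We first obtain concept weights by computing each concept's global relatedness in Wikipedia and its local relatedness within the current document collection. Then, building on the Wikipedia concept representation, we extract several Wikipedia-based features for the sentences in the documents and finally use them to generate the summary. In the experiments, we generate summaries with each Wikipedia feature independently in turn and evaluate summary quality with the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric. The comparison confirms that the first paragraphs of Wikipedia entries improve summary quality.

English abstract: As an important technique in natural language processing, multi-document summarization can facilitate users' information retrieval. Because the documents in a collection are usually gathered from different sources, abundant and complex semantic relations exist within a document collection. The widely used word-based text representation can hardly provide sufficient and accurate information for the semantic analysis in the summarization process. We therefore use Wikipedia, which has extensive concept coverage, to extract a concept-based representation of documents. We assess the importance of concepts using both global and local information. The global relatedness of concepts is based on Wikipedia's link structure, while the local relatedness is calculated from the concepts' co-occurrence in sentences. Three wiki-based features are proposed. The first is the widely used sentence salience feature based on a Markov chain. The other two are both based on sentence similarity with the first paragraphs of concept articles in Wikipedia, but one uses all concepts occurring in the collection while the other uses only those contained in the sentence itself. Finally, we linearly combine these features to select important sentences, which are then concatenated to form the summary. We compared these features in experiments and showed that the first paragraphs of related concepts' Wikipedia articles yield better summary quality.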
Classification: TP39 [Automation and Computer Technology - Computer Application Technology]
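
The abstract above describes scoring sentences by a linear combination of features, selecting the highest-scoring sentences, and evaluating the result with ROUGE. The following is a minimal Python sketch of those two steps only, not the authors' implementation. The function names, the feature values, and the coefficients are all hypothetical and chosen for illustration; the record does not give the actual feature definitions or weights. The evaluation function computes plain ROUGE-1 recall (clipped unigram overlap divided by the number of reference unigrams), which is one simple member of the ROUGE family.

```python
from collections import Counter


def combine_features(feature_rows, coefficients):
    """Score each sentence as a weighted sum of its feature values.

    feature_rows: one list of feature values per sentence (hypothetical values).
    coefficients: one weight per feature (hypothetical, not from the paper).
    """
    return [
        sum(weight * value for weight, value in zip(coefficients, row))
        for row in feature_rows
    ]


def select_sentences(scores, k):
    """Return the indices of the k highest-scoring sentences, in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def rouge1_recall(candidate_tokens, reference_tokens):
    """ROUGE-1 recall: clipped unigram overlap / number of reference unigrams."""
    if not reference_tokens:
        return 0.0
    candidate_counts = Counter(candidate_tokens)
    reference_counts = Counter(reference_tokens)
    overlap = sum(
        min(count, candidate_counts[token])
        for token, count in reference_counts.items()
    )
    return overlap / len(reference_tokens)


if __name__ == "__main__":
    # Three sentences, three illustrative feature values each.
    features = [[0.2, 0.1, 0.0], [0.9, 0.8, 0.7], [0.5, 0.5, 0.5]]
    scores = combine_features(features, [0.5, 0.25, 0.25])
    print(select_sentences(scores, 2))
    print(rouge1_recall("the cat sat".split(), "the cat sat on the mat".split()))
```

Returning the selected indices in document order reflects the abstract's statement that the chosen sentences are concatenated to form the summary; any other ordering policy would be an equally valid assumption.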