Supervised Learning of an Automatic Noisy Semantic Unit Filter for Multi-Document Summarization

(Original title: 多文档文摘语义单元自动去噪器的监督学习方法)


Authors: Gong Shu (龚书)[1], Qu Youli (瞿有利)[1], Tian Shengfeng (田盛丰)[1]

Affiliation: [1] School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2013, No. 4, pp. 873-882 (10 pages)

Funding: National Natural Science Foundation of China (61105056); Fundamental Research Funds for the Central Universities (2011JBM231)

Abstract: Multi-document summarization operates on document sets that contain noise. Existing summarization systems generally use a fixed-threshold noise filter, with a manually chosen threshold, to discard low-frequency units. Experiments show, however, that summarization algorithms differ in their own robustness to noise and that the best threshold varies with the document set, the summarization algorithm, and the text representation, so a manually set fixed threshold can achieve neither good generality nor stable filtering performance. This paper therefore proposes a supervised learning method for generating an automatic noise filter. Labels are obtained automatically from human-written summaries, multiple features are extracted for each semantic unit, and a semantic unit classifier is trained to serve as the filter. The resulting filter applies to semantic units produced by different text representation methods and can automatically remove noisy semantic units from any document set at the preprocessing stage of different multi-document summarization systems. Experiments show that the automatic noise filter generalizes across document sets, summarization algorithms, and text representations, and that its filtering improves, to varying degrees, both the speed of each summarization algorithm and the quality of the extracted summaries.
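The contrast drawn in the abstract, a fixed frequency threshold versus a classifier trained on labels taken from human summaries, can be illustrated with a small sketch. Everything below is an assumption made for illustration: the abstract does not name the features, the labeling rule, or the classifier, so the two features (document frequency and earliness of first occurrence), the "unit appears in a human summary" labeling rule, the logistic-regression learner, and all function names are hypothetical and are not the authors' implementation. Semantic units are simplified to single words.

```python
import math
from collections import Counter

def fixed_threshold_filter(docs, min_freq):
    """Baseline: keep units whose total frequency reaches a hand-set threshold."""
    freq = Counter(unit for doc in docs for unit in doc)
    return {unit for unit, count in freq.items() if count >= min_freq}

def label_units(docs, human_summaries):
    """Assumed labeling rule: 1 if the unit occurs in a human summary, else 0."""
    in_summary = {unit for summary in human_summaries for unit in summary}
    return {unit: int(unit in in_summary) for doc in docs for unit in doc}

def extract_features(unit, docs):
    """Hypothetical features: bias term, document frequency, earliness."""
    containing = [doc for doc in docs if unit in doc]
    doc_freq = len(containing) / len(docs)
    earliest = min(doc.index(unit) / len(doc) for doc in containing)
    return [1.0, doc_freq, 1.0 - earliest]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, features):
    return sigmoid(sum(w * x for w, x in zip(weights, features)))

def train_classifier(docs, human_summaries, lr=0.5, epochs=2000):
    """Fit logistic regression by plain stochastic gradient ascent."""
    labels = label_units(docs, human_summaries)
    units = sorted(labels)  # fixed order keeps training deterministic
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for unit in units:
            x = extract_features(unit, docs)
            error = labels[unit] - score(weights, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights

def learned_filter(weights, docs):
    """Keep units the classifier scores as non-noise (probability >= 0.5)."""
    units = {unit for doc in docs for unit in doc}
    return {u for u in units if score(weights, extract_features(u, docs)) >= 0.5}

# Toy document set: each document is a list of units.
docs = [
    ["storm", "hits", "coast", "click", "here"],
    ["storm", "damages", "coast", "homes", "subscribe"],
    ["coast", "storm", "warning", "issued", "advert"],
]
human_summaries = [["storm", "coast"]]

baseline = fixed_threshold_filter(docs, min_freq=2)
weights = train_classifier(docs, human_summaries)
kept = learned_filter(weights, docs)
```

In this toy the classifier is trained and applied on the same documents only to keep the example short. In the setting the abstract describes, the filter would be trained on collections that have human summaries and then applied to new document sets, where no threshold has to be chosen by hand.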

Keywords: automatic noise filtering; supervised learning; multi-document summarization; text representation; preprocessing

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
