基于双层决策的新闻网页正文精确抽取  Cited by: 16

Precise Content Extraction from News Web Page Based on Decisions of Two Layers


Authors: Hu Guoping (胡国平)[1], Zhang Wei (张巍)[1], Wang Renhua (王仁华)[1]

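Affiliation: [1] iFlytek Speech Laboratory, Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, Anhui 230027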

Source: Journal of Chinese Information Processing (《中文信息学报》), 2006, No. 6, pp. 1-9, 103 (10 pages in total)

Funding: National Natural Science Foundation of China (69975018)

Abstract: This paper proposes a precise content extraction algorithm for news web pages based on two layers of decisions: a global decision on the region of the page in which the content lies, and a local decision on whether each paragraph within that region is actually content. We first give a strict definition of news web page content driven by the needs of practical applications, then analyze the characteristics of news web pages and their content, and propose a two-layer extraction strategy. Both layers are modeled with feature vector extraction and decision tree learning, and the models are trained and tested on 1687 news pages from 10 major news web sites in China. The experimental results show that the method extracts the content of news web pages precisely: only 18.14% of pages have extracted content that is not fully consistent with the manual annotation, a relative reduction of 29.85% compared with using the local content decision alone, and only 7.11% of pages have an extraction error rate above 10%, which meets the needs of practical applications.

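Keywords: computer application; Chinese information processing; information extraction; feature vector; decision tree; content extraction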

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
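
The two-layer strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the features or the learned decision trees, so everything here is an assumption made for illustration. The `Paragraph` fields, the three features, the thresholds in `looks_like_content` (a hand-written rule standing in for the learned local decision), and the maximum-sum span search in `global_scope` (standing in for the learned global decision) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Paragraph:
    text: str
    link_chars: int  # characters inside hyperlinks (hypothetical feature)


def plain(text: str) -> Paragraph:
    return Paragraph(text, 0)


def link(text: str) -> Paragraph:
    return Paragraph(text, len(text))


def features(p: Paragraph) -> dict:
    """Tiny illustrative feature vector; the paper's real features are not given here."""
    n = len(p.text)
    return {
        "length": n,
        "link_ratio": p.link_chars / n if n else 1.0,
        "punct": sum(ch in ",.!?" for ch in p.text),
    }


def looks_like_content(p: Paragraph) -> bool:
    """Local decision: hand-written rule standing in for a learned decision tree."""
    f = features(p)
    if f["link_ratio"] > 0.5 or f["length"] < 10:
        return False
    return f["punct"] >= 1


def global_scope(paragraphs: List[Paragraph]) -> Tuple[int, int]:
    """Global decision: pick the contiguous span [start, end) with the highest
    length-weighted score (an assumed stand-in for the learned global model)."""
    best_sum, best = 0, (0, 0)
    cur_sum, cur_start = 0, 0
    for i, p in enumerate(paragraphs):
        score = len(p.text) if looks_like_content(p) else -len(p.text)
        if cur_sum <= 0:  # restart the candidate span here
            cur_sum, cur_start = 0, i
        cur_sum += score
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best


def extract(paragraphs: List[Paragraph]) -> List[str]:
    """Layer 1 narrows the page to a region; layer 2 filters paragraphs inside it."""
    start, end = global_scope(paragraphs)
    return [p.text for p in paragraphs[start:end] if looks_like_content(p)]


page = [
    link("Home News Sports"),
    plain("The city council approved the new budget on Monday."),
    link("Related stories"),
    plain("Officials said the plan takes effect next month, pending review."),
    link("About us | Contact | Sitemap | Advertise | Careers | Help"),
    link("Privacy"),
    plain("Copyright 2006. All rights reserved."),
]

if __name__ == "__main__":
    for line in extract(page):
        print(line)
```

In this toy page the copyright line passes the local rule on its own, so a local-only filter would keep it. The global layer excludes it because it lies outside the selected region, which is the kind of error the abstract attributes to using the local decision alone.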

 
