A Page Content Extraction Algorithm Based on Multiple Strategies (cited by: 4)

Web Content Extraction Based on Multiple Strategies

Source: Journal of Southwest Jiaotong University (《西南交通大学学报》), 2007, No. 4, pp. 473-477 (5 pages)

Abstract: To address the problem of topic-irrelevant noise in Web pages, a multi-strategy page content extraction algorithm that combines page structure with page content is proposed. The algorithm uses an improved VIPS (vision-based page segmentation) algorithm to build the block tree of a page. By defining a degree-of-coherence threshold and a maximum depth for the block tree, it allows different segmentation granularities in different regions of the tree. Using the structure information and content information provided by the Web page, it then extracts the "topic" and "topic-relevant" blocks from the leaf nodes of the block tree. Finally, the contents of these blocks are merged to obtain the main content of the page. Experiments on Web pages from three sites indicate that the algorithm effectively extracts the content of arbitrarily downloaded pages of different content types.
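The pipeline described in the abstract (segment into a block tree, select leaf blocks, merge) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Block` class, the precomputed `coherence` scores, the link-density test (standing in for "structure information"), the title-word overlap test (standing in for "content information"), and all threshold values are assumptions chosen for the example. The abstract does not specify how the improved VIPS computes coherence or how blocks are classified.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Block:
    """Hypothetical node of a block tree; not the paper's data structure."""
    text: str = ""
    coherence: float = 1.0  # assumed precomputed degree of coherence in [0, 1]
    link_chars: int = 0     # number of characters that sit inside hyperlinks
    children: List["Block"] = field(default_factory=list)


def text_of(block: Block) -> str:
    """Text of a block, aggregated over its descendants if it has any."""
    if not block.children:
        return block.text
    return " ".join(text_of(c) for c in block.children)


def link_chars_of(block: Block) -> int:
    """Hyperlink character count, aggregated over descendants."""
    if not block.children:
        return block.link_chars
    return sum(link_chars_of(c) for c in block.children)


def leaf_blocks(block: Block, threshold: float, max_depth: int,
                depth: int = 0) -> List[Block]:
    """Stop splitting when a block is coherent enough, the depth limit is
    reached, or nothing is left to split; otherwise recurse into children."""
    if (block.coherence >= threshold or depth >= max_depth
            or not block.children):
        return [block]
    out: List[Block] = []
    for child in block.children:
        out.extend(leaf_blocks(child, threshold, max_depth, depth + 1))
    return out


def classify(block: Block, title_words: Set[str], min_len: int = 40,
             max_link_density: float = 0.3) -> str:
    """Assumed heuristics: link density as the structure cue, text length
    and overlap with the page title as the content cues."""
    text = text_of(block)
    if not text or link_chars_of(block) / len(text) > max_link_density:
        return "noise"
    if len(text) >= min_len:
        return "topic"
    if set(text.lower().split()) & title_words:
        return "topic-relevant"
    return "noise"


def extract(root: Block, title: str, threshold: float = 0.8,
            max_depth: int = 5) -> str:
    """Merge the text of topic and topic-relevant leaf blocks."""
    title_words = set(title.lower().split())
    kept = [text_of(b) for b in leaf_blocks(root, threshold, max_depth)
            if classify(b, title_words) != "noise"]
    return "\n".join(kept)


# Toy page: navigation bar, a body with headline and article, and a footer.
ARTICLE = ("The algorithm segments a page into blocks and keeps only "
           "the blocks about the main topic.")
page = Block(coherence=0.2, children=[
    Block(text="Home News Sports Contact", coherence=0.9, link_chars=24),
    Block(coherence=0.5, children=[
        Block(text="Web content extraction"),
        Block(text=ARTICLE),
    ]),
    Block(text="Copyright 2007"),
])
print(extract(page, title="Web content extraction with VIPS"))
```

In this sketch, lowering `max_depth` or `threshold` yields coarser leaf blocks, which mirrors the abstract's point that the two parameters control segmentation granularity in different regions of the tree.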

Keywords: VIPS (vision-based page segmentation); degree of coherence; maximum depth; content information; structure information

Classification number: TP393.092 [Automation and Computer Technology: Computer Application Technology]

