Extraction Model Based on Web Format Information Quantity in Blog Post and Comment Extraction (基于网页格式信息量的博客文章和评论抽取模型)

Cited by: 15

Authors: Cao Donglin (曹冬林)[1,2,3], Liao Xiangwen (廖祥文)[1,2], Xu Hongbo (许洪波)[1], Bai Shuo (白硕)[1]

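Affiliations: [1] Research Division of Network Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; [2] Graduate University of the Chinese Academy of Sciences, Beijing 100049; [3] Department of Intelligence Science (智能科学系), Xiamen University, Xiamen, Fujian 361005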

Source: Journal of Software (《软件学报》), 2009, No. 5, pp. 1282-1291 (10 pages)

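Funding: National Basic Research Program of China (973 Program), Nos. 2004CB318109, 2007CB311100; National High-Tech Research and Development Program of China (863 Program), No. 2007AA01Z441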

Abstract: From an information-theoretic perspective, this paper proposes a model for extracting blog posts and comments based on Web format information quantity. First, the visual position information of a blog Web page is combined with the effective text information to locate the main text, which represents the theme of the page. Second, the format information in the blog Web page is treated as the information unit, the format information quantity of each information block is computed, and the boundary between the post and its comments in the main text is detected by computing the minimal separating-position information quantity. Because the model is language-independent, it applies to blogs written in different natural languages. Experimental results show that the model achieves high precision in both locating the main text and separating posts from comments.
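The abstract does not give the paper's formulas, so the following Python sketch only illustrates the general idea of scoring blocks by their format information and choosing a split position by an information criterion. Every definition in it is an assumption rather than the authors' method: blocks are modelled as lists of HTML tag names, a block's information quantity is taken to be the Shannon self-information of its tags, and the split criterion is a size-weighted entropy of the two sides, used here as a stand-in for the paper's separating-position information quantity. The function names `tag_probabilities`, `block_information`, `entropy`, and `best_split` are hypothetical.

```python
import math
from collections import Counter


def tag_probabilities(blocks):
    """Estimate each tag's probability from its frequency over all blocks."""
    counts = Counter(tag for block in blocks for tag in block)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


def block_information(block, tag_probs):
    """Assumed definition: self-information (in bits) of a block's format tags."""
    return sum(-math.log2(tag_probs[tag]) for tag in block)


def entropy(tags):
    """Shannon entropy (bits per tag) of a sequence of tags."""
    counts = Counter(tags)
    total = len(tags)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_split(blocks):
    """Return i so that blocks[:i] is treated as the post and blocks[i:] as comments.

    Assumed criterion: minimise the size-weighted entropy of the tags on
    both sides, so each side is as uniform in format as possible.
    """
    best_index, best_cost = None, math.inf
    for i in range(1, len(blocks)):
        left = [tag for block in blocks[:i] for tag in block]
        right = [tag for block in blocks[i:] for tag in block]
        cost = len(left) * entropy(left) + len(right) * entropy(right)
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index


if __name__ == "__main__":
    # Toy main text: two paragraph-style blocks followed by three
    # comment-style blocks that repeat the same tag pattern.
    blocks = [["p", "p"], ["p", "p"],
              ["div", "span"], ["div", "span"], ["div", "span"]]
    probs = tag_probabilities(blocks)
    print([round(block_information(b, probs), 3) for b in blocks])
    print(best_split(blocks))  # 2
```

On the toy input the split falls between the paragraph-style blocks and the repeated comment-style blocks, because that position leaves each side with the most uniform tag distribution. Only tag names are used, never the words of the text, which is consistent with the language-independence the abstract claims.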

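Keywords: blog information extraction; minimal main-text subtree; effective information rate; Web page format information; visual information; separating-position information quantity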

Classification: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]
