Web Information Extraction Based on Probabilistic Model

Cited by: 4

Source: Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》), 2010, No. 6, pp. 847-855 (9 pages)

Funding: Supported by the National Key Technology R&D Program of China (No. 2007BAH08B02)

Abstract: To account for the two-dimensional structure and the content characteristics of web pages, a tree-structured hierarchical conditional random fields model (TH-CRFs) is proposed for web object extraction. First, an improved multi-feature vector space model represents the features of each page in terms of both its structure and its content. Second, a Boolean model and multi-rule attributes are introduced to better capture the structural and semantic features of web objects. Third, TH-CRFs are used to extract web object information, identifying the relevant recruitment information while improving the efficiency of model training. Experiments comparing the approach with existing web information extraction models show that TH-CRFs-based extraction improves accuracy and reduces the time complexity of extraction.

Keywords: Web object; conditional random fields (CRFs); information extraction (IE)

Classification No.: TP393.09 [Automation and Computer Technology: Computer Application Technology]
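
The abstract does not give the TH-CRFs formulation, so the following is only a generic sketch of how labels can be inferred on a tree-structured CRF laid over a page's element tree. It uses max-product (Viterbi-style) message passing from the leaves to the root. Everything in it is an assumption made for illustration: the `Node` class, the two labels, the node scores, and the `edge_score` function are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field

# Hypothetical label set; the paper's actual labels are not given in the abstract.
LABELS = ["job_title", "other"]


@dataclass
class Node:
    name: str            # must be unique within the tree (used as a dict key)
    scores: dict         # label -> node log-potential (assumed given)
    children: list = field(default_factory=list)


def edge_score(parent_label, child_label):
    # Hypothetical pairwise log-potential: reward parent/child label agreement.
    return 0.5 if parent_label == child_label else 0.0


def upward(node, best, back):
    """Fill best[name][label] with the best score of the subtree rooted at
    `node` when `node` takes `label`; back[...] stores the children's argmax."""
    for child in node.children:
        upward(child, best, back)
    best[node.name], back[node.name] = {}, {}
    for label in LABELS:
        total, choice = node.scores[label], {}
        for child in node.children:
            cand = {cl: best[child.name][cl] + edge_score(label, cl)
                    for cl in LABELS}
            child_label = max(cand, key=cand.get)
            total += cand[child_label]
            choice[child.name] = child_label
        best[node.name][label] = total
        back[node.name][label] = choice


def downward(node, label, back, out):
    """Read the best labelling back from the stored argmax choices."""
    out[node.name] = label
    for child in node.children:
        downward(child, back[node.name][label][child.name], back, out)


def map_labels(root):
    """Return (labelling, score) of the highest-scoring joint assignment."""
    best, back, out = {}, {}, {}
    upward(root, best, back)
    root_label = max(best[root.name], key=best[root.name].get)
    downward(root, root_label, back, out)
    return out, best[root.name][root_label]


def build_example():
    # Toy page fragment: a block containing a heading and a paragraph.
    h2 = Node("h2", {"job_title": 2.0, "other": 0.0})
    p = Node("p", {"job_title": 0.2, "other": 0.4})
    return Node("block", {"job_title": 0.0, "other": 1.0}, [h2, p])


if __name__ == "__main__":
    labels, score = map_labels(build_example())
    print(labels, score)
```

Because the graph is a tree, one upward and one downward pass yield the exact best labelling in time linear in the number of nodes (times the square of the number of labels). That is the general reason tree-structured models are cheaper to decode than models with cyclic dependencies. Whether TH-CRFs decode in exactly this way is not stated in the abstract.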
