Semantic Annotation of Web Data Based on Ensemble Learning and 2D Correlative-Chain Conditional Random Fields

Cited by: 6

Source: Chinese Journal of Computers (《计算机学报》), 2010, No. 2, pp. 267-278 (12 pages)

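Funding: Supported by the National Natural Science Foundation of China (90818001) and the Natural Science Foundation of Shandong Province (Y2007G24).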

Abstract: Large-scale Web information extraction requires extracting Web data objects accurately and automatically from many related Web sites. Most existing Web information extraction methods target a single Web site and therefore cannot meet this need. An empirical study shows that automatic semantic annotation of Web data, combined with existing wrapper generation techniques, can meet the requirements of large-scale Web information extraction. This paper proposes a method for automatic semantic annotation of Web data based on ensemble learning and two-dimensional Correlative-Chain Conditional Random Fields (2DCC-CRFs). First, several classifiers are built from different kinds of features, drawn from previously extracted data and from the training pages of the target Web site. Next, the classifier outputs are combined using Dempster's rule of combination (Dempster-Shafer theory of evidence) to distinguish attribute labels from data elements in the training pages. Finally, a 2DCC-CRF model, which extends the classic 2DCRF model by adding correlative edges, captures both long-distance and short-distance dependencies among Web data elements and annotates them automatically. Experiments on real-world data sets from multiple domains show that the proposed approach annotates Web data efficiently and can meet the needs of large-scale Web information extraction.
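As an illustration of the classifier-fusion step, here is a minimal sketch of Dempster's rule of combination for two classifiers that each assign belief mass to "attribute label", "data element", or "either". The function name `dempster_combine`, the two classifier names, and all mass values are hypothetical and chosen only for this example; the abstract does not give the paper's actual features, classifiers, or numbers.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence: the product of masses goes to the intersection.
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Contradictory evidence: accumulate the conflict mass K.
            conflict += mass_a * mass_b
    norm = 1.0 - conflict
    if norm <= 1e-12:
        raise ValueError("total conflict: Dempster's rule is undefined")
    # Renormalise by 1 - K so the combined masses sum to 1.
    return {hypothesis: mass / norm for hypothesis, mass in combined.items()}


# Frame of discernment for one text node on a page (hypothetical names).
LABEL = frozenset({"attribute_label"})
DATA = frozenset({"data_element"})
EITHER = LABEL | DATA  # mass left on the whole frame expresses uncertainty

# Hypothetical outputs of two classifiers built on different feature types.
m_text = {LABEL: 0.6, DATA: 0.1, EITHER: 0.3}
m_layout = {LABEL: 0.5, DATA: 0.2, EITHER: 0.3}

fused = dempster_combine(m_text, m_layout)
decision = max((LABEL, DATA), key=lambda h: fused.get(h, 0.0))
print(fused, decision)
```

In this sketch the node would be treated as an attribute label because the fused mass for that hypothesis is the larger of the two. Further classifiers can be folded in by calling `dempster_combine` again on the fused result, since the rule is associative.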

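Keywords: Web information extraction; semantic annotation; ensemble learning; conditional random fields; long-distance dependency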

Classification code: TP393 [Automation and Computer Technology: Computer Application Technology]
