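Authors: Ding Yanhui (丁艳辉)[1], Li Qingzhong (李庆忠)[1], Dong Yongquan (董永权)[1], Peng Zhaohui (彭朝晖)[1]
Affiliation: [1] School of Computer Science and Technology, Shandong University, Jinan 250014
Source: Chinese Journal of Computers (《计算机学报》), 2010, No. 2, pp. 267-278 (12 pages)
Funding: Supported by the National Natural Science Foundation of China (90818001) and the Natural Science Foundation of Shandong Province (Y2007G24)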
Abstract: Large-scale Web information extraction requires extracting Web data objects from many related Web sites accurately and automatically. Most current Web information extraction methods focus on a single Web site and therefore cannot meet this need. An empirical study shows that automatic semantic annotation of Web data, combined with existing wrapper generation techniques, can meet the requirements of large-scale Web information extraction. This paper proposes a method for automatic semantic annotation of Web data based on ensemble learning and two-dimensional Correlative-Chain Conditional Random Fields (2DCC-CRFs). First, several classifiers are built from the previously extracted data and from features presented in the training pages of the target Web site. The classifier results are then combined using Dempster's rule of combination from the Dempster-Shafer theory of evidence to distinguish attribute labels from data elements in the training pages. Finally, a 2DCC-CRFs model, which extends the classic 2DCRFs model by adding correlative edges, models both the long-distance and the short-distance dependencies among Web data elements and annotates the data elements automatically. Experimental results on real-world data sets collected from multiple domains show that the proposed method performs automatic semantic annotation of Web data efficiently and meets the needs of large-scale Web information extraction.
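Keywords: Web information extraction; semantic annotation; ensemble learning; conditional random fields; long-distance dependency
Classification: TP393 [Automation and Computer Technology: Computer Application Technology]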
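The abstract mentions merging several classifiers with Dempster's rule of combination to separate attribute labels from data elements. The following is a minimal Python sketch of that rule only, not the authors' implementation. The two-class frame (`LABEL`, `DATA`), the function name `dempster_combine`, and the mass values are illustrative assumptions; the abstract does not specify the classifiers, their features, or their outputs.

```python
from itertools import product

# Hypothetical frame of discernment: a page node is either an attribute
# label or a data element. EITHER represents "undecided" belief mass.
LABEL = frozenset({"label"})
DATA = frozenset({"data"})
EITHER = LABEL | DATA


def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence accumulates on the intersection of the two sets.
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Disjoint sets contribute to the conflict mass K.
            conflict += mass_a * mass_b
    norm = 1.0 - conflict
    if norm <= 1e-12:
        raise ValueError("Sources are in total conflict; the rule is undefined.")
    # Renormalise by 1 - K so the combined masses sum to 1.
    return {subset: mass / norm for subset, mass in combined.items()}


# Made-up outputs of two hypothetical classifiers for one page node.
m_visual = {LABEL: 0.6, DATA: 0.1, EITHER: 0.3}
m_content = {LABEL: 0.5, DATA: 0.2, EITHER: 0.3}

fused = dempster_combine(m_visual, m_content)
decision = max(fused, key=fused.get)
print({tuple(sorted(k)): round(v, 3) for k, v in fused.items()})
print("decision:", tuple(sorted(decision)))
```

Because the rule is associative, more than two classifiers can be folded in one at a time, for example with `functools.reduce(dempster_combine, masses)`.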