Picture-text webpage model and page element feature induction

Original title: 页面图文模型与元素特征归纳

Source: Computer Engineering & Science (《计算机工程与科学》), 2013, No. 4, pp. 136-143 (8 pages)

Funding: National 863 Program of China (2010AA012404)

Abstract: For webpage information extraction centered on picture-text content, this paper formally proposes a theoretical model for analyzing the elements of a page. By defining a basic element set and transformation rules, the picture-text page model simplifies the page's DOM tree structure and exposes the picture and text features of the elements within the page. On this basis, an element classification similarity is defined and used to select among the element features of the picture-text page model and induce the best classification features; an algorithm for obtaining the best classification feature set and the recognition threshold is proposed and implemented. Experimental results show that the picture-text page model reduces the number of page elements, and that the feature set induction algorithm achieves the desired classification accuracy at a low learning cost.

Keywords: webpage information extraction; page element; picture-text model; feature induction

Classification code: TP393.09 [Automation and Computer Technology: Computer Application Technology]
