Study on Tables Information Extraction Based on Web

Cited by: 6

Source: Computer Technology and Development (《计算机技术与发展》), 2010, No. 2, pp. 217-220 (4 pages)

Funding: Key Natural Science Research Project of Anhui Province (2005KJ004ZD)

Abstract: The Web has become the main platform for networked information, and studies show that tables are used frequently in Web documents. Because tables are concise and information-rich, automatic table understanding is widely applicable in knowledge management, information retrieval, Web mining, and related applications, so research on Web table information extraction has real practical significance. A large amount of information on the Internet is presented in HTML tables, but because HTML does not describe the content of the data, machines cannot understand or query it. This paper first converts HTML documents to XML documents and combines them with an ontology to form heuristic rules, then analyzes two key techniques: table detection and table structure recognition. On this basis, it uses HTML table attributes to normalize HTML tables, making the approach suitable for extracting information from complex tables.
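The abstract does not say how the tables are normalized. As an illustration only, the sketch below shows one common reading of "normalizing HTML tables according to their attributes": expanding `rowspan` and `colspan` so that every logical row has the same number of cells. The names `TableGridParser` and `normalize_table` are hypothetical, the code uses only Python's standard `html.parser`, and it assumes a single, non-nested table with well-formed integer span values. It is not the paper's method.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect table cells row by row as [text, rowspan, colspan]."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cells per <tr>
        self._cell = None   # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            # Missing span attributes default to 1.
            self._cell = ["", int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]
            self.rows[-1].append(self._cell)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._cell = None

    def handle_data(self, data):
        # Only text inside a cell is kept.
        if self._cell is not None:
            self._cell[0] += data


def normalize_table(html):
    """Return a rectangular grid in which spanned cells are repeated."""
    parser = TableGridParser()
    parser.feed(html)
    parser.close()

    grid = {}  # (row, column) -> cell text
    for r, row in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already occupied by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the cell text into every position the cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text.strip()
            c += colspan

    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
```

Calling `normalize_table` on the HTML source of a table with merged header cells returns a list of equal-length rows, so a later extraction step can address each value by row and column without tracking the spans itself.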

Keywords: HTML tables; information extraction; Web; XML

Classification: TP393 [Automation and Computer Technology: Computer Application Technology]
