Development of an HTML Parser for a Web Document Cleaning System (cited by: 7)

A HTML Parser for Web Cleaning

Authors: Wang Qiang (王强)[1,2], Wang Jicheng (王继成)[1,2], Wu Gangshan (武港山)[1,2], Zhang Fuyan (张福炎)[1,2]

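Affiliations: [1] Department of Computer Science and Technology, Nanjing University, Nanjing, Jiangsu 210093; [2] State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu 210093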

Source: Application Research of Computers (《计算机应用研究》), 2002, No. 2, pp. 54-57 (4 pages)

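Funding: National Natural Science Foundation of China (60073030); Ministry of Education key project "Research on Key Technologies for Modern Distance Education"; Fujitsu Laboratories (Japan) funded project "Research on Web Document Cleaning Technology"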

Abstract: When building a Web-oriented information system, removing useless data such as scripts, advertisement links, and navigation links improves the efficiency of information storage and retrieval. Merging and splitting Web documents on a semantic basis also helps with information management. These are the tasks of a Web document cleaning system. In Web document cleaning, both offline rule learning and online document cleaning must rest on an analysis of the structure and content of Web documents. Starting from the general concepts of HTML parsing and taking the requirements of the Web document cleaning system into account, the paper describes the structure of an independently developed HTML parser and its components: the dictionary ...

When constructing a Web information system, we encounter a large amount of noise in Web pages, such as scripts, advertisement links, and navigation links. This noise must be removed to ensure the efficiency of information retrieval and storage. Moreover, semantics-oriented division or combination of Web pages also facilitates the organization of Web information. Together, these tasks define a Web Cleaning system. An important component of a Web Cleaning system is an HTML Parser, which transforms the text stream of an HTML document into a syntax tree so that the structure and content of the document can be manipulated more easily. The HTML Parser consists of three parts: first, a dictionary that stores the syntax of the HTML language; second, a lexer that scans the HTML text stream and finds tokens; third, a parser that constructs an HTML syntax tree from the tokens provided by the lexer. In this paper, we describe the design of the three parts in detail and briefly introduce the whole Web Cleaning system, in which the HTML Parser is implemented as a DLL (Dynamic Link Library).

Keywords: HTML parser; lexer; recursive descent; document cleaning system; Web; Internet

Classification: TP393.4 [Automation and Computer Technology - Computer Application Technology]
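
The abstract names three cooperating parts: a dictionary, a lexer, and a recursive-descent parser that builds a syntax tree. The following is a minimal Python sketch of that division of labor, not the authors' implementation (which the abstract says was delivered as a DLL). All names here (`VOID_TAGS`, `tokenize`, `Node`, `Parser`, `parse`) are hypothetical, the "dictionary" is reduced to a small set of tags that take no closing tag, and attributes, comments, and script contents are deliberately not handled.

```python
import re
from dataclasses import dataclass, field

# Part 1: the "dictionary" holds knowledge about the language. In this
# sketch it is only an assumed, tiny subset: tags that never have a close tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

# Part 2: the lexer scans the text stream and yields tokens.
# A tag is "<", optional "/", a name, then anything up to ">".
TOKEN_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")


def tokenize(html):
    """Yield ("open", name), ("close", name) or ("text", data) tokens."""
    for match in TOKEN_RE.finditer(html):
        slash, name, text = match.groups()
        if text is not None:
            if text.strip():  # drop whitespace-only runs
                yield ("text", text.strip())
        elif slash:
            yield ("close", name.lower())
        else:
            yield ("open", name.lower())


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


# Part 3: a recursive-descent parser, one method per grammar rule.
class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def parse_children(self, parent):
        """content := (text | element)*, stopping at a close tag or end of input."""
        while True:
            tok = self.peek()
            if tok is None or tok[0] == "close":
                return
            self.pos += 1
            if tok[0] == "text":
                parent.children.append(tok[1])
            else:
                parent.children.append(self.parse_element(tok[1]))

    def parse_element(self, name):
        """element := open content close; void tags have no content."""
        node = Node(name)
        if name in VOID_TAGS:
            return node
        self.parse_children(node)
        if self.peek() == ("close", name):
            self.pos += 1  # consume the matching close tag
        # A non-matching close tag is left for an ancestor to consume,
        # which implicitly closes this element.
        return node


def parse(html):
    """Build a tree under a synthetic "#document" root."""
    parser = Parser(tokenize(html))
    root = Node("#document")
    while parser.peek() is not None:
        parser.parse_children(root)
        if parser.peek() is not None:
            parser.pos += 1  # skip a stray close tag at the top level
    return root
```

Calling `parse("<p>Hi<br>there <b>you</b></p>")` returns a `#document` node whose single child is a `p` node containing the text `"Hi"`, a `br` node, the text `"there"`, and a `b` node. A cleaning step could then walk this tree and drop unwanted subtrees, which is the kind of manipulation the abstract says the syntax tree is meant to make easier.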
