Extraction of Homepage Text Information and Realization of Text Automatic Categorization

Original title: 网页信息抽取及其自动文本分类的实现

Cited by: 7

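Affiliations: [1] School of Electronics and Information, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu 212003; [2] Institute of Acoustics, Chinese Academy of Sciences, Beijing 100080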

Source: Computer Technology and Development (《计算机技术与发展》), 2008, No. 10, pp. 37-39 (3 pages)

Funding: National Natural Science Foundation of China (60573064)

Abstract: Web pages often contain content unrelated to the page's topic, and this useless information must be removed before a page yields usable text. Text classification is essential to the further processing of text information and is another research topic in the field of information search. To remove the useless information, the paper proposes a method for extracting a page's body text based on the structural characteristics of HTML itself and, by also using the article's title information, a simple method for automatic text classification. The method can improve the efficiency of body-text extraction and of automatic text classification. Experiments show that the method is feasible.

Keywords: tags; text classification; information extraction

Classification number: TP393 [Automation and Computer Technology: Computer Application Technology]
