Extracting topic information of Web page based on entropy
(Chinese title: 一种基于信息熵的Web页面主题信息抽取方法)

Cited by: 6

Source: Computer Engineering and Applications (《计算机工程与应用》), 2007, No. 4, pp. 164-166 (3 pages)

Abstract: This paper proposes an information extraction method that prunes nodes with a large information entropy increase. The HTML document is first parsed to build a DOM tree. Content that does not need processing is then filtered out according to a configuration, and a semantic model tree (called an STU tree in the paper's English abstract) is built. Finally, nodes whose entropy increase exceeds a threshold are pruned, and the page containing the extracted topic information is output. Preliminary experimental results confirm the effectiveness of the method for Web page information extraction. The mathematical model is simple and reliable, and topic extraction requires essentially no manual intervention. The method can be applied to Web data mining systems and to information acquisition on mobile devices such as PDAs.
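The abstract does not say how the entropy increase is computed or how the semantic model tree is defined, so the following is only a minimal sketch of the parse, filter, and prune pipeline under stated assumptions. It assumes Shannon entropy over the word distribution of a subtree, and it treats a child's "entropy increase" as the amount by which the parent's entropy drops when that child is left out. All names (`Node`, `TreeBuilder`, `SKIP_TAGS`, `VOID_TAGS`, `entropy`, `prune`, `extract_topic`) and the threshold value are hypothetical and do not come from the paper.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed configuration: tags whose content is filtered out before scoring.
SKIP_TAGS = {"script", "style", "noscript"}
# Elements without a closing tag; they carry no text, so they are ignored.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """A simplified DOM node holding only a tag name, children, and words."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []
        self.text = []  # word tokens that appear directly inside this node

    def tokens(self):
        """Return all word tokens in this subtree (own text first)."""
        out = list(self.text)
        for child in self.children:
            out.extend(child.tokens())
        return out


class TreeBuilder(HTMLParser):
    """Builds a Node tree and drops everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        # Close the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.stack[-1].text.extend(re.findall(r"\w+", data.lower()))


def entropy(tokens):
    """Shannon entropy (in bits) of the word frequency distribution."""
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def prune(node, threshold):
    """Remove children whose assumed entropy increase exceeds the threshold."""
    with_text = [c for c in node.children if c.tokens()]
    if len(with_text) >= 2:  # a lone child has nothing to be compared against
        base = entropy(node.tokens())
        kept = []
        for child in node.children:
            rest = list(node.text)
            for other in node.children:
                if other is not child:
                    rest.extend(other.tokens())
            if base - entropy(rest) <= threshold:
                kept.append(child)
        node.children = kept
    for child in node.children:
        prune(child, threshold)


def extract_topic(html, threshold):
    """Parse, filter, prune, and return the remaining words as one string."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root, threshold)
    return " ".join(builder.root.tokens())
```

In this sketch a navigation block made of many distinct one-off words raises the parent's entropy more than a body of text that repeats its topic words, so it is the block that gets pruned. Whether that matches the paper's actual criterion cannot be determined from the abstract alone, and the threshold would need tuning on real pages.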

Keywords: Web; extraction; STU-DOM tree; information entropy

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
