A Web Information Extraction Method Combining DOM Tree Tag Paths and Line-Block Density (cited by: 5)

Web information extraction based on label path of DOM tree and block density

Source: Intelligent Computer and Applications (《智能计算机与应用》), 2017, Issue 4, pp. 13-16, 20 (5 pages)

Funding: 2016 Research Fund of the Business College of Shanxi University (2016008)

Abstract: This paper proposes an information extraction method that combines tag paths with a line-block distribution function to extract information from Web pages. The method first parses the Web page into a DOM tree, then prunes the tree using visual features and tag-filtering rules. Next, tag path features are used to roughly separate the page's main text from noise content. Finally, the line-block distribution function is applied to extract the main text. Experimental results show that the method effectively prevents both the mistaken deletion of main text and the missed deletion of noise content, making the extracted text more accurate: precision reaches 91%, recall 95%, and F-score 93%. The algorithm's accuracy on Web pages that contain many short texts still needs improvement.
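The final step named in the abstract, the line-block distribution function, can be illustrated with a minimal sketch. The paper's exact formulation is not given on this page, so everything below is an assumption: the function names `strip_tags`, `block_lengths`, and `extract_main_text`, the parameters `k`, `threshold`, and `min_line`, and the edge-trimming rule are all hypothetical. The DOM pruning and tag-path steps are omitted; tags are simply removed with regular expressions.

```python
import re


def strip_tags(html: str) -> list[str]:
    """Drop scripts, styles, comments, and tags; return one text line per source line."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    text = re.sub(r"(?s)<[^>]+>", "", html)
    # Collapse runs of whitespace inside each line.
    return [" ".join(line.split()) for line in text.split("\n")]


def block_lengths(lines: list[str], k: int = 3) -> list[int]:
    """Line-block distribution: character count of the k lines starting at each index."""
    return [sum(len(s) for s in lines[i:i + k]) for i in range(len(lines) - k + 1)]


def extract_main_text(html: str, k: int = 3, threshold: int = 100, min_line: int = 20) -> str:
    """Return the densest contiguous run of line blocks (assumed heuristic, not the paper's)."""
    lines = strip_tags(html)
    dens = block_lengths(lines, k)
    best = (0, 0, 0)  # (total characters, start line, end line exclusive)
    i = 0
    while i < len(dens):
        if dens[i] < threshold:
            i += 1
            continue
        j = i
        while j < len(dens) and dens[j] >= threshold:
            j += 1
        end = min(j - 1 + k, len(lines))  # last block in the run covers k lines
        total = sum(len(s) for s in lines[i:end])
        if total > best[0]:
            best = (total, i, end)
        i = j
    _, start, end = best
    # Assumed refinement: overlapping blocks pull in short edge lines, so trim them.
    while start < end and len(lines[start]) < min_line:
        start += 1
    while end > start and len(lines[end - 1]) < min_line:
        end -= 1
    return "\n".join(s for s in lines[start:end] if s)
```

Because the sketch keys on text length per line, a page made up mostly of short lines would fall below `threshold` everywhere, which is consistent with the limitation on short-text pages that the abstract reports.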

Keywords: DOM tree; visual features; tag path features; line-block distribution function

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
