Automated Web Page Content Extraction Model Based on Web Page Image Classification (Cited by: 1)

Web content extraction model based on automated image classification

Authors: QIN Long (秦龙); LI Xiao-ge (李晓戈); MU Zheng-hui (穆诤辉) [1,2]; LI Tao (李涛) (School of Computer, Xi'an University of Posts and Telecommunications, Xi'an 710121, China; Key Laboratory of Shaanxi Province Network Data Analysis and Intelligent Processing, Xi'an University of Posts and Telecommunications, Xi'an 710121, China)

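Affiliations: [1] School of Computer, Xi'an University of Posts and Telecommunications, Xi'an 710121, Shaanxi, China; [2] Key Laboratory of Shaanxi Province Network Data Analysis and Intelligent Processing, Xi'an University of Posts and Telecommunications, Xi'an 710121, Shaanxi, China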

Source: Computer Engineering and Design (《计算机工程与设计》), 2023, Issue 2, pp. 386-392 (7 pages)

Funding: National Key Research and Development Program of China (2018YFB1402905); Key Research and Development Program of Shaanxi Province (2020GY-227).

Abstract: An automated Web page content extraction model based on Web page image classification (I-AWCE) is proposed using convolutional neural networks. By analyzing existing Web page types and the position and structural characteristics of the main content within a page, Web pages are divided into two types: article pages and list pages. Each Web page is rendered as a screenshot image and classified by the convolutional neural network model; according to the classification result, one of two content extraction methods based on multi-feature fusion is applied, one for each page type. Experimental results show that the Web page image dataset is classified best by LeNet-5 and the pre-trained model. Compared with the Boilerpipe extraction model on the same dataset, the proposed image-classification-based model achieves higher accuracy and can meet the practical needs of automated Web page content extraction.

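Keywords: image classification; Web page content extraction; convolutional neural network; residual network; pre-trained model; standard deviation; text length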

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
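
The two-stage idea in the abstract (classify the page screenshot, then route the page to a type-specific extractor) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the features are fused, so the text-length and standard-deviation heuristics below are guesses prompted by the keywords, and every name (`Block`, `extract_article`, `extract_list`, `extract_content`, the `classify` callable standing in for the CNN) is hypothetical.

```python
from statistics import pstdev
from typing import Callable, List, Union

# Hypothetical representation: one candidate container on the page,
# given as the text of each of its child nodes.
Block = List[str]


def extract_article(blocks: List[Block]) -> str:
    """Assumed heuristic: the article body is the container with the most text."""
    best = max(blocks, key=lambda b: sum(len(t) for t in b))
    return "\n".join(best)


def extract_list(blocks: List[Block]) -> List[str]:
    """Assumed heuristic: list entries are siblings of similar length, so pick
    the container (3+ children) with the lowest relative length spread."""
    candidates = [b for b in blocks if len(b) >= 3]
    if not candidates:
        return []

    def spread(block: Block) -> float:
        lengths = [len(t) for t in block]
        mean = sum(lengths) / len(lengths)
        # Standard deviation relative to the mean; empty text ranks last.
        return pstdev(lengths) / mean if mean else float("inf")

    return min(candidates, key=spread)


def extract_content(
    screenshot: object,
    blocks: List[Block],
    classify: Callable[[object], str],
) -> Union[str, List[str]]:
    """Route the page to an extractor based on the screenshot's predicted type.

    `classify` is a stand-in for the image classifier and is assumed to
    return either "article" or "list".
    """
    page_type = classify(screenshot)
    if page_type == "article":
        return extract_article(blocks)
    if page_type == "list":
        return extract_list(blocks)
    raise ValueError(f"unknown page type: {page_type!r}")
```

In this sketch the classifier is passed in as a callable so that any image model (LeNet-5, a pre-trained network, or a test stub) can be swapped in without touching the extraction logic.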
