Self-adaptive Web crawler code generation method based on webpage source code structure comprehension

Cited by: 3

Authors: LIU Yao [1]; LIU Ru; ZHAI Yu

Affiliations: [1] Information Technology Support Center, Institute of Scientific and Technical Information of China, Beijing 100038, China; [2] School of Software and Microelectronics, Peking University, Beijing 102600, China

Source: Journal of Computer Applications, 2023, Issue 6, pp. 1779-1784 (6 pages)

Funding: National Social Science Fund of China (21BTQ011); National Key Research and Development Program of China (2018YFB143502).

Abstract: Frequent webpage redesigns change webpage source code, in particular the element structures or attribute identifiers of target entities such as article dates, main text, and source organizations. These changes cause Web crawler code to fail and drive up manual maintenance costs. To address these problems, a self-adaptive Web crawler code generation method based on webpage source code structure comprehension is proposed. First, the corresponding Web crawler code is extracted by analyzing the change patterns of webpage structural features. Second, an Encoder-Decoder model is used to represent the changes in the webpage source code and in the code; fusing the structural-semantic features of the webpage source code itself, the features of webpage source code changes, and the features of webpage code changes yields the adaptive code generation model. Finally, the perception, generation, and activation mechanisms of the adaptive system are refined to form a Web crawler system with adaptive processing capability. In experiments, the proposed adaptive code generation model reached a final accuracy of 78.5%, and compared with the TF-IDF+Seq2Seq and TriDNR+Seq2Seq generation models it showed some advantage in representing webpage source code changes and in the effectiveness of code generation. The proposed method can therefore resolve the crawler code execution problems caused by webpage source code changes, and it offers a new approach to adaptive processing in Web resource acquisition, that is, in Web crawler technology.
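The failure mode described in the abstract, and the kind of check a perception step might run, can be illustrated with a minimal sketch. This is not the authors' implementation, and it does not reproduce the Encoder-Decoder generation model; it only shows how an extraction rule keyed to an attribute identifier stops matching after a redesign. The class names `pub-date` and `article-time`, the sample pages, and the helpers `extract` and `rule_has_failed` are all hypothetical.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of the first element whose class equals target_class.

    Assumption: the sample markup has no unclosed void tags (e.g. <br>)
    inside the target element, so simple depth counting is sufficient.
    """

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self._depth = 0      # > 0 while inside the matched element
        self._done = False   # True once the first match has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if self._depth:
            self._depth += 1
        elif dict(attrs).get("class") == self.target_class:
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self._done = True

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def extract(html, target_class):
    """Return the matched element's text, or None if the rule matches nothing."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    return text or None


def rule_has_failed(html, target_class):
    """Hypothetical perception check: the rule failed if it extracts nothing."""
    return extract(html, target_class) is None


# Hypothetical pages before and after a redesign: the date element changes
# both its tag and its class attribute, while the body element is unchanged.
OLD_PAGE = '<div><span class="pub-date">2023-06-10</span><p class="content">Body</p></div>'
NEW_PAGE = '<div><time class="article-time">2023-06-10</time><p class="content">Body</p></div>'

if __name__ == "__main__":
    print(extract(OLD_PAGE, "pub-date"))          # rule written for the old page
    print(rule_has_failed(NEW_PAGE, "pub-date"))  # same rule on the new page
    print(extract(NEW_PAGE, "article-time"))      # a regenerated rule
```

In a system of the kind the abstract describes, a failed check like this would be the point at which code generation is triggered and the regenerated rule is activated; here the replacement class name is simply supplied by hand.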

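Keywords: resource acquisition; webpage redesign; HyperText Markup Language (HTML); webpage source code comprehension; self-adaptation; Web crawler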

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
