Authors: LIU Yao (刘耀) [1], LIU Ru (刘茹), ZHAI Yu (翟雨)
Affiliations: [1] Information Technology Support Center, Institute of Scientific and Technical Information of China, Beijing 100038, China; [2] School of Software and Microelectronics, Peking University, Beijing 102600, China
Source: Journal of Computer Applications (《计算机应用》), 2023, No. 6, pp. 1779-1784 (6 pages)
Funding: National Social Science Fund of China (21BTQ011); National Key Research and Development Program of China (2018YFB143502).
Abstract: Frequent webpage redesigns change webpage source code, in particular the element structures or attribute identifiers of target entities such as article dates, body text, or source organizations. These changes cause Web crawler code to fail and make manual maintenance costly. To address these problems, a self-adaptive Web crawler code generation method based on webpage source code structure comprehension was proposed. Firstly, the corresponding Web crawler code was extracted by analyzing the change patterns of webpage structural features. Secondly, an Encoder-Decoder model was used to represent the changes in the webpage source code and in the code; by fusing the structural semantic features of the webpage source code, the features of webpage source code changes, and the features of webpage code changes, an adaptive code generation model was obtained. Finally, the perception, generation, and activation mechanisms of the adaptive system were refined to form a Web crawler system with adaptive processing capability. In experiments, the proposed adaptive code generation model reached a final accuracy of 78.5% and, compared with the TF-IDF+Seq2Seq and TriDNR+Seq2Seq generation models, showed an advantage in representing webpage source code changes and in the effectiveness of code generation. The proposed method can therefore resolve the Web crawler code failures caused by webpage source code changes, and it offers a new approach to adaptive processing in Web resource acquisition, that is, in Web crawler technology.
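Keywords: resource acquisition; webpage redesign; HyperText Markup Language (HTML); webpage source code comprehension; self-adaptive Web crawler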
Classification No.: TP391.1 [Automation and Computer Technology: Computer Application Technology]