Google Translation Crawler Based on Gecko Browser Kernel (基于Gecko浏览器内核的谷歌翻译爬虫)

Author: LI Jian [1] (PLA Strategic Support Force Information Engineering University, Luoyang 471003)

Source: Modern Computer (《现代计算机》), 2021, No. 18, pp. 32-37 (6 pages)

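Funding: Major Program of the National Natural Science Foundation of China: Research on the acquisition, annotation, and analysis of multilingual speech data (No. 11590771).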

Abstract: Asynchronous loading is widely used on the Web, which complicates web crawler development. This paper proposes an asynchronous data acquisition method based on the Gecko browser kernel. The method simulates a browser loading a web page, completes the user input, triggers script execution, and finally obtains the target data. Applying this method, a dedicated crawler for Google Translate is designed and implemented; it can generate bilingual parallel corpora in batches, and a polling detection mechanism is adopted to further improve its efficiency. The experimental results show that the proposed solution is effective: simulating user operations is the basis for implementing the crawler, and detecting the target data is the key to improving efficiency.
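The polling detection idea mentioned in the abstract can be illustrated with a minimal sketch. This record does not show the paper's implementation, so the code below is only a generic illustration under stated assumptions: `poll_until` and `FakePage` are hypothetical names, and `FakePage` merely simulates a page whose result field is filled in by a script after some delay. It does not use the Gecko kernel or contact Google Translate.

```python
import time
from typing import Callable, Optional


def poll_until(read_target: Callable[[], Optional[str]],
               timeout: float = 5.0,
               interval: float = 0.05) -> str:
    """Call read_target repeatedly until it returns a non-empty value.

    Raises TimeoutError if nothing appears before the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = read_target()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("target data did not appear in time")
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a page whose result loads asynchronously."""

    def __init__(self, result: str, delay: float) -> None:
        self._result = result
        self._ready_at = time.monotonic() + delay

    def read_result(self) -> Optional[str]:
        # Returns None until the simulated script has finished.
        if time.monotonic() >= self._ready_at:
            return self._result
        return None


if __name__ == "__main__":
    page = FakePage("bonjour", delay=0.2)
    print(poll_until(page.read_result))
```

Compared with sleeping for a fixed worst-case duration after each input, polling returns as soon as the target data is detected, which is presumably why detection is described as the key to efficiency.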

Keywords: web crawler; asynchronous loading; browser kernel; Google Translate

Classification: TP393.092 [Automation and Computer Technology: Computer Application Technology]
