Distributed Crawling Technology Based on the Microblog API (基于微博API的分布式抓取技术) · Cited by: 7

A Distributed Data-Crawling Technology for Microblog API

Authors: Chen Shunhua (陈舜华)[1], Wang Xiaotong (王晓彤)[1], Hao Zhifeng (郝志峰)[1], Cai Ruichu (蔡瑞初)[1], Xiao Xiaojun (肖晓军), Lu Yu (卢宇)

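Affiliations: [1] School of Computers, Guangdong University of Technology (广东工业大学计算机学院), Guangzhou 510006; [2] Guangzhou Youyi Information Technology Co., Ltd. (广州优亿信息科技有限公司), Guangzhou 510630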

Source: Telecommunications Science (《电信科学》), 2013, No. 8, pp. 146-150, 155 (6 pages)

Abstract: As the number of microblog users grows rapidly, more and more people want to mine interesting patterns from user behavior and microblog content. To address the problem of collecting microblog data effectively and reasonably, a distributed crawling technique based on the microblog API is proposed. It simulates microblog login to obtain authorization automatically, keeps the API call frequency within limits, and uses a task-assignment controller to fetch microblog data efficiently. The technique also combines time triggers with an in-memory database to control duplication, which avoids repeated crawling and repeated storage of data and improves system performance. Crawling tasks can be assigned to distributed clients independently, giving the framework high scalability, clear task assignment, high efficiency, and multiple crawling strategies suited to different crawling needs. A crawling example on Sina Weibo data verifies the feasibility of the technique.
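
The three mechanisms named in the abstract (call-frequency control, duplicate control backed by an in-memory store, and task assignment to distributed clients) can be illustrated with a minimal sketch. This record does not include the paper's implementation, so everything below is an assumption made for illustration: the class and function names (`RateLimiter`, `SeenCache`, `assign_tasks`) are hypothetical, a plain Python dict stands in for the in-memory database, the "time trigger" is interpreted as time-based expiry of seen entries, and round-robin partitioning is only one possible assignment policy.

```python
import time
from collections import deque


class RateLimiter:
    """Sliding-window limiter: at most `max_calls` calls per `window` seconds.

    Hypothetical stand-in for the paper's API call-frequency control.
    """

    def __init__(self, max_calls, window, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window
        self.clock = clock          # injectable so the logic can be tested
        self.calls = deque()        # timestamps of recent calls

    def try_acquire(self):
        now = self.clock()
        # Drop timestamps that have fallen out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False


class SeenCache:
    """In-memory record of crawled IDs; an entry expires after `ttl` seconds.

    Assumption: a dict replaces the in-memory database, and expiry models
    the time trigger that allows an item to be crawled again later.
    """

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.seen = {}              # item_id -> time it was last crawled

    def add_if_new(self, item_id):
        now = self.clock()
        last = self.seen.get(item_id)
        if last is not None and now - last < self.ttl:
            return False            # still fresh: skip to avoid a repeat
        self.seen[item_id] = now
        return True


def assign_tasks(user_ids, n_workers):
    """Round-robin partition of user IDs across crawler clients."""
    return [user_ids[i::n_workers] for i in range(n_workers)]
```

In a crawler built along these lines, a client would call `try_acquire()` before each API request and `add_if_new()` before storing a result, while a controller would hand each client one slice from `assign_tasks()`.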

Keywords: Sina Weibo; crawling strategy; distributed crawling; microblog API

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]
