Authors: Chen Shunhua[1], Wang Xiaotong[1], Hao Zhifeng[1], Cai Ruichu[1], Xiao Xiaojun, Lu Yu
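Affiliations: [1] School of Computer Science, Guangdong University of Technology, Guangzhou 510006; [2] Guangzhou Youyi Information Technology Co., Ltd. (广州优亿信息科技有限公司), Guangzhou 510630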
Source: Telecommunications Science (《电信科学》), 2013, No. 8, pp. 146-150, 155 (6 pages)
Abstract: With the rapid growth in microblog users, more and more people want to mine interesting patterns from user behavior and microblog content. Collecting this data from the service provider effectively and within reasonable limits is one of the main challenges. To address it, a distributed crawling technique based on the microblog API is proposed. The technique simulates microblog login to obtain authorization automatically, controls how often the API is called, and uses a task allocation controller to acquire microblog data efficiently. It also combines time triggering with an in-memory database to control duplication, which avoids crawling and storing the same data twice and improves system performance. Crawling tasks can be assigned independently to distributed clients, so the technique offers high scalability, clear task allocation, high efficiency, and multiple crawling strategies for different crawling needs. A Sina Weibo crawling example verifies the feasibility of the technique.
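The abstract names two mechanisms, API call-frequency control and in-memory duplicate control, without giving implementation details in this record. The following is only a minimal sketch of how such mechanisms might look on a single crawl client; it is not the authors' implementation. All names (`RateLimiter`, `SeenStore`, `crawl`, the `fetch` callback, and the `"id"` field of a post) are hypothetical, and a plain Python dictionary stands in for the in-memory database.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` calls in any window of `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.calls = deque()  # timestamps of recent calls, oldest first

    def try_acquire(self):
        now = self.clock()
        # Drop timestamps that have left the sliding window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False


class SeenStore:
    """In-memory duplicate filter whose entries expire after `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.seen = {}  # post id -> time it was first seen

    def add_if_new(self, post_id):
        now = self.clock()
        # Time-based cleanup: forget ids older than the TTL.
        expired = [k for k, t in self.seen.items() if now - t >= self.ttl]
        for k in expired:
            del self.seen[k]
        if post_id in self.seen:
            return False
        self.seen[post_id] = now
        return True


def crawl(task_ids, fetch, limiter, store):
    """Run tasks while quota remains; return (new posts, deferred tasks)."""
    stored, deferred = [], []
    for task in task_ids:
        if not limiter.try_acquire():
            deferred.append(task)  # out of quota: leave for a later round
            continue
        for post in fetch(task):  # `fetch` is an assumed API wrapper
            if store.add_if_new(post["id"]):
                stored.append(post)
    return stored, deferred
```

In a distributed setting, a task controller could hand each client its own task list and collect the deferred tasks for reassignment; that coordination is outside this sketch.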
Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]