Research on Data Crawling and Storage System of Agricultural Product Price Based on Hadoop Platform (基于Hadoop平台的农产品价格数据爬取和存储系统的研究)

Cited by: 4


Authors: Yang Xiaodong (杨晓东)[1], Gao Lutao (郜鲁涛)[1], Yang Linnan (杨林楠)[1], Liu Jianyang (刘建阳)

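Affiliations: [1] College of Basic Science and Information Engineering, Yunnan Agricultural University, Kunming 650201, Yunnan, China; [2] Yunnan Information Technology Development Center, Kunming 650228, Yunnan, China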

Source: Computer Applications and Software (《计算机应用与软件》), 2017, No. 3, pp. 76-80 (5 pages)

Funding: National Science and Technology Support Program project of the 12th Five-Year Plan (2014BAD10B03)

Abstract: Many large agricultural produce markets and agricultural information commerce platforms now publish daily price data for different agricultural products across regions in real time. Because the data are updated quickly, large in volume, and varied in form, crawling, storing, and subsequently analyzing them is difficult. To address this, the paper proposes a Hadoop-based system for crawling and storing agricultural product prices. The system crawls pages with multiple threads, using the HttpClient framework combined with a thread pool. When a crawl finishes, an integrity check filters out web pages whose information is incomplete, and those pages are crawled again until the information is complete. The crawled pages are parsed and cleaned with regular expressions to extract the useful data, which is stored in HDFS (Hadoop Distributed File System) as text files; data crawled afterwards is appended to the HDFS files. Experiments show that HDFS write performance keeps up with the steadily growing volume of crawled data, and that write performance improves with fewer replicas and larger data blocks.
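The pipeline described in the abstract (thread-pool crawling, integrity check, re-crawl, regular-expression parsing, append-only storage) can be sketched roughly as follows. This is a minimal illustration in Python, not the authors' Java/HttpClient implementation. The row format in `ROW_RE`, the `END_MARK` completeness marker, the `max_rounds` retry limit, and all function names are assumptions made for the example. A local text file opened in append mode stands in for the HDFS file, and the `fetch` callable is supplied by the caller.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical row layout: <tr><td>product</td><td>market</td><td>price</td></tr>
ROW_RE = re.compile(r"<tr><td>([^<]+)</td><td>([^<]+)</td><td>([\d.]+)</td></tr>")
END_MARK = "</table>"  # assumed sign that a page was downloaded in full


def is_complete(html):
    """Integrity check: the page is non-empty and contains the closing marker."""
    return bool(html) and END_MARK in html


def crawl(urls, fetch, workers=4, max_rounds=3):
    """Fetch urls with a thread pool and re-crawl incomplete pages.

    fetch(url) must return the page text (or None/"" on failure).
    Retries are capped at max_rounds (an assumption, to avoid looping forever).
    Returns (pages, still_incomplete_urls).
    """
    pages, pending = {}, list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_rounds):
            if not pending:
                break
            results = list(pool.map(fetch, pending))  # map preserves input order
            retry = []
            for url, html in zip(pending, results):
                if is_complete(html):
                    pages[url] = html
                else:
                    retry.append(url)  # incomplete: schedule another crawl
            pending = retry
    return pages, pending


def parse(html):
    """Extract and clean (product, market, price) tuples with a regex."""
    return [(p.strip(), m.strip(), float(v)) for p, m, v in ROW_RE.findall(html)]


def append_records(path, records):
    """Append tab-separated records to a text file (local stand-in for HDFS append)."""
    with open(path, "a", encoding="utf-8") as f:
        for product, market, price in records:
            f.write(f"{product}\t{market}\t{price}\n")
```

In use, `crawl` would be given a real HTTP fetch function, each returned page would go through `parse`, and the results would be passed to `append_records`. Against a real cluster, the last step would be replaced by an HDFS client's append operation.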

Keywords: distributed system; crawler; Hadoop; HDFS; regular expression

Classification: TP393 [Automation and Computer Technology: Computer Application Technology]

 
