Query and Statistical Analysis of Mass Automatic Station Data Based on SparkSQL in Hadoop Environment  Cited by: 12


Authors: Huang Zhi; Zhan Liqun; Ren Xiaowei; Li Tao (Guangxi Meteorological Information Center, Nanning 530022)

Affiliation: [1] Guangxi Meteorological Information Center

Source: Meteorological Science and Technology (《气象科技》), 2019, No. 5, pp. 768-772, 871 (6 pages in total)

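Funding: Supported by the National Archives Administration of China project (2016-X-06), "Construction of the Guangxi Meteorological Digital Archive Based on Hadoop Big Data Processing"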

Abstract: Within the Hadoop distributed computing and storage architecture, custom ETL data-cleaning rules are used to merge massive numbers of hourly single-station automatic station files into large files by year and station number, which are then transferred to HDFS for storage. The SparkSQL parallel computing framework is applied to these files to produce daily statistical values of common meteorological elements. The results show that data processing and retrieval are significantly faster than with a relational database approach. Statistical queries over multiple meteorological elements, multiple stations, and long time series all achieve second-level response times, and the advantage grows as the number of stations and the time span increase. The approach supports this kind of meteorological data service more efficiently and offers a new way to move massive meteorological data processing from relational databases to a distributed big data architecture.
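The two steps named in the abstract (merging hourly records by year and station, then deriving daily statistics) can be illustrated with a minimal sketch. It is written in plain Python as a stand-in and does not run on Spark or HDFS. The record layout, the station identifiers S0001 and S0002, the temperature field, and all function names are assumptions made for illustration; the abstract does not describe the authors' file format or cleaning rules.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical hourly records: (station id, observation time, air temperature in deg C).
# The layout and values are assumptions; the abstract does not give the file format.
HOURLY = [
    ("S0001", "2018-07-01 00:00", 26.0),
    ("S0001", "2018-07-01 12:00", 32.0),
    ("S0001", "2018-07-02 00:00", 25.0),
    ("S0001", "2019-01-01 00:00", 12.0),
    ("S0002", "2018-07-01 06:00", 27.5),
]


def merge_key(station_id, obs_time):
    """Return the (year, station) key under which a record would be merged."""
    year = datetime.strptime(obs_time, "%Y-%m-%d %H:%M").year
    return (year, station_id)


def merge_by_year_and_station(records):
    """Group many small hourly records into one bucket per (year, station)."""
    merged = defaultdict(list)
    for station_id, obs_time, temp in records:
        merged[merge_key(station_id, obs_time)].append((obs_time, temp))
    return dict(merged)


def daily_stats(records):
    """Compute daily mean, max and min temperature for each station."""
    by_day = defaultdict(list)
    for station_id, obs_time, temp in records:
        day = obs_time[:10]  # "YYYY-MM-DD" prefix of the timestamp
        by_day[(station_id, day)].append(temp)
    return {
        key: {"mean": mean(vals), "max": max(vals), "min": min(vals)}
        for key, vals in sorted(by_day.items())
    }


merged = merge_by_year_and_station(HOURLY)
stats = daily_stats(HOURLY)
```

On a Spark cluster, the daily aggregation would more plausibly be expressed as a SQL query that groups by station and date and applies AVG, MAX and MIN, with the grouping keys above serving as the merged file layout in HDFS; the table and column names for such a query would likewise be assumptions.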

Keywords: Hadoop; HDFS; SparkSQL; ETL

Classification number: P409 [Astronomy and Earth Sciences: Atmospheric Science and Meteorology]

 
