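Authors: Huang Zhi (黄志); Zhan Liqun (詹利群); Ren Xiaowei (任晓炜); Li Tao (李涛). Guangxi Meteorological Information Center, Nanning 530022
Affiliation: [1] Guangxi Meteorological Information Center (广西区气象信息中心)
Source: Meteorological Science and Technology (《气象科技》), 2019, No. 5, pp. 768-772, 871 (6 pages)
Funding: Supported by the National Archives Administration of China (国家档案局) project 2016-X-06, "Construction of the Guangxi Meteorological Digital Archive Based on Hadoop Big Data Processing"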
Abstract: Under the Hadoop distributed computing and storage architecture, custom ETL data-cleaning rules are used to merge massive numbers of hourly single-station files from automatic weather stations into large files, grouped by year and station number, and to transfer them to HDFS for storage. The Spark SQL parallel computing framework is then used to compute daily statistics for common meteorological elements. The results show that data processing and retrieval are significantly faster than with a relational database. With Spark SQL, statistical queries over multiple meteorological elements, multiple stations, and long time series all return within seconds, and the advantage grows as the number of stations and the time span increase. The approach supports this kind of meteorological data service more efficiently and offers a new way to move massive meteorological data processing from relational databases to a distributed big-data architecture.
Keywords: Hadoop; HDFS; Spark SQL; ETL
Classification: P409 [Astronomy and Earth Sciences: Atmospheric Science and Meteorology]
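The two steps named in the abstract, merging hourly single-station records by year and station number and then deriving daily statistics, can be illustrated with a small sketch. This is not the authors' implementation: it uses plain Python on in-memory data rather than HDFS and Spark SQL, and the station IDs, record layout, temperature values, and function names are all hypothetical, chosen only to show the grouping logic.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hourly records: (station_id, "YYYY-MM-DD HH", temperature in deg C).
# In the paper's setting each record would come from a small hourly single-station file.
HOURLY = [
    ("S001", "2018-07-01 00", 26.0),
    ("S001", "2018-07-01 01", 28.0),
    ("S001", "2018-07-02 00", 30.0),
    ("S002", "2018-07-01 00", 24.0),
]


def merge_by_year_station(records):
    """Group records into one bucket per (year, station), mimicking the
    merge of many small hourly files into one large file per year and station."""
    merged = defaultdict(list)
    for station_id, timestamp, value in records:
        year = timestamp[:4]
        merged[(year, station_id)].append((timestamp, value))
    return dict(merged)


def daily_stats(records):
    """Compute daily mean, max, and min per station (a stand-in for the
    GROUP BY aggregation that a SQL engine would run in parallel)."""
    by_day = defaultdict(list)
    for station_id, timestamp, value in records:
        day = timestamp[:10]
        by_day[(station_id, day)].append(value)
    return {
        key: {"mean": mean(values), "max": max(values), "min": min(values)}
        for key, values in by_day.items()
    }


if __name__ == "__main__":
    for key, rows in merge_by_year_station(HOURLY).items():
        print(key, len(rows), "hourly records")
    for key, stats in daily_stats(HOURLY).items():
        print(key, stats)
```

In a distributed setting the same grouping keys, (year, station) for storage and (station, day) for statistics, would presumably drive the file layout and the aggregation query; the sketch only shows the logic on a single machine.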