Fast Storage and Indexing Method of Big Data in Forest Ecological Station  Cited by: 4

Authors: WANG Xinyang; JIA Xiangyu; CHEN Zhibo [1,2]; CUI Xiaohui; XU Fu [1,2]

Affiliations: [1] College of Information Science and Technology, Beijing Forestry University, Beijing 100083, China; [2] Engineering Research Center for Forestry-oriented Intelligent Information Processing, National Forestry and Grassland Administration, Beijing 100083, China

Source: Transactions of the Chinese Society for Agricultural Machinery, 2021, No. 8, pp. 195-204, 212 (11 pages)

Funding: Fundamental Research Funds for the Central Universities (BLX201923); National Natural Science Foundation of China (32071775).

Abstract: Forest ecological stations produce large volumes of unstructured data (images, video, GIS data) and structured data (ecological indicators), which suffer from low storage efficiency and poor retrieval performance. To address this, a big data storage framework for forest ecological stations was proposed based on Hadoop and HBase. Based on the proposed framework, the business process for storing forest ecological data was given, and the core technologies of the forest ecological big data platform were optimized: (1) a pre-partitioning algorithm was designed to ensure that data is evenly distributed across the cluster; (2) the RowKey was designed according to the characteristics of ecological data to achieve fast retrieval; (3) because native HBase does not support multi-condition queries, an ElasticSearch index shard placement strategy was designed based on index data and server performance evaluation, and ElasticSearch-based secondary (non-primary-key) indexing was used to optimize multi-condition retrieval from the HBase ecological database; (4) to address the difficulty of storing massive numbers of small images, a pack-and-merge strategy was proposed based on the station and time correlation of the data; (5) GIS data was parsed so that it can be stored efficiently. These designs were verified through experiments. The ElasticSearch index shard placement strategy reduced query time by an average of 20 ms compared with the default shard strategy, and by an average of 20 ms compared with a strategy based on changing the ElasticSearch scoring policy. With 1×10^8 structured records, the retrieval time of the system was 1.045 s, which was 3.99 times faster than native HBase retrieval. With 1×10^7 unstructured records, the efficiency of the small-image packing strategy based on station and time correlation was 1.15 times that of SequenceFile-based merging and 1.79 times that of native HBase. With 1×10^4 concurrent users, the number of queries per second after optimization was 1.88 times the original, throughput per second was 1.74 times that before optimization, and system response time was 69.5% lower than before optimization. The results indicate that the proposed scheme clearly improves cluster load balancing, retrieval efficiency for massive structured and unstructured data, and system throughput, providing the necessary theoretical basis and technical implementation for the storage and management of forest ecological data.
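The abstract does not give the paper's actual RowKey layout or pre-partitioning algorithm. As a rough illustration of the general idea only, the sketch below shows one common way to combine the two in HBase-style stores: a hash-derived salt prefix spreads stations across pre-split regions, and a reversed timestamp makes the newest readings for a station sort first. Every name here (`salt_for`, `make_rowkey`, `split_points`, `region_for`, the `|` separator, the 8-region count, the field order) is a hypothetical choice for illustration, not taken from the paper.

```python
import bisect
import hashlib

NUM_REGIONS = 8                 # assumed number of pre-split regions
MAX_TS_MS = 9_999_999_999_999   # largest 13-digit millisecond timestamp


def salt_for(station_id: str, num_regions: int = NUM_REGIONS) -> int:
    """Map a station ID to a stable bucket in [0, num_regions)."""
    digest = hashlib.md5(station_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_regions


def make_rowkey(station_id: str, indicator: str, ts_ms: int,
                num_regions: int = NUM_REGIONS) -> str:
    """Build 'salt|station|indicator|reversed_timestamp' (hypothetical layout)."""
    salt = salt_for(station_id, num_regions)
    reversed_ts = MAX_TS_MS - ts_ms  # newer readings get smaller values
    return f"{salt:02d}|{station_id}|{indicator}|{reversed_ts:013d}"


def split_points(num_regions: int = NUM_REGIONS) -> list[str]:
    """Region boundaries '01', '02', ... matching the two-digit salt prefix."""
    return [f"{i:02d}" for i in range(1, num_regions)]


def region_for(rowkey: str, splits: list[str]) -> int:
    """Index of the region whose key range contains rowkey."""
    return bisect.bisect_right(splits, rowkey)


if __name__ == "__main__":
    key = make_rowkey("BJ001", "air_temp", 1_600_000_000_000)
    print(key, "-> region", region_for(key, split_points()))
```

In this sketch, all rows for one station land in the same region, so a per-station time-range query stays a single prefix scan, while different stations are spread over the pre-split regions. Multi-condition queries (for example, by indicator value) would still need a secondary index such as the ElasticSearch-based one the abstract describes.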

Keywords: forest ecology; big data; fast storage; data indexing; distributed platform

Classification: TP392 [Automation and Computer Technology - Computer Application Technology]

 
