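Research on Building a Big Data Platform for Sci-Tech Literature Based on Distributed Technology (基于分布式技术的科技文献大数据平台的建设研究)  Cited by: 11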

Big Data Platform for Sci-Tech Literature Based on Distributed Technology

Authors: Chang Zhijun (常志军); Qian Li (钱力)[1,2]; Xie Jing (谢靖); Wu Zhenxin (吴振新)[1,2]; Zhang Hu (张鹄); Yu Qianqian (于倩倩)[1]; Wang Ying (王颖); Wang Yongji (王永吉)

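Affiliations: [1] National Science Library, Chinese Academy of Sciences, Beijing 100190, China; [2] Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China; [3] Institute of Software, Chinese Academy of Sciences, Beijing 100190, China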

Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), 2021, Issue 3, pp. 69-77 (9 pages)

Abstract: [Objective] To address the storage and online access of massive article-level literature, large-scale data governance, and low service performance, this study builds a big data platform for sci-tech literature. [Methods] Building on distributed technology, we analyzed the characteristics and service orientation of sci-tech big data, took into account the available hardware resources such as servers and networks, adopted a co-tenant deployment strategy, and designed a big data platform for sci-tech literature with a "5+2" overall architecture. [Results] We built a PB-level big data platform for sci-tech literature. It stores 200 TB of data and holds 320 million literature entities and 6 billion entity relationships. MapReduce-based metadata processing performance increased by 3 times, and a microservice-based knowledge service architecture was established. [Limitations] The platform does not include a complete stream-processing workflow, so it cannot respond immediately to incremental data. [Conclusions] The platform already supports product lines of the National Science Library, Chinese Academy of Sciences, including its Knowledge Discovery Platform and the Huikeyan (慧科研) intelligent scientific research system. It delivers good online service results and improves the capacity to process, compute on, and serve sci-tech literature data.

Keywords: big data technology; distributed storage; distributed computing; co-tenant deployment; data warehouse

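Classification No.: TP311 [Automation and Computer Technology: Computer Software and Theory]; G250 [Automation and Computer Technology: Computer Science and Technology]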

 
