Construction of a Scientific Literature AI Data System for the Thematic Scenario: Technical Framework Research and Practice

Authors: CHANG Zhijun; QIAN Li [1,2,3]; WU Yaoting; QU Yunpeng; GONG Yue [1,2]; ZHANG Zhixiong

Affiliations: [1] Documentation and Information Center, National Science Library, Chinese Academy of Sciences, Beijing 100190; [2] Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190; [3] Key Laboratory of New Publishing and Knowledge Services for Scholarly Journals, Beijing 100190

Source: Journal of Library and Information Science in Agriculture, 2024, No. 9, pp. 4-17 (14 pages)

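Funding: National Social Science Fund of China project "Research on the Theoretical System and Construction Methods of the AI4S Scientific Literature Knowledge Base" (24BTQ043); National Social Science Fund of China project "Research on Entity-Relation Recognition Methods for Domain Literature Oriented to Evidence-Based Medicine" (21BTQ106).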

Abstract: [Purpose/Significance] Artificial intelligence is empowering scientific research and has become a major driver of scientific discovery. High-quality data resources for thematic scenarios are the key to training high-performance AI models. Given the complexity of scientific and technological (S&T) literature data and the limitations of using it directly for large-model training, there is an urgent need for a systematic technical framework for data construction that processes, refines, and integrates S&T literature resources and ultimately yields a high-quality training corpus for AI applications. Although a number of studies exist, research on S&T literature AI data systems for thematic scenarios is still lacking.

[Method/Process] This article proposes a "3+5 technical framework" for constructing an AI data system for thematic scenarios. Covering the whole construction process, it defines three levels of data content and five stages of data governance. The three data levels are the multi-type basic database, the multi-modal deconstruction database, and the fine-grained semantic mining knowledge base. The five construction stages are multi-channel data source scanning, multi-type basic data construction, multi-modal deconstruction data construction, fine-grained semantic mining knowledge construction, and multi-scenario data application. Taking big data technology and intelligent mining technology as the key elements of data governance, the article describes the architecture and functions of the data governance tool chain in detail. The core components of the tool chain are a multi-source data aggregation tool, a multi-format data parsing tool, a data cleaning tool, an associated file identification and acquisition tool, a data fusion tool, a multi-modal deconstruction and reorganization tool, and a fine-grained knowledge identification tool. Working together, these tools ensure the efficiency and integrity of the process from raw data to the AI data system.

[Results/Conclusions] To validate the proposed technical framework, this study applied it to the construction of an AI data system for the rice breeding domain. The results show that the framework can process S&T literature data effectively and build a high-quality domain dataset, providing data support for applying AI models in rice breeding research and confirming the framework's effectiveness and practicality.

Keywords: AI data system; multi-modal deconstruction; semantically annotated data; data governance tool chain; data feature vectorization

Classification number: G250.7 [Cultural Sciences: Library Science]

 
