ESDC: An open Earth science data corpus to support geoscientific literature information extraction

Authors: Hao LI, Peng YUE, Deodato TAPETE, Francesca CIGNA, Qiuju WU, Longgang XIANG, Binbin LU

Affiliations: [1] School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China; [2] Italian Space Agency, Rome 00133, Italy; [3] National Research Council, Institute of Atmospheric Sciences and Climate, Rome 00133, Italy; [4] School of Geography and Environmental Sciences, Zhejiang Normal University, Jinhua 321004, China; [5] State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China

Source: Science China Earth Sciences (中国科学（地球科学英文版）), 2024, No. 12, pp. 3840-3854 (15 pages)

Funding: Supported by the National Natural Science Foundation of China (Grant No. 42090011).

Abstract: Over the past ten years, large amounts of original research data related to Earth system science have been made available at a rapidly increasing rate. This growing stock of data helps researchers understand the human-Earth system across different fields. A substantial amount of this data is published by geoscientists as open access in authoritative journals. If the information stored in this literature is properly extracted, there is significant potential to build a domain knowledge base. However, this potential remains largely unfulfilled in geoscience, and one of the biggest obstacles is the lack of publicly available corpora and baselines. To fill this gap, the Earth Science Data Corpus (ESDC), an academic text corpus of 600 abstracts, was built from the international journal Earth System Science Data (ESSD). To the best of our knowledge, ESDC is the first corpus with the detail needed to provide a professional training dataset for knowledge extraction and for the construction of domain-specific knowledge graphs from massive amounts of literature. The production process of ESDC incorporates both the contextual features of spatiotemporal entities and the linguistic characteristics of academic literature. Furthermore, annotation guidelines and procedures tailored to Earth science data are formulated to ensure reliability. ChatGPT with zero- and few-shot prompting, the generative BARTNER model, and the discriminative W2NER model were trained on ESDC to evaluate performance on the named entity recognition task; they showed increasing performance metrics, with the highest achieved by BARTNER. Performance metrics for the various entity types output by each model were also assessed. We used the trained BARTNER model to perform inference on a larger unlabeled literature corpus, aiming to automatically extract a broader and richer set of entity information. Subsequently, the extracted entity information was mapped to and associated with the Earth science data knowledge graph. Around this knowledge graph, this paper validates multiple downstream applications...

Keywords: Earth science data; Corpus; Information extraction; Knowledge graph; Scientometric research

Classification codes: G353.1 [Cultural Sciences: Information Science]; P3 [Astronomy and Earth Sciences: Geophysics]; P208
