Linkage Recognition Between Scientific Data and Scientific Literature Based on the Labeled-LDA Model: Taking the Biomedical Field as an Example  Cited by: 4


Authors: PAN Youneng [1], LV Jingjing, DING Nan [2] (Department of Information Resources Management, School of Public Affairs, Hangzhou 310058, China; Zhejiang University Libraries, Zhejiang University, Hangzhou 310028, China)

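Affiliations: [1] Department of Information Resources Management, School of Public Affairs, Zhejiang University, Hangzhou, Zhejiang 310058; [2] Zhejiang University Library, Hangzhou, Zhejiang 310027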

Source: Information Science (《情报科学》), 2023, No. 9, pp. 138-145, 154 (9 pages in total)

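Funding: Zhejiang Provincial Philosophy and Social Sciences Planning Project "Research on Scientific Data Evaluation Based on Citation Networks" (20NDJC039YB).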

Abstract: [Purpose/significance] In the era of open science, in which everything is interconnected, linking scientific data with scientific literature has become an important measure for promoting open access to, sharing of, and reuse of scientific data. [Method/process] Based on the Labeled-LDA model, supplemented by a rule-based recognition method, this paper constructs a model for recognizing linkages between scientific data and scientific literature. Taking the biomedical field as an example, the model is trained and tested on three linkage cases: standardized citation, non-standardized citation, and no citation. [Result/conclusion] On the standardized-citation test set, the model's recognition rate and F value are about 0.9 and 0.5 respectively, a relatively stable recognition performance. On the non-standardized-citation and no-citation test sets, the recognition rates are 0.465 and 0.5 respectively, which also shows strong portability and application potential. Manual review of the recognition results for the non-standardized-citation and no-citation cases confirms that non-standard data citation does occur in scientific research, and that the academic community needs to work together to promote standardized data citation. [Innovation/limitation] Compared with other studies, the model constructed in this paper provides a methodological reference and basis for semantics-based linkage recognition and can be applied to large-scale corpus research, thereby promoting knowledge discovery of deeper semantic associations.
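
The abstract does not state the recognition rules or how the F value is computed. As a minimal sketch, assuming a hypothetical rule that matches GEO-style series accessions (`GSE` followed by digits) and assuming the F value is the usual F1 score, rule-based recognition of standardized citations and its evaluation could look like the following. The function names, sample texts, and gold labels are illustrative and are not taken from the paper.

```python
import re

# Hypothetical rule: GEO series accessions look like "GSE" followed by digits.
# The paper's actual rules are not given in the abstract.
ACCESSION_RE = re.compile(r"\bGSE\d+\b")


def extract_links(doc_id, text):
    """Return the set of (document, dataset) pairs found by the rule."""
    return {(doc_id, acc) for acc in ACCESSION_RE.findall(text)}


def precision_recall_f1(predicted, gold):
    """Standard precision, recall, and F1 over sets of predicted and gold links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example with made-up documents and made-up gold labels.
docs = {
    "paper-1": "Expression data are available under accession GSE12345.",
    "paper-2": "We reanalysed GSE11111 and GSE22222 from earlier work.",
}
gold = {
    ("paper-1", "GSE12345"),
    ("paper-2", "GSE11111"),
    ("paper-2", "GSE33333"),
}

predicted = set()
for doc_id, text in docs.items():
    predicted |= extract_links(doc_id, text)

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

A rule of this kind can only cover the standardized-citation case; the non-standardized and no-citation cases described in the abstract are where a topic model such as Labeled-LDA would be needed.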

Keywords: scientific data; scientific literature; Labeled-LDA; linkage recognition; data citation

Classification number: G250.2 [Cultural Sciences: Library Science]

 
